Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

40+ minute V2V using LTX-2.3

ATS is a training-free scheduler that works with any V2V generator supporting two-sided, anchor-bracketed sampling. Here we showcase ≥ 40-minute generations on LTX-2.3 across five conditioning modalities. Further down, we quantify ATS head-to-head against autoregressive baselines on Wan 2.1 + VACE.

Modalities shown: pose, inpainting, outpainting, Canny edge, and depth.

Abstract

Long-horizon video generation suffers from two intertwined problems. The first is drift: video quality degrades over time. The second is continuity failure, which manifests as object-permanence errors or improperly rendered transient content (e.g., an object that appears in non-consecutive frames changes color or style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead target drift directly and introduce Anchored Tree Sampling (ATS): a training-free, inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L + 1 hierarchical tree steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE across five conditioning modalities (inpainting, outpainting, edge, pose, depth). ATS outperforms both baselines in overall quality and in drift prevention. We additionally demonstrate stable ≥ 40-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path to extend ATS to arbitrarily long T2V generation, as well as to the dynamic-camera and multi-shot regimes.

Method

ATS replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A conditioning-only root call emits sparse anchors over the full horizon, optional refinement levels insert intermediate anchors between them, and leaf calls densely fill each interval bracketed by a pair of anchors. Every non-root call is a bidirectional infill bounded by both endpoints, so errors cannot compound across the horizon. Sibling calls within a level are conditionally independent, so the critical path collapses from K sequential chunks to L + 1 hierarchical steps, with L = O(log T). The construction is training-free: any pretrained two-sided V2V generator (e.g. Wan 2.1 + VACE, LTX-2.3) serves as the base model.
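To make the level structure concrete, here is a minimal Python sketch of one way such a schedule could be laid out. It only computes frame indices and never calls a model. The names `Call` and `build_schedule`, the `root_stride` and `leaf_span` parameters, and the midpoint-bisection refinement rule are illustrative assumptions, not details taken from the ATS implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Call:
    """One hypothetical generator call in the tree."""
    level: int                 # 0 = root, last level = leaves
    kind: str                  # "root", "refine", or "leaf"
    left: Optional[int]        # left anchor frame (None for the root)
    right: Optional[int]       # right anchor frame (None for the root)
    targets: Tuple[int, ...]   # frame indices this call generates


def build_schedule(num_frames: int, root_stride: int, leaf_span: int) -> List[List[Call]]:
    """Return levels of calls; calls inside one level are independent."""
    if num_frames < 2 or root_stride < 1 or leaf_span < 1:
        raise ValueError("need num_frames >= 2, root_stride >= 1, leaf_span >= 1")
    last = num_frames - 1

    # Root: sparse anchors over the whole horizon (assumed uniform stride).
    anchors = sorted(set(range(0, last, root_stride)) | {last})
    levels = [[Call(0, "root", None, None, tuple(anchors))]]

    # Refinement (assumed rule): insert the midpoint of every interval
    # that is still longer than leaf_span.
    while any(b - a > leaf_span for a, b in zip(anchors, anchors[1:])):
        level = len(levels)
        calls = [
            Call(level, "refine", a, b, ((a + b) // 2,))
            for a, b in zip(anchors, anchors[1:])
            if b - a > leaf_span
        ]
        levels.append(calls)
        anchors = sorted(anchors + [c.targets[0] for c in calls])

    # Leaves: densely fill the interior of each anchor-bracketed interval.
    level = len(levels)
    leaves = [
        Call(level, "leaf", a, b, tuple(range(a + 1, b)))
        for a, b in zip(anchors, anchors[1:])
        if b - a > 1
    ]
    if leaves:
        levels.append(leaves)
    return levels


levels = build_schedule(num_frames=65, root_stride=32, leaf_span=8)
for calls in levels:
    print(calls[0].level, calls[0].kind, len(calls), "call(s)")
```

With 65 frames, a root stride of 32, and a leaf span of 8, this sketch yields four levels: one root call, two refinement levels with 2 and 4 calls, and 8 leaf calls. A left-to-right rollout over the same 8-frame spans would need 8 sequential steps. Calls within a level generate disjoint frames, so under these assumptions they could be dispatched in parallel.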

**Figure 1a.** Tree structure. Three level chunks side by side, each showing the cells visible at that level (conditioning input on the left, generated output on the right).

**Figure 1b.** Sparse-to-dense filling. Each chunk shows the full timeline twice: input row above (state before this level) and output row below (state after).

Quantitative Results

Long-form quality and drift on Wan 2.1 + VACE, averaged across five 30-minute source videos. Each distilled checkpoint is run twice: rolled out autoregressively (AR) and scheduled inside the tree (ATS). Global AQ and IQ are means over 60 uniformly sampled keyframes. For a metric M ∈ {AQ, IQ}, |ΔM|_c is the mean within-chunk drift over the 10 cache-reset windows, and |ΔM|_r is the mean discontinuity across the 9 reset boundaries between them.

| Checkpoint | Sampler | Global AQ ↑ | Global IQ ↑ | Chunk drift \|ΔAQ\|_c ↓ | Chunk drift \|ΔIQ\|_c ↓ | Reset jump \|ΔAQ\|_r ↓ | Reset jump \|ΔIQ\|_r ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LongLive | AR | 55.69 | 61.82 | 3.73 | 5.94 | 4.32 | 6.54 |
| LongLive | ATS (Ours) | 59.69 | 68.46 | 2.51 | 1.50 | 1.77 | 1.35 |
| Reward Forcing | AR | 52.78 | 63.48 | 3.37 | 3.15 | 3.85 | 5.08 |
| Reward Forcing | ATS (Ours) | 58.51 | 69.64 | 2.44 | 1.43 | 2.08 | 1.23 |
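The drift columns can be read as simple statistics over a per-keyframe score series. The sketch below shows one plausible way to compute them, assuming that within-chunk drift is the absolute difference between the last and first keyframe score of each window, and that a reset jump is the absolute difference across each window boundary. These exact definitions, the function name `drift_metrics`, and the toy scores are assumptions made for illustration; they are not taken from the evaluation code.

```python
from statistics import mean
from typing import Dict, List, Sequence


def drift_metrics(scores: Sequence[float], num_windows: int) -> Dict[str, float]:
    """Global mean, mean within-window drift, and mean jump across boundaries."""
    if num_windows < 2 or len(scores) % num_windows != 0:
        raise ValueError("scores must split evenly into at least two windows")
    size = len(scores) // num_windows
    windows: List[Sequence[float]] = [
        scores[i * size:(i + 1) * size] for i in range(num_windows)
    ]
    # Assumed definition: |last - first| inside each cache-reset window.
    chunk_drift = mean([abs(w[-1] - w[0]) for w in windows])
    # Assumed definition: |first of next window - last of previous window|.
    reset_jump = mean(
        [abs(nxt[0] - prev[-1]) for prev, nxt in zip(windows, windows[1:])]
    )
    return {
        "global": mean(scores),
        "chunk_drift": chunk_drift,
        "reset_jump": reset_jump,
    }


# Toy data: 60 keyframes in 10 windows of 6. The "sawtooth" series decays
# inside each window and recovers at every reset; the "flat" series does not.
sawtooth = [70, 69, 68, 67, 66, 65] * 10
flat = [68] * 60
print(drift_metrics(sawtooth, num_windows=10))
print(drift_metrics(flat, num_windows=10))
```

On the toy sawtooth series both drift statistics equal 5, while the flat series scores 0 on both. Ten windows give nine boundaries, matching the counts in the caption.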

**Figure 2.** Per-chunk |ΔIQ| (left) and per-reset |ΔIQ| (right), averaged across all five conditioning modalities (inpainting, outpainting, edge, pose, depth). ATS stays flat over the full 30-minute horizon, while both AR baselines drift within chunks and jump at every cache reset.

**Figure 3.** Runtime vs. generated duration on Wan 2.1 + VACE. AR baselines scale linearly in T; ATS scales logarithmically thanks to parallel sibling calls. On 8 GPUs, ATS is **5.3×** faster than the strongest AR baseline at a 2000-second horizon and runs faster than real time.
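The scaling argument can be illustrated with a toy wall-clock model. The sketch below assumes every generator call takes the same time, that the calls of one level run in waves of at most `num_gpus` calls, and that the tree is a binary bisection whose level sizes double up to the leaf count. The function names and these assumptions are hypothetical, and the model is not calibrated to the measurements in Figure 3.

```python
from math import ceil
from typing import List


def binary_tree_levels(num_leaves: int) -> List[int]:
    """Calls per level: root, refinement levels 1, 2, 4, ..., then the leaves."""
    if num_leaves < 1 or num_leaves & (num_leaves - 1):
        raise ValueError("num_leaves must be a power of two")
    levels = [1]                  # root call
    n = 1
    while n < num_leaves:         # each refinement level doubles the intervals
        levels.append(n)
        n *= 2
    levels.append(num_leaves)     # leaf calls
    return levels


def ats_wall_clock(num_leaves: int, num_gpus: int, seconds_per_call: float) -> float:
    """Levels run one after another; calls inside a level run in GPU-sized waves."""
    waves = sum(ceil(n / num_gpus) for n in binary_tree_levels(num_leaves))
    return waves * seconds_per_call


def ar_wall_clock(num_chunks: int, seconds_per_call: float) -> float:
    """Autoregressive rollout: every chunk waits for the previous one."""
    return num_chunks * seconds_per_call


for leaves in (8, 64, 512):
    print(
        leaves,
        ar_wall_clock(leaves, 1.0),
        ats_wall_clock(leaves, num_gpus=8, seconds_per_call=1.0),
        ats_wall_clock(leaves, num_gpus=leaves, seconds_per_call=1.0),
    )
```

In this model the tree grows with the number of levels, roughly log2 of the leaf count, as long as every level fits on the available GPUs. Once a level holds more calls than GPUs, the extra waves add a term proportional to the number of calls divided by `num_gpus`.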

Qualitative Comparisons

30-minute Wan 2.1 + VACE reels. For each modality we show the source clip alongside two samplers per distilled checkpoint (LongLive and Reward Forcing): autoregressive rollout (AR) and ATS on the same checkpoint.

Each modality below is shown as a video grid with the same panel labels: Conditioning, LongLive, Reward Forcing, Input, and ATS (Ours).

Pose

Canny

Outpaint

Depth

Inpaint

BibTeX

@misc{bendel2026goodbyedriftanchoredtree,
      title={Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation}, 
      author={Matthew Bendel and Stephen W. Bailey and Mithilesh Vaidya and Sumukh Badam and Xingzhe He},
      year={2026},
      eprint={2605.20476},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.20476}, 
}
