Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He
Descript, Inc.

40+ minute V2V using LTX-2.3

ATS is a training-free scheduler that works with any V2V generator supporting two-sided, anchor-bracketed sampling. Here we showcase ≥ 40-minute generations on LTX-2.3 across five conditioning modalities. Further down, we quantify ATS head-to-head against autoregressive baselines on Wan 2.1 + VACE.

Shown here: Inpainting, Outpainting, Canny, Depth.

Abstract

Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues, which manifest as object-permanence failures or improperly rendered transient content (e.g., an object that appears in non-consecutive frames changing color or style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead focus on drift directly and introduce Anchored Tree Sampling (ATS): a training-free, inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L + 1 tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). ATS outperforms both baselines in overall quality and in drift prevention. We additionally demonstrate stable ≥ 40-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as to the dynamic-camera and multi-shot regimes.

Method

ATS replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A conditioning-only root call emits sparse anchors over the full horizon, optional refinement levels insert intermediate anchors between them, and leaf calls densely fill each interval bracketed by a pair of anchors. Every non-root call is a bidirectional infill bounded by both endpoints, so drift is bounded by the anchors rather than compounding across the horizon. Sibling calls within a level are conditionally independent given their anchors and can run in parallel, so the critical path collapses from K sequential chunks to L + 1 hierarchical steps, with L = O(log T) for a horizon of length T. The construction is training-free: any pretrained two-sided V2V generator (e.g., Wan 2.1 + VACE, LTX-2.3) serves as the base model.
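
The scheduling idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it only builds the list of calls per level and never invokes a generator. The function name `build_schedule` and the parameters `root_anchors`, `branching`, and `leaf_span` are hypothetical, as are the evenly spaced anchor placement and the rule that refinement stops once every gap is at most `leaf_span` frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    level: int   # 0 is the root; larger values are deeper in the tree
    kind: str    # "root", "refine", or "leaf"
    left: int    # frame index of the left anchor
    right: int   # frame index of the right anchor


def build_schedule(num_frames, root_anchors, branching, leaf_span):
    """Return a list of levels; calls inside one level share no dependency.

    All parameters are illustrative assumptions, not values from the paper.
    """
    if num_frames < 2 or root_anchors < 2 or branching < 2 or leaf_span < 1:
        raise ValueError("need num_frames, root_anchors, branching >= 2 and leaf_span >= 1")

    last = num_frames - 1
    # Root call: sparse anchors spread evenly over the full horizon.
    anchors = sorted({round(i * last / (root_anchors - 1)) for i in range(root_anchors)})
    levels = [[Call(0, "root", 0, last)]]

    level = 1
    # Refinement: keep inserting anchors until every gap fits in one leaf call.
    while max(b - a for a, b in zip(anchors, anchors[1:])) > leaf_span:
        calls, refined = [], set(anchors)
        for a, b in zip(anchors, anchors[1:]):
            if b - a > leaf_span:
                calls.append(Call(level, "refine", a, b))
                refined.update(a + round(j * (b - a) / branching) for j in range(1, branching))
        levels.append(calls)
        anchors = sorted(refined)
        level += 1

    # Leaf calls: dense infill between each pair of neighboring anchors.
    levels.append([Call(level, "leaf", a, b) for a, b in zip(anchors, anchors[1:])])
    return levels


schedule = build_schedule(num_frames=1025, root_anchors=5, branching=4, leaf_span=16)
print([len(calls) for calls in schedule])  # [1, 4, 16, 64]
```

In this toy configuration the 64 leaf spans are reached after 4 levels, so the number of dependent steps is the number of levels rather than the number of spans. Every non-root call carries both a `left` and a `right` anchor, which is the two-sided bracketing the text describes.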

Figure 1a. Tree structure. Three level chunks side-by-side, each showing the cells visible at that level (conditioning input on the left, generated output on the right).
Figure 1b. Sparse-to-dense filling. Each chunk shows the full timeline twice: input row above (state before this level) and output row below (state after).

Quantitative Results

Long-form quality and drift on Wan 2.1 + VACE, averaged across five 30-minute source videos. Each distilled checkpoint is run twice: rolled out autoregressively (AR) and scheduled inside the tree (ATS). Global AQ and IQ are means over 60 uniformly sampled keyframes. For each metric M ∈ {AQ, IQ}, |ΔM|_c is the mean within-chunk drift over the 10 cache-reset windows, and |ΔM|_r is the mean discontinuity across the 9 reset boundaries.

Checkpoint      Sampler      Global ↑         Chunk drift ↓        Reset jump ↓
                             AQ      IQ       |ΔAQ|_c  |ΔIQ|_c     |ΔAQ|_r  |ΔIQ|_r
LongLive        AR           55.69   61.82    3.73     5.94        4.32     6.54
LongLive        ATS (Ours)   59.69   68.46    2.51     1.50        1.77     1.35
Reward Forcing  AR           52.78   63.48    3.37     3.15        3.85     5.08
Reward Forcing  ATS (Ours)   58.51   69.64    2.44     1.43        2.08     1.23
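
One way to make the two drift columns concrete is the sketch below. The exact formulas are not given on this page, so the definitions here are assumptions: chunk drift is taken as the absolute change in a per-keyframe score between the first and last keyframe of each cache-reset window, and reset jump as the absolute change between the last keyframe of one window and the first keyframe of the next. The function name `drift_metrics` and the equal-sized windows are likewise hypothetical.

```python
def drift_metrics(scores, num_windows):
    """Return (chunk_drift, reset_jump) for one metric such as AQ or IQ.

    Assumed definitions, not taken from the paper:
      chunk_drift = mean over windows of |last score - first score|
      reset_jump  = mean over boundaries of |first score of next - last score of current|
    """
    if num_windows < 2 or len(scores) % num_windows != 0:
        raise ValueError("scores must split evenly into at least two windows")

    size = len(scores) // num_windows
    windows = [scores[i * size:(i + 1) * size] for i in range(num_windows)]

    # Drift inside each window (e.g. 10 windows over a 30-minute video).
    chunk_drift = sum(abs(w[-1] - w[0]) for w in windows) / num_windows

    # Discontinuity at each boundary between consecutive windows (e.g. 9 boundaries).
    jumps = [abs(nxt[0] - cur[-1]) for cur, nxt in zip(windows, windows[1:])]
    reset_jump = sum(jumps) / len(jumps)
    return chunk_drift, reset_jump


# Toy sawtooth: quality decays inside a window, then snaps back at the reset.
print(drift_metrics([60, 58, 56, 61, 59, 57], num_windows=2))  # (4.0, 5.0)
```

Under these assumed definitions, a perfectly flat score sequence yields zero for both quantities, while a sawtooth pattern of decay followed by recovery at each reset produces large values in both columns.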
Figure 2. Per-chunk |ΔIQ| (left) and per-reset |ΔIQ| (right) averaged across all five conditioning modalities (inpainting, outpainting, edge, pose, depth). ATS stays flat over the full 30-minute horizon while both AR baselines drift within chunks and jump at every cache reset.
Figure 3. Runtime vs. generated duration on Wan 2.1 + VACE. AR baselines scale linearly in T; ATS scales logarithmically thanks to parallel sibling calls. On 8 GPUs, ATS is 5.3× faster than the strongest AR baseline at a 2000-second horizon and runs faster than real time.
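
The scaling argument can be illustrated with a toy cost model. It is only a sketch: it assumes every generator call costs the same, that calls within a level are batched across the available GPUs, and that an AR rollout needs one sequential call per chunk. The per-level call counts are hypothetical, and the model does not attempt to reproduce the measured 5.3× figure.

```python
import math


def ar_rounds(num_chunks: int) -> int:
    """Autoregressive rollout: each chunk waits for the previous one."""
    return num_chunks


def ats_rounds(calls_per_level: list[int], num_gpus: int) -> int:
    """Tree schedule: levels run in order, calls in a level run in parallel batches."""
    return sum(math.ceil(n / num_gpus) for n in calls_per_level)


levels = [1, 4, 16, 64]          # hypothetical tree: root, two refinements, leaves
print(ar_rounds(64))             # 64 sequential rounds for 64 chunks
print(ats_rounds(levels, 8))     # 1 + 1 + 2 + 8 = 12 rounds on 8 GPUs
print(ats_rounds(levels, 64))    # 4 rounds, one per level
```

With enough devices to run a whole level at once, the round count in this model equals the number of tree levels. With fewer devices, the leaf level dominates and the gain over AR comes from dividing the leaf calls across GPUs.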

Qualitative Comparisons

30-minute Wan 2.1 + VACE reels. For each modality we show the source clip alongside two samplers per distilled checkpoint (LongLive and Reward Forcing): autoregressive rollout (AR) and ATS on the same checkpoint.

Modalities shown: Pose, Canny, Outpaint, Depth, Inpaint.
[Video grid per modality. Columns: Conditioning (Input); LongLive (AR, ATS (Ours)); Reward Forcing (AR, ATS (Ours)).]

Additional LTX-2.3 Results

Additional LTX-2.3 generations across the same five conditioning modalities. Each row pairs the conditioning input (left) with the ATS output (right).

Modalities shown: Inpaint, Edge, Outpaint, Pose, Depth.
[Video pair per modality: Input, ATS.]

BibTeX

@article{bendel2026ats,
  title   = {Goodbye Drift: Anchored Tree Sampling for Long-Horizon
             Video-to-Video Generation},
  author  = {Bendel, Matthew and Bailey, Stephen W. and Vaidya, Mithilesh
             and Badam, Sumukh and He, Xingzhe},
  year    = {2026},
  institution = {Descript, Inc.},
}