Video Regenerate

An audio-driven lip-sync model that regenerates the lower face from new audio, powering video editing and translation

Xingzhe He, Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Geoffry Berlin, Keith Simmons, Vicki Anand, Jose Sotelo

Descript, Inc.

[Demo: Editing. Can you see the edit? The regenerated frames are marked.]

[Demo: Translation. The mouth follows the new language: translated and re-synced.]

New audio means new lower-face video. The model generates a mouth that matches what's being said while the speaker's identity, lighting, teeth, and the boundary to untouched video stay put. It powers lip-sync for editing and translation in Descript.

The model

Generation happens in a learned latent space. The codec defines that space; the generator paints the lower face inside it.

[Diagram: Video in → Latent frames → Generator (flow-matching transformer, also fed by Audio in and References, a few frames of you) → Lower-face latents → Video out]

Component A: The Codec

A causal 3D VAE compresses chunks of video frames into a short sequence of continuous latent frames. Convolutions are causal in time, so the codec supports streaming: frames can be encoded and decoded as they arrive. The same encoder handles video and reference images, which puts references and content in one shared latent space; that is what lets the generator concatenate them directly. An auxiliary head aligns the latents with DINO features, anchoring the representation to semantic structure that helps both reconstruction and generation downstream.
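To make the streaming claim concrete, here is a minimal sketch of a convolution that is causal in time, reduced to one scalar feature per frame. It is not the codec's actual layer: the function and class names, the kernel values, and the scalar-per-frame simplification are all assumptions made for illustration. The point it shows is that left-only padding makes frame-by-frame processing match whole-clip processing.

```python
def causal_conv1d(frames, kernel):
    """Causal convolution along time.

    frames: list with one scalar feature per time step (illustrative only).
    kernel: taps over the current frame and the frames before it;
            kernel[-1] multiplies the current frame.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)  # pad on the left only: no future
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]


class StreamingCausalConv:
    """The same filter, fed one frame at a time with a short history buffer."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.history = [0.0] * (len(kernel) - 1)

    def push(self, frame):
        window = self.history + [frame]
        out = sum(w * x for w, x in zip(self.kernel, window))
        self.history = window[1:]  # keep the most recent k-1 frames
        return out


frames = [1.0, 2.0, 4.0, 8.0, 16.0]
kernel = [0.25, 0.5, 1.0]  # hypothetical weights

offline = causal_conv1d(frames, kernel)
stream = StreamingCausalConv(kernel)
online = [stream.push(f) for f in frames]
```

Because no output looks ahead, `online` and `offline` agree, and changing a later frame leaves every earlier output untouched.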

Getting the codec right took most of the project's effort, because what the generator can learn is bounded by how modelable the latent space is. Four choices mattered: continuous latents over discrete, because a discrete codec's reconstruction ceiling kept breaking the system; a high-spatial-compression backbone with DCAE residual blocks for better gradient flow at aggressive compression targets; RMSNorm over GroupNorm, since GroupNorm leaks future-frame statistics across the causal boundary, and an entire class of boundary artifacts disappeared when it went; and the DINO alignment above.
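The normalization point can be illustrated with a toy comparison. The sketch below assumes a tiny "video" of three frames with three features each, and contrasts a per-frame RMS normalization with a normalization whose statistics are pooled over the whole chunk, as GroupNorm's are when time is one of the pooled axes. The axes used in the real codec are not stated in the text, so treat the function names and shapes as assumptions.

```python
import math


def rms_norm_per_frame(video, eps=1e-6):
    """Normalize each frame by its own RMS; no statistics cross time."""
    out = []
    for frame in video:
        rms = math.sqrt(sum(x * x for x in frame) / len(frame) + eps)
        out.append([x / rms for x in frame])
    return out


def norm_pooled_over_time(video, eps=1e-6):
    """GroupNorm-style statistics pooled over every frame in the chunk."""
    values = [x for frame in video for x in frame]
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    scale = math.sqrt(var + eps)
    return [[(x - mean) / scale for x in frame] for frame in video]


past = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0]]
video_a = past + [[0.5, 0.5, 0.5]]
video_b = past + [[9.0, -7.0, 4.0]]  # same past, different future frame

rms_a, rms_b = rms_norm_per_frame(video_a), rms_norm_per_frame(video_b)
pooled_a, pooled_b = norm_pooled_over_time(video_a), norm_pooled_over_time(video_b)

print(rms_a[:2] == rms_b[:2])        # True: past frames ignore the future
print(pooled_a[:2] == pooled_b[:2])  # False: the future frame leaks backward
```

With pooled statistics, the first two frames come out differently depending on a frame that has not arrived yet, which is the kind of leak across the causal boundary described above.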

The codec is trained with four loss terms: mouth-weighted reconstruction (pixel and perceptual losses with extra weight on the mouth region, which is what gets teeth right and, surprisingly, also helps the generator); adversarial losses from full-frame and mouth-crop discriminators; the representation-alignment loss; and a KL term annealed in so the model isn't over-regularized while reconstruction is the priority.
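A rough sketch of how such an objective could be assembled follows. Everything numeric here is a placeholder: the mouth weight, the loss weights, the warmup length, and the linear shape of the KL schedule are assumptions, and the reconstruction term is reduced to a pixel-wise L1 for brevity (the perceptual, adversarial, and alignment terms are passed in as precomputed scalars).

```python
def mouth_weighted_l1(pred, target, mouth_mask, mouth_weight=4.0):
    """Mean absolute error with extra weight where mouth_mask is 1.

    pred, target, mouth_mask: flat lists of equal length, one entry per pixel.
    mouth_weight is a hypothetical value, not taken from the source.
    """
    weights = [1.0 + (mouth_weight - 1.0) * m for m in mouth_mask]
    total = sum(w * abs(p - t) for w, p, t in zip(weights, pred, target))
    return total / sum(weights)


def kl_weight(step, warmup_steps=10_000, max_weight=1e-4):
    """Linear KL annealing: 0 at step 0, max_weight from warmup_steps on."""
    return max_weight * min(1.0, step / warmup_steps)


def codec_loss(recon, adversarial, alignment, kl, step,
               adv_weight=0.1, align_weight=0.5):
    """Weighted sum of the four terms; all weights are placeholders."""
    return (recon
            + adv_weight * adversarial
            + align_weight * alignment
            + kl_weight(step) * kl)
```

An error of the same size costs more inside the mouth mask than outside it, and the KL term contributes nothing at step 0, so early training is driven by reconstruction.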

Component B: The Generator

A transformer trained with flow matching learns to turn noise into clean lower-face latents. There's no exotic architecture: all the conditioning (surrounding video latents, reference frames, audio) is concatenated into a single stream and fed to a standard transformer. The cleverness is in how the inputs are presented, not in the network.
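For readers new to flow matching, the sketch below shows the basic mechanics on a three-element toy latent. It assumes the common straight-line path between noise and data, which the text does not confirm, and it replaces the trained transformer with an oracle that returns the true path velocity. All names are illustrative.

```python
import random


def interpolate(noise, clean, t):
    """Point on the straight path from noise (t=0) to clean latents (t=1)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]


def target_velocity(noise, clean):
    """Velocity of the straight path; it is constant in t."""
    return [c - n for n, c in zip(noise, clean)]


def euler_sample(velocity_fn, noise, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
clean = [0.5, -1.0, 2.0]  # stand-in for lower-face latents
noise = [rng.gauss(0.0, 1.0) for _ in clean]


def oracle(x, t):
    """Stand-in for the trained model: returns the exact path velocity."""
    return target_velocity(noise, clean)


sample = euler_sample(oracle, noise)
```

In training, the network would see `interpolate(noise, clean, t)` plus the conditioning stream and be asked to predict `target_velocity(noise, clean)`; at inference, the same integration loop runs with the network in place of `oracle`.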

Audio is encoded through Whisper and compressed to a fixed-length sequence aligned to the latent timeline, since raw audio embeddings are too long to attend over directly. Reference frames, selected from the video itself, carry identity: the timbre of the face, so to speak, in skin, teeth, and lighting. Training uses velocity prediction with masked supervision: the loss is computed only on the lower-face region. At inference, three-way classifier-free guidance over audio and appearance lets lip accuracy and visual fidelity be traded off independently.
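The masked loss and the guidance step can be sketched as follows. The text does not spell out how the three guidance branches are combined, so the nested form below (unconditional, then appearance only, then appearance plus audio) is an assumption borrowed from common two-condition guidance setups, and the scale values are hypothetical.

```python
def masked_velocity_loss(pred_v, target_v, lower_face_mask):
    """Mean squared error over positions where lower_face_mask is 1."""
    num = sum(m * (p - t) ** 2
              for p, t, m in zip(pred_v, target_v, lower_face_mask))
    return num / max(1, sum(lower_face_mask))


def guided_velocity(v_uncond, v_appearance, v_full,
                    appearance_scale=1.5, audio_scale=3.0):
    """Combine three forward passes with two independent guidance scales.

    v_uncond:     audio and references dropped
    v_appearance: references kept, audio dropped
    v_full:       references and audio kept
    The default scales are hypothetical.
    """
    return [
        u + appearance_scale * (a - u) + audio_scale * (f - a)
        for u, a, f in zip(v_uncond, v_appearance, v_full)
    ]
```

With both scales set to 1 the result reduces to the fully conditioned prediction. Raising `audio_scale` alone pushes the output further along the direction that audio adds, without changing how strongly the references are weighted.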

The mechanism that makes all of this honest is aggressive, independent dropout on the previous frame and the references. Without it the model collapses to copying: the output looks plausible and stops responding to audio. Dropping each conditioning stream separately and often forces the model to actually listen. Attention sinks round it out, removing a class of intermittent artifacts at long streaming context.
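A minimal sketch of independent conditioning dropout is shown below. The stream names and drop probabilities are hypothetical, and a dropped stream is represented by `None`; a real model would more likely substitute a learned null embedding or zeros.

```python
import random


def drop_conditioning(streams, drop_probs, rng):
    """Independently replace each conditioning stream with None.

    streams:    dict mapping stream name to its conditioning input
    drop_probs: dict mapping stream name to a drop probability;
                names missing from drop_probs are never dropped
    rng:        random.Random instance, so runs are reproducible
    """
    kept = {}
    for name, value in streams.items():
        dropped = rng.random() < drop_probs.get(name, 0.0)  # one draw per stream
        kept[name] = None if dropped else value
    return kept


streams = {
    "previous_frame": "prev-latents",
    "references": "ref-latents",
    "audio": "audio-features",
}
# Hypothetical rates: the two streams the model could copy from drop often.
drop_probs = {"previous_frame": 0.5, "references": 0.5, "audio": 0.1}

rng = random.Random(0)
batch = [drop_conditioning(streams, drop_probs, rng) for _ in range(2000)]
```

Because each stream gets its own random draw, training sees every combination, including steps where both the previous frame and the references are missing and audio is the only signal left to follow.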

Video Regenerate

Imputation in an existing video: regenerate a span so the lips follow new audio, while the rest of the take carries through untouched.

Translation

Synthesize the full performance from references plus new audio: the same speaker, same lighting, a new language. Each pair shows an original recording and its translated version. Watch the mouth.

Impact

Edit without re-shooting

Change a word in the transcript and the lower face follows the new audio. Fixing a flubbed line becomes a text edit, not a second take.

Speak every language

Translation generates a matching performance: the mouth follows the dubbed audio while the speaker stays themselves. The dub stops looking like a dub.

Zero-shot

No per-identity models, no fine-tuning. Identity, lighting, and teeth come from a handful of reference frames taken from the video itself.

We build this with consent and privacy at the center. It exists to translate and repair your own recordings, not to put words in someone else's mouth.

Citation

If you reference this work, please cite:

@misc{he2026videoregenerate,
  title         = {Video Regenerate: Audio-Driven Lip-Sync for Editing and Translation},
  author        = {Xingzhe He and Matthew Bendel and Stephen W. Bailey and
                   Mithilesh Vaidya and Sumukh Badam and Geoffry Berlin and
                   Keith Simmons and Vicki Anand and Jose Sotelo},
  year          = {2026},
  howpublished  = {Descript, Inc.},
  url           = {https://descriptinc.github.io/video-regenerate/}
}