Audio Regenerate

Abstract

Most audio edits are obvious: a hard cut, a splice, a re-recorded word that doesn't quite match the rest of the take. Audio Regenerate reframes editing as generative inpainting. Given a recording and a target transcript, the model regenerates only the region that needs to change, while preserving the speaker's identity, intonation, pacing, and room tone from the untouched audio on either side.

The system has two stages. A continuous neural codec compresses audio into a compact stream of latent frames, and a flow-matching transformer fills in masked latent regions conditioned on both the surrounding audio and the desired text. Because generation happens in parallel across the masked span — rather than one sample or token at a time — edits are fast, and the boundaries between original and generated audio are designed to be inaudible.

Method

Editing happens in a learned latent space. The codec defines that space; the generator paints inside it.

Pipeline: Audio in → Latent frames → Generator (flow-matching transformer, also given Text in "… new words …") → Inpainted masked span → Audio out

Component A — The Codec

A neural audio codec learns to compress high-fidelity speech into a short sequence of continuous latent frames — only a few dozen vectors per second — and to reconstruct the waveform from them with high quality. Working in this compact latent space, rather than on raw samples, is what makes editing tractable: the generator reasons over a handful of frames instead of tens of thousands of samples.

Our codec builds on DAC, our previous state-of-the-art neural audio codec, with two changes that matter for editing. First, it reaches roughly 4× higher temporal compression than DAC, so each second of audio becomes far fewer latent frames and the generator models a much shorter sequence per edit, which is faster and easier to learn. Despite this compression, it outperforms DAC in reconstruction quality. Second, we align the latents with a pre-trained semantic latent space, anchoring them to meaningful structure so the representation is more predictable and easier to model.
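
To make the sequence-length argument concrete, the arithmetic below compares frame counts for a 10-second clip. The 44.1 kHz sample rate and hop of 512 samples follow the publicly released DAC 44.1 kHz configuration; the hop of 2048 is only an assumption that reads "roughly 4×" as exactly 4×, and is not a published figure for this codec.

```python
import math

def latent_frames(duration_s: float, sample_rate: int, hop: int) -> int:
    """Number of latent frames a codec with the given hop emits for a clip."""
    return math.ceil(duration_s * sample_rate / hop)

SAMPLE_RATE = 44_100
DAC_HOP = 512               # public DAC 44.1 kHz configuration
ASSUMED_HOP = DAC_HOP * 4   # assumption: "roughly 4x" taken as exactly 4x

raw_samples = 10 * SAMPLE_RATE
dac_frames = latent_frames(10, SAMPLE_RATE, DAC_HOP)
new_frames = latent_frames(10, SAMPLE_RATE, ASSUMED_HOP)

print(raw_samples, dac_frames, new_frames)  # 441000 862 216
```

Under these assumptions the generator sees about 216 frames instead of 862 for the same ten seconds, or roughly 21.5 frames per second.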

A third ingredient is how the representation handles loudness. We disentangle signal power from the rest of the content into dedicated channels, which makes the latent space markedly easier to model — speeding up generator convergence and letting classifier-free guidance act only on the power-invariant content, so edits match the level of their surroundings without artifacts.
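
One way to picture guidance that acts only on the power-invariant content is sketched below. The channel layout (power stored in the last `n_power` channels) and the function name `guide_content_only` are hypothetical; the text above does not say how the restriction is implemented, so treat this as an illustration of the idea rather than the actual mechanism.

```python
import numpy as np

def guide_content_only(cond, uncond, scale, n_power):
    """Classifier-free guidance applied to content channels only.

    cond, uncond: arrays of shape (frames, channels), model outputs with and
    without conditioning. The last `n_power` channels are assumed to carry
    signal power and are copied from the conditional output unchanged.
    """
    cond = np.asarray(cond, dtype=float)
    uncond = np.asarray(uncond, dtype=float)
    guided = uncond + scale * (cond - uncond)        # standard CFG extrapolation
    if n_power > 0:
        guided[:, -n_power:] = cond[:, -n_power:]    # leave power unguided
    return guided

cond = np.ones((3, 4))
uncond = np.zeros((3, 4))
guided = guide_content_only(cond, uncond, scale=2.0, n_power=1)
```

In this toy case the three content channels are pushed to 2.0 by guidance while the power channel stays at the conditional value of 1.0, so a strong guidance scale cannot inflate or deflate the level of the edit.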

Component B — The Generator

A transformer trained with flow matching learns to turn noise into clean latent frames. At edit time we mask the region to replace, fill it with noise, and let the model denoise it over a handful of steps — conditioned on the clean latents around it.
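
The edit-time procedure can be sketched as a plain Euler sampler. Everything here is an assumption made for illustration: `velocity_fn` stands in for the trained transformer, eight steps is an arbitrary choice for "a handful", and the toy velocity field simply points at a known target so the example runs without a model.

```python
import numpy as np

def inpaint(clean, mask, velocity_fn, steps=8, seed=0):
    """Euler sampler for flow-matching inpainting.

    clean: (frames, dim) latents of the original audio.
    mask: (frames,) bool array, True where frames are regenerated.
    velocity_fn(x, t): stand-in for the trained model; returns dx/dt.
    """
    rng = np.random.default_rng(seed)
    m = mask[:, None]
    # Masked frames start as noise; context frames keep their clean latents.
    x = np.where(m, rng.standard_normal(clean.shape), clean)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        # Only masked frames move; context is reset to the clean latents.
        x = np.where(m, x + dt * v, clean)
    return x

clean = np.zeros((6, 2))
mask = np.array([False, False, True, True, False, False])
target = np.ones((6, 2))   # toy "ground truth" for the masked frames

# Toy velocity for a straight noise-to-data path: points at the target.
out = inpaint(clean, mask, lambda x, t: (target - x) / (1.0 - t))
```

With the toy field the masked frames land on the target after the final step, and the context frames are returned untouched.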

Conditioning comes from two sides at once: the surrounding audio anchors voice and acoustics, while the transcript specifies what should be said. Independent guidance on each lets us trade off fidelity to context against faithfulness to the requested words. The whole span is generated in parallel — not left to right — and the model is trained to stitch into untouched audio without seams.
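
Independent guidance on two conditions is often written as a nested combination of three model outputs. The sketch below uses that common formulation as an assumption; the source does not state which composition is used, and the names `w_ctx` and `w_txt` are hypothetical.

```python
import numpy as np

def dual_guidance(v_none, v_ctx, v_full, w_ctx, w_txt):
    """Combine three model outputs with independent guidance weights.

    v_none: prediction with neither audio context nor text.
    v_ctx:  prediction with audio context only.
    v_full: prediction with audio context and text.
    w_ctx:  how strongly to follow the surrounding audio.
    w_txt:  how strongly to follow the requested words.
    """
    return v_none + w_ctx * (v_ctx - v_none) + w_txt * (v_full - v_ctx)

v_none = np.array([0.0, 0.0])
v_ctx = np.array([1.0, 0.0])
v_full = np.array([1.0, 1.0])

plain = dual_guidance(v_none, v_ctx, v_full, w_ctx=1.0, w_txt=1.0)
text_heavy = dual_guidance(v_none, v_ctx, v_full, w_ctx=1.0, w_txt=2.0)
```

With both weights at 1 the result is just the fully conditioned prediction; raising `w_txt` alone pushes harder toward the transcript without changing how strongly the context is followed.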

Crucially, this is zero-shot: there are no per-speaker models and no lengthy enrollment prompts. The model infers the speaker's identity — timbre, accent, and delivery — from as little as 5–6 seconds of the adjacent recording. It also carries over the background noise and room tone, so a regenerated word inherits the same acoustic fingerprint as the audio around it, rather than snapping to a clean "studio" voice. That's what makes the edit feel seamless.
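
As a rough illustration of how an edit and its adjacent context might be laid out in latent frames, the helper below builds a window and a mask. The function name, the 6-second default (taken from the upper end of "5–6 seconds"), and the 20 frames-per-second rate in the example are all assumptions, not details of the actual system.

```python
import math

def build_edit_window(edit_start_s, edit_end_s, total_s, frame_rate, context_s=6.0):
    """Return (window_start, window_end, mask) in latent-frame units.

    The window covers the edit plus up to `context_s` seconds on each side,
    clipped to the recording. mask[i] is True for frames to regenerate.
    """
    if not 0 <= edit_start_s < edit_end_s <= total_s:
        raise ValueError("edit span must lie inside the recording")
    lo = math.floor(max(0.0, edit_start_s - context_s) * frame_rate)
    hi = math.ceil(min(total_s, edit_end_s + context_s) * frame_rate)
    m0 = math.floor(edit_start_s * frame_rate)
    m1 = math.ceil(edit_end_s * frame_rate)
    mask = [m0 <= f < m1 for f in range(lo, hi)]
    return lo, hi, mask

# One-second edit at 10 s in a 60 s recording, at a hypothetical 20 frames/s.
lo, hi, mask = build_edit_window(10.0, 11.0, 60.0, frame_rate=20.0)
```

Here the window spans frames 80 to 340, of which only 20 are masked; the remaining 240 frames are the context from which voice and room tone would be inferred.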

Figure (denoising animation, shown at step 0 of 8): Context frames stay fixed; masked frames start as noise and are refined toward clean latents.

Impact

The model is especially useful for two kinds of edits:

Change words. You no longer need to re-record to change what was said — just edit the transcript and the model regenerates those words in the original voice, making otherwise impossible edits possible. We build this with consent and privacy at the center, so the technology is used to repair and polish your own recordings, not to put words in someone else's mouth.
Smooth cuts. Removing a segment often leaves a harsh, abrupt join where the remaining audio meets. Audio Regenerate can regenerate the words on either side of the cut so the join flows naturally and the edit point becomes inaudible instead of jarring.
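
For the "change words" case, a transcript edit has to be turned into a time span to regenerate. A minimal sketch, assuming word-level timestamps are available from some alignment step: the function name and the `(word, start_s, end_s)` format are hypothetical, and Python's standard `difflib` is used here only as a convenient way to diff word lists.

```python
from difflib import SequenceMatcher

def changed_span(words, new_text):
    """Find the time span to regenerate after a transcript edit.

    words: list of (word, start_s, end_s) for the original recording.
    new_text: the edited transcript.
    Returns (start_s, end_s, replacement_words), or None if nothing changed.
    """
    old = [w for w, _, _ in words]
    new = new_text.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not ops:
        return None
    # Merge all changes into one span from the first to the last difference.
    i1, i2 = ops[0][1], ops[-1][2]
    j1, j2 = ops[0][3], ops[-1][4]
    if i1 < len(words):
        start = words[i1][1]
    else:
        start = words[-1][2] if words else 0.0
    # A pure insertion has no old words, so the span has zero length.
    end = words[i2 - 1][2] if i2 > i1 else start
    return start, end, new[j1:j2]

words = [("the", 0.0, 0.2), ("quick", 0.2, 0.5), ("brown", 0.5, 0.8), ("fox", 0.8, 1.1)]
span = changed_span(words, "the quick red fox")
print(span)  # (0.5, 0.8, ['red'])
```

The returned span marks where the mask would go; the untouched words on either side supply the context.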

Audio Regenerate ships as the technology behind word-level audio editing in Descript, used by podcasters, video creators, and teams to edit spoken audio as easily as text. To learn more about the technology, check out AI-powered audio repair ↗.

Interested in building the next generation of audio models? We're hiring ↗.
