Seamless, context-aware audio editing by generative latent inpainting
Most audio edits are obvious: a hard cut, a splice, a re-recorded word that doesn't quite match the rest of the take. Audio Regenerate reframes editing as generative inpainting. Given a recording and a target transcript, the model regenerates only the region that needs to change, while preserving the speaker's identity, intonation, pacing, and room tone from the untouched audio on either side.
The system has two stages. A continuous neural codec compresses audio into a compact stream of latent frames, and a flow-matching transformer fills in masked latent regions conditioned on both the surrounding audio and the desired text. Because generation happens in parallel across the masked span — rather than one sample or token at a time — edits are fast, and the boundaries between original and generated audio are designed to be inaudible.
Editing happens in a learned latent space. The codec defines that space; the generator paints inside it.
A neural audio codec learns to compress high-fidelity speech into a short sequence of continuous latent frames — only a few dozen vectors per second — and to reconstruct the waveform from them with high quality. Working in this compact latent space, rather than on raw samples, is what makes editing tractable: the generator reasons over a handful of frames instead of tens of thousands of samples.
Our codec builds on DAC, our previous state-of-the-art neural audio codec, with two changes that matter for editing. First, it reaches roughly 4× higher temporal compression than DAC, so each second of audio becomes far fewer latent frames and the generator models a much shorter sequence per edit, which is faster and easier to learn. Despite this compression, it outperforms DAC on reconstruction quality. Second, we align the latents with a pre-trained semantic latent space, anchoring them to meaningful structure so the representation is more predictable and easier to model.
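To make the sequence-length argument concrete, the sketch below counts how many units a generator would have to model for a half-second edit. The numbers are assumptions for illustration, not published figures for this codec: a 44.1 kHz sample rate, a baseline hop of 512 samples per latent frame, and a hypothetical hop four times larger for the more compressed codec. The function names are likewise made up for this example.

```python
import math

SAMPLE_RATE = 44_100               # assumed sample rate in Hz
BASELINE_HOP = 512                 # assumed samples per latent frame (baseline codec)
COMPRESSED_HOP = 4 * BASELINE_HOP  # hypothetical hop with 4x more temporal compression


def frames_for_span(duration_s: float, hop: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of latent frames needed to cover a span of the given duration."""
    samples = round(duration_s * sample_rate)
    return math.ceil(samples / hop)


def frame_rate(hop: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Latent frames produced per second of audio."""
    return sample_rate / hop


edit_s = 0.5  # a half-second edit, roughly one short word
print("raw samples:      ", round(edit_s * SAMPLE_RATE))
print("baseline frames:  ", frames_for_span(edit_s, BASELINE_HOP))
print("compressed frames:", frames_for_span(edit_s, COMPRESSED_HOP))
print("compressed rate:  ", frame_rate(COMPRESSED_HOP), "frames/s")
```

Under these assumed settings, a half-second edit covers 22,050 raw samples, 44 baseline frames, or 11 frames at the higher compression ratio, which is the scale difference the paragraph above describes.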
A third ingredient is how the representation handles loudness. We disentangle signal power from the rest of the content into dedicated channels, which makes the latent space markedly easier to model — speeding up generator convergence and letting classifier-free guidance act only on the power-invariant content, so edits match the level of their surroundings without artifacts.
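As a rough illustration of guidance that acts only on the power-invariant content, the sketch below applies classifier-free guidance to some channels of a latent frame and leaves the rest alone. The channel layout (power channels first), the function name `guide_content_only`, and the parameter `n_power` are assumptions made for this example. The actual representation is described in the PoDAR paper cited at the end.

```python
def guide_content_only(cond, uncond, scale, n_power):
    """Classifier-free guidance restricted to content channels.

    cond, uncond: per-channel model predictions for one latent frame,
                  with and without conditioning.
    scale:        guidance scale (1.0 reproduces the conditional prediction).
    n_power:      number of leading channels assumed to carry signal power.
    """
    guided = []
    for i, (c, u) in enumerate(zip(cond, uncond)):
        if i < n_power:
            # Power channels: keep the conditional prediction as-is, so
            # guidance cannot push loudness away from the surrounding level.
            guided.append(c)
        else:
            # Content channels: standard guidance extrapolation.
            guided.append(u + scale * (c - u))
    return guided


# One power channel followed by two content channels (toy values).
frame = guide_content_only(
    cond=[1.0, 2.0, 3.0], uncond=[0.5, 1.0, 1.0], scale=2.0, n_power=1
)
print(frame)
```

The design point is that a large guidance scale amplifies the difference between conditional and unconditional predictions. Keeping that amplification away from the power channels is one plausible way an edit could stay level-matched with its surroundings.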
A transformer trained with flow matching learns to turn noise into clean latent frames. At edit time we mask the region to replace, fill it with noise, and let the model denoise it over a handful of steps — conditioned on the clean latents around it.
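The edit-time loop can be sketched as a few Euler steps over the masked frames. Everything here is a toy under stated assumptions: each latent frame is a single number, and `toy_velocity` is a hypothetical stand-in for the trained transformer that simply steers masked frames toward the mean of the unmasked context along a straight path. A real model would predict far richer content, and the real sampler may differ.

```python
import random


def toy_velocity(x, t, latents, mask):
    """Hypothetical stand-in for the trained model's velocity prediction.

    Points each frame at the mean of the clean context frames, scaled so
    that a straight-line path reaches that target at t = 1.
    """
    context = [z for z, m in zip(latents, mask) if not m]
    target = sum(context) / len(context)
    return [(target - xi) / (1.0 - t) for xi in x]


def inpaint(latents, mask, steps=8, seed=0):
    """Regenerate masked frames by integrating from noise (t=0) to data (t=1)."""
    if all(mask):
        raise ValueError("need at least one unmasked context frame")
    rng = random.Random(seed)
    # Masked frames start as Gaussian noise; context frames stay clean.
    x = [rng.gauss(0.0, 1.0) if m else z for z, m in zip(latents, mask)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the velocity never divides by zero
        v = toy_velocity(x, t, latents, mask)
        # Every masked frame is updated in the same step (parallel, not
        # left to right); context frames are pinned to their clean values.
        x = [xi + dt * vi if m else z for xi, vi, z, m in zip(x, v, latents, mask)]
    return x


latents = [0.2, 0.4, 9.0, 9.0, 0.6, 0.8]         # frames 2-3 hold audio to replace
mask = [False, False, True, True, False, False]  # True marks the region to regenerate
print(inpaint(latents, mask))
```

Note that the original values in the masked region (9.0 here) never reach the model: they are overwritten with noise before the first step, so the result depends only on the context and the conditioning.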
Conditioning comes from two sides at once: the surrounding audio anchors voice and acoustics, while the transcript specifies what should be said. Independent guidance on each lets us trade off fidelity to context against faithfulness to the requested words. The whole span is generated in parallel — not left to right — and the model is trained to stitch into untouched audio without seams.
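The text does not spell out how the two guidance terms are combined. One common formulation for two conditions, shown here purely as an assumption, nests them: first move from the fully unconditional prediction toward the audio-conditioned one, then from the audio-conditioned prediction toward the one that also sees the transcript. The function and parameter names are hypothetical.

```python
def dual_guidance(v_none, v_audio, v_both, w_audio, w_text):
    """Combine three model predictions with independent guidance weights.

    v_none:  prediction with neither audio context nor transcript.
    v_audio: prediction with audio context only.
    v_both:  prediction with audio context and transcript.
    w_audio: weight on fidelity to the surrounding audio.
    w_text:  weight on faithfulness to the requested words.
    """
    return [
        n + w_audio * (a - n) + w_text * (b - a)
        for n, a, b in zip(v_none, v_audio, v_both)
    ]


# With both weights at 1.0 the result is exactly the fully conditioned prediction.
print(dual_guidance([0.0], [1.0], [3.0], w_audio=1.0, w_text=1.0))
# Raising either weight pushes further along that condition's direction.
print(dual_guidance([0.0], [1.0], [3.0], w_audio=2.0, w_text=1.5))
```

In this form, raising `w_audio` favors matching voice and acoustics, while raising `w_text` favors saying the requested words, which is the trade-off the paragraph above describes. The cost is three model evaluations per denoising step instead of one.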
Crucially, this is zero-shot: there are no per-speaker models and no lengthy enrollment prompts. The model infers the speaker's identity — timbre, accent, and delivery — from as little as 5–6 seconds of the adjacent recording. It also carries over the background noise and room tone, so a regenerated word inherits the same acoustic fingerprint as the audio around it, rather than snapping to a clean "studio" voice. That's what makes the edit feel seamless.
Each clip below has had a span regenerated.
Notice how the model preserves the speaker's identity, intonation, pacing, and room tone (including reverberation, background noise, and even music) from the untouched audio on either side.
The model is used mainly for two kinds of edits:
Audio Regenerate ships as the technology behind word-level audio editing in Descript, used by podcasters, video creators, and teams to edit spoken audio as easily as text. To learn more about the technology, check out AI-powered audio repair.
Interested in building the next generation of audio models? We're hiring.
If you reference the codec's power-disentangled representation, please cite:
@misc{luebs2026podar,
title = {PoDAR: Power-Disentangled Audio Representation for Generative Modeling},
author = {Alejandro Luebs and Mithilesh Vaidya and Ishaan Kumar and Sumukh Badam
and Stephen W. Bailey and Matthew Bendel and Jose Sotelo and Xingzhe He},
year = {2026},
eprint = {2605.10084},
archivePrefix = {arXiv},
primaryClass = {eess.AS},
url = {https://arxiv.org/abs/2605.10084}
}