VoiceCraft-Dub: Automated Video Dubbing
with Neural Codec Language Models

¹POSTECH, ²KAIST, ³University of Texas at Austin

Automated video dubbing. (a) Unlike text-to-speech, which generates diverse speech based on target text, automated video dubbing requires the synthesized speech to be temporally and expressively aligned with the video while remaining natural and intelligible. (b) Examples of speech synthesized by VoiceCraft-Dub show that each utterance is aligned with the lip movements of the input video. We strongly encourage listening to each of the samples below.

Comparison results on our curated CelebV-Dub dataset


Comparison results on the LRS3 dataset

Comparison results for cross-speaker synthesis on the LRS3 dataset