Sony’s DubWise Uses Visual Cues from the Video to Improve AI Dubbing

The system first uses an LLM to generate the translated text and then employs a duration prediction model that takes into account both the text and visual cues from the video, such as the speaker’s lip movements and facial expressions.

The researchers chose GPT-2 for multilingual TTS due to its smaller model size and wider adaptability in state-of-the-art TTS systems.

“Our method utilizes visual cues extracted from the video to achieve duration controllability in GPT-based TTS while maintaining intelligibility and speech quality,” they said.


According to the researchers, DubWise can address the challenging problem of audio-visual alignment after dubbing. They explained that traditional AI dubbing technologies often fail to align the dubbed audio with the video, so the speech appears out of sync with the speaker on screen. This misalignment occurs because TTS-generated speech in the target language often has a different length than the original audio, they added.
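
The length mismatch the researchers describe can be illustrated with a small calculation. The Python sketch below is not DubWise's method, which controls duration during speech generation; it only shows how far a dubbed clip drifts from the original and what uniform playback speed-up a naive after-the-fact fix would need. The function name, field names, and example durations are hypothetical.

```python
def duration_mismatch(source_sec: float, dubbed_sec: float) -> dict:
    """Compare an original speech segment with its dubbed TTS counterpart.

    Illustrative only: names and values are assumptions, not taken from DubWise.
    """
    if source_sec <= 0 or dubbed_sec <= 0:
        raise ValueError("durations must be positive")
    return {
        # Positive when the dubbed speech runs longer than the original.
        "offset_sec": dubbed_sec - source_sec,
        # Playback rate needed to squeeze the dubbed speech into the
        # original time slot (values above 1.0 mean speeding up).
        "speed_factor": dubbed_sec / source_sec,
    }


# Hypothetical example: a 4-second source line whose translation takes 5 seconds.
result = duration_mismatch(source_sec=4.0, dubbed_sec=5.0)
print(result)
```

In this hypothetical case the dubbed line overruns by one second and would need to be played 25% faster to fit, which is the kind of distortion that duration-aware generation aims to avoid.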

First-of-its-Kind Attempt

“This is the first attempt of its kind that utilizes video-based modality for achieving duration controllability in […] LLM-based multimodal TTS,” the researchers stated.

They conducted experiments in both single-speaker and multi-speaker scenarios and used various metrics to evaluate duration control, intelligibility, and lip-sync accuracy.

The researchers say that DubWise outperforms other state-of-the-art methods across various metrics. It achieved improved lip synchronization and naturalness in both same-language and cross-lingual scenarios while maintaining speech intelligibility and quality.

Demo samples are available at https://nirmesh-sony.github.io/DubWise/

Authors: Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah
