The system first uses an LLM to generate the translated text and then employs a duration prediction model that takes into account both the text and visual cues from the video, such as the speaker’s lip movements and facial expressions.
The researchers chose GPT-2 for multilingual TTS due to its smaller model size and wider adaptability in state-of-the-art TTS systems.
“Our method utilizes visual cues extracted from the video to achieve duration controllability in GPT-based TTS while maintaining intelligibility and speech quality,” they said.
According to the researchers, DubWise addresses the challenging problem of keeping dubbed audio aligned with the video. They explained that traditional AI dubbing technologies often fail to align the dubbed audio with the video, so the speech appears out of sync with the speaker's lips. This misalignment occurs because TTS-generated speech in the target language often has a different length than the original audio, they added.
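To make the length mismatch concrete, here is a minimal Python sketch, not part of DubWise itself, that measures how far a dubbed clip drifts from its source. The function name `duration_mismatch` and the example durations are hypothetical, chosen only for illustration.

```python
def duration_mismatch(source_sec: float, dubbed_sec: float) -> dict:
    """Compare the length of a dubbed clip with the original clip.

    Illustrative helper only (an assumption, not the DubWise method).
    Both durations are in seconds and must be positive.
    """
    if source_sec <= 0 or dubbed_sec <= 0:
        raise ValueError("durations must be positive")

    return {
        # Positive when the dubbed speech overruns the original.
        "offset_sec": dubbed_sec - source_sec,
        # Playback speed-up a naive fix would need to fit the original slot.
        "speed_factor": dubbed_sec / source_sec,
    }


# Hypothetical example: a 4-second utterance whose dubbed version lasts 5 seconds.
stats = duration_mismatch(4.0, 5.0)
print(stats)
```

Uniformly speeding up or slowing down the audio by such a factor is the naive remedy and tends to sound unnatural. As the researchers describe it, DubWise instead controls duration within the TTS model itself.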
First-of-its-Kind Attempt
“This is the first attempt of its kind that utilizes video-based modality for achieving duration controllability in […] LLM-based multimodal TTS,” the researchers stated.
They conducted experiments in both single-speaker and multi-speaker scenarios and used various metrics to evaluate duration control, intelligibility, and lip-sync accuracy.
The researchers reported that DubWise outperformed other state-of-the-art methods across these metrics. It achieved improved lip synchronization and naturalness in both same-language and cross-lingual scenarios while maintaining speech intelligibility and quality.
Demo samples are available at https://nirmesh-sony.github.io/DubWise/
Authors: Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah