What are the main advantages?
First, AV2AV offers synchronized lip movements along with the translated speech, simulating real face-to-face conversations and providing a more immersive dialogue experience. Second, it enhances the robustness of the spoken language translation system by leveraging complementary information from audio and visual speech, ensuring accurate translations even in the presence of acoustic noise.
Additionally, the authors suggest that an AV2AV approach provides a faster and more cost-effective solution for audio-visual speech translation than traditional four-stage cascaded approaches, which run automatic speech recognition (ASR), neural machine translation (NMT), text-to-speech synthesis (TTS), and audio-driven talking face generation (TFG) in sequence.
Increased Demand and Effectiveness
The authors stressed that “in today’s world, where millions of multimedia content pieces are generated daily and shared globally in diverse languages, the demand for systems like the proposed AV2AV is anticipated to increase.”
However, developing a direct AV2AV system is challenging due to the lack of existing data for training. While text and speech datasets are abundant, there is a scarcity of parallel audio-visual speech data. “As there is no available AV2AV translation data, it is not feasible to train our model in a parallel AV2AV data setting,” they said.
They explained that one way to address this challenge would be to generate the data artificially by creating speech and video separately. However, they acknowledged that this method may not yield optimal results because of limitations in accurately replicating lip movements. Instead, they demonstrated that the proposed AV2AV framework can be trained using audio-only data and still translate from audio-visual speech to audio-visual speech.
Moreover, as the proposed AV2AV can be trained without using text data, the authors noted that the system can serve languages with no writing systems.
The effectiveness of AV2AV was validated through extensive experiments in a many-to-many language translation setting. Because no previous method could perform AV2AV translation, the authors compared its performance with the state-of-the-art direct audio-visual speech-to-speech translation model, AV-TranSpeech. The results showed that the proposed method is “much more effective” than AV-TranSpeech, especially in the low-resource setting.
A demo page showcasing the AV2AV system is available at choijeongsoo.github.io/av2av.