Although work on so-called “direct” S2ST, which bypasses ASR and MT, is limited, the approach offers a number of benefits, including greater ease in working with languages that lack a written form and in handling content that does not require translation, such as proper nouns.
Translatotron 2 comprises three parts, connected by an attention module: a source speech encoder; a target phoneme decoder; and a target mel-spectrogram synthesizer. The model is jointly trained with a speech-to-speech translation objective and a speech-to-phoneme translation objective.
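As a rough illustration of what joint training means here, the two objectives can be combined into a single quantity that the optimizer minimizes. The sketch below is an assumption for illustration only: the function name `joint_loss` and the `phoneme_weight` parameter are hypothetical, and the article does not say how the researchers actually weight the two objectives.

```python
def joint_loss(s2st_loss: float, phoneme_loss: float, phoneme_weight: float = 1.0) -> float:
    """Combine the two training objectives into one scalar.

    s2st_loss:      loss of the speech-to-speech objective (spectrogram output).
    phoneme_loss:   loss of the speech-to-phoneme objective (phoneme decoder output).
    phoneme_weight: hypothetical weighting factor; an assumption, not a value
                    taken from the article.
    """
    return s2st_loss + phoneme_weight * phoneme_loss


# Example: both objectives contribute to the value being minimized.
total = joint_loss(s2st_loss=2.0, phoneme_loss=0.5)
```

Because both terms feed one loss, errors in the predicted phonemes and errors in the synthesized spectrogram both influence the shared speech encoder during training.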
The original Translatotron could generate translated speech in a different voice using either a clip of the target speaker’s audio (as reference audio for the speaker encoder) or the embedding of the target speaker. While this capability is potentially useful in industries such as film and gaming, it also made Translatotron “ripe for potential misuse.”
Translatotron 2 takes a different approach to prevent its use in deepfakes. The trained model is restricted to retaining the source speaker’s voice and cannot generate speech in a different speaker’s voice.
Another related improvement is the ability to retain original voices across “speaker turns,” which the authors noted would be challenging for cascade systems. Starting from a TTS model that preserves voices through translation, the researchers augmented the training data so that Translatotron 2 could learn from examples with speaker turns.
The researchers added that these kinds of modifications can increase “the diversity of the speech content as well as the complexity of the acoustic conditions in the training examples, which can further improve the translation quality of the model, especially on small datasets.”
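One plausible way to build training examples with speaker turns is to join pairs of existing examples, so that a single example contains two voices back to back. The sketch below assumes that approach. The `Example` type, the function names, and the `prob` parameter are hypothetical and are not taken from the article.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One training example (hypothetical layout)."""
    source_audio: List[float]   # source-language speech samples
    target_audio: List[float]   # target-language speech samples
    target_phonemes: List[str]  # target-language phoneme sequence


def concat_pair(first: Example, second: Example) -> Example:
    """Join two examples so the result contains a speaker turn."""
    return Example(
        source_audio=first.source_audio + second.source_audio,
        target_audio=first.target_audio + second.target_audio,
        target_phonemes=first.target_phonemes + second.target_phonemes,
    )


def augment_with_speaker_turns(
    examples: List[Example], rng: random.Random, prob: float = 0.5
) -> List[Example]:
    """With probability `prob`, extend each example with another random one."""
    augmented = []
    for example in examples:
        if len(examples) > 1 and rng.random() < prob:
            # Pick a different example to append after the current one.
            other = rng.choice([e for e in examples if e is not example])
            augmented.append(concat_pair(example, other))
        else:
            augmented.append(example)
    return augmented
```

Joining examples in this way also makes each training example longer and more acoustically varied, which fits the researchers' remark about more diverse speech content and more complex acoustic conditions.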