In the composite model, the researchers introduced a simpler form of cross-modality learning based on speech-text mapping and matching. This training improves the model's performance and does not require force-aligned speech and text.
For their methodology, the researchers applied machine translation (MT) and automatic speech recognition (ASR) as what they call “auxiliary tasks” in a multi-task learning mode while optimizing the end-to-end speech translation (ST) model.
Multi-task learning (MTL) implies “sharing common knowledge among different tasks” so that the MT task can guide the ST task. However, the researchers stated that the mismatch between the speech and text modalities made this guidance less effective.
The ComSL model was built on existing, fine-tuned models for speech-only and text-only input, and was trained with ST, ASR, and MT as tasks together with a “cross-modality learning (CML)” approach based on paired speech-text input instead of forced alignment.
The training strategy consisted of fine-tuning the language model (with all the paired text data), multi-task learning (with ST, MT, ASR, and CML as tasks), regularization on the MT output (fine-tuning with MT tasks), and freezing the speech encoder (retaining speech representations at the start of fine-tuning).
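As a rough illustration of how several tasks can share one optimization step, the sketch below combines per-task losses into a single weighted objective and freezes the speech encoder for an initial number of steps. It is a minimal sketch, not the researchers' implementation: the names `TaskLosses`, `combined_loss`, and `speech_encoder_frozen`, the equal weights, and the `freeze_steps` parameter are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TaskLosses:
    """Per-task loss values for one training batch (illustrative only)."""
    st: float   # speech translation
    mt: float   # machine translation
    asr: float  # automatic speech recognition
    cml: float  # cross-modality learning


# Hypothetical weights: equal weighting is an assumption for this sketch,
# not a value reported by the researchers.
DEFAULT_WEIGHTS = {"st": 1.0, "mt": 1.0, "asr": 1.0, "cml": 1.0}


def combined_loss(losses: TaskLosses, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Return the weighted sum of the four task losses."""
    return (
        weights["st"] * losses.st
        + weights["mt"] * losses.mt
        + weights["asr"] * losses.asr
        + weights["cml"] * losses.cml
    )


def speech_encoder_frozen(step: int, freeze_steps: int) -> bool:
    """Keep the speech encoder frozen for the first `freeze_steps` updates."""
    return step < freeze_steps


if __name__ == "__main__":
    batch = TaskLosses(st=2.0, mt=1.0, asr=0.5, cml=0.5)
    print(combined_loss(batch))
    print(speech_encoder_frozen(step=0, freeze_steps=100))
```

In a real training loop, the combined value would be the quantity passed to the optimizer, and the frozen flag would decide whether the speech encoder's parameters receive gradient updates.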
400 hours of English
The experiments in this study involved the CoVoST 2 dataset, which comprises translations from 21 languages into English and from English into 15 languages, and includes approximately 400 hours of English recordings and 900 hours of recordings from the 21 other languages.
The researchers focused mainly on speech translation from non-English languages into English, measuring performance with BLEU scores on the CoVoST 2 test set. The baseline models were Whisper and mBART-50, both fine-tuned on CoVoST 2.
The composite model was found to outperform the base speech model (Whisper) and the combination of speech and language models (Whisper+mBART). The incorporation of ST data contributed to a high score on the CoVoST 2 test set, and the composite model also achieved better speech-to-text translation results than those previously reported for end-to-end models trained on the same ST, ASR, and MT tasks.
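For readers unfamiliar with BLEU, the sketch below shows the general idea: clipped n-gram precision for 1- to 4-grams, combined by a geometric mean and scaled by a brevity penalty. It is a simplified, single-reference, sentence-level version with whitespace tokenization and no smoothing, written for this article. It is not the scoring script used in the study, and its numbers will not match those of standard evaluation tools. The function names `ngrams` and `bleu` are assumptions for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU on a 0-100 scale (no smoothing)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(bleu("the cat sat on the mat", "the cat sat on the mat"))
    print(bleu("the cat sat on mat", "the cat sat on the mat"))
```

A hypothesis identical to its reference scores 100, and one sharing no words with it scores 0. Published results are normally computed at corpus level with a standardized tool, so this sketch is only meant to convey what the metric rewards.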