To address these issues, they created a new benchmark called CONTRAPROST (Contrastive Prosody Speech Translation), specifically designed to test prosody awareness in S2TT systems. Leveraging large language models and controllable text-to-speech, they generated prosody-rich data and built “double-contrastive” examples: English sentences spoken with different prosody, where tone and emphasis lead to distinct interpretations.
CONTRAPROST evaluates model performance across five key prosodic features: sentence stress, prosodic breaks, intonation patterns, emotional tone, and politeness. Because its construction is largely automated, it can be expanded to support broader analyses.
“We take steps toward a reliable and comprehensive evaluation methodology, which is one of the most important prerequisites for achieving prosody-aware S2TT,” the researchers stated.
Apple’s Call for More Prosody-Aware Training Data
Using CONTRAPROST, the team tested a range of S2TT systems, including Meta’s SEAMLESSM4T. These covered both end-to-end models, which access audio signals directly, and cascaded systems, which rely on a separate transcription stage.
They observed that while S2TT models have some internal representation of prosody, “the prosody signal is often not strong enough to affect the translations.” Although end-to-end models slightly outperformed cascaded systems in prosody-specific evaluations, neither model type consistently applied prosodic cues in ways that improved translation quality.
To improve prosody awareness, the researchers suggest that S2TT systems would benefit from training on more prosody-rich data, potentially enabling models to become more contextually aware and better at capturing speaker tone, emotional nuance, and intent. “The most important implication of our findings is the need for exploring improvements of S2TT regarding prosody-awareness,” the team emphasized.
“We hope that our benchmark and findings will motivate more research into prosody-aware S2TT in the future, enabling us to better understand it and improve it,” they concluded.
Authors: Ioannis Tsiamas, Matthias Sperber, Andrew Finch, Sarthak Garg