The researchers argue that this “narrow focus” simplifies the problem by avoiding challenges such as latency, segmentation, and synchronization, ultimately hindering the development of systems that can work in real time without human intervention.
“Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges,” the researchers said.
Terminological Chaos
They also highlighted inconsistencies in how key terms like “simultaneous,” “streaming,” “online,” and “real-time” are used, making it difficult to compare research findings. “Over 65% of the papers mix and match these terms,” they note, adding that such inconsistencies create “significant ambiguity and confusion in understanding and comparing research work.” The researchers describe this as a “real terminological chaos.”
To address this, they define SimulST as a six-step process that includes audio acquisition, segmentation, and translation, among other steps they consider crucial for a comprehensive understanding of the SimulST task. Additionally, they propose a standardized terminology and taxonomy for SimulST, which categorizes models based on input type (bounded vs. unbounded speech), architecture (cascade vs. direct), and output strategy (incremental vs. re-translation).
They argue that “a clear, consistent, and standardized task definition” is needed to ensure future research aligns with real-world requirements.
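The three axes of the proposed taxonomy can be pictured as a small data structure. The following Python sketch is purely illustrative: the class names, field names, and the example system are assumptions made for this article, not code or naming from the researchers.

```python
from dataclasses import dataclass
from enum import Enum


# The three axes below mirror the taxonomy described in the article.
# All identifiers are hypothetical and chosen for illustration only.
class InputType(Enum):
    BOUNDED = "bounded"        # human pre-segmented speech
    UNBOUNDED = "unbounded"    # continuous audio stream


class Architecture(Enum):
    CASCADE = "cascade"        # speech recognition followed by translation
    DIRECT = "direct"          # speech translated without a recognition step


class OutputStrategy(Enum):
    INCREMENTAL = "incremental"
    RETRANSLATION = "re-translation"


@dataclass(frozen=True)
class SimulSTSystem:
    name: str
    input_type: InputType
    architecture: Architecture
    output_strategy: OutputStrategy

    def label(self) -> str:
        """Return a compact description of where the system sits in the taxonomy."""
        return (
            f"{self.input_type.value} / "
            f"{self.architecture.value} / "
            f"{self.output_strategy.value}"
        )


# A made-up system, used only to show how a paper could be tagged.
example = SimulSTSystem(
    name="hypothetical-system",
    input_type=InputType.UNBOUNDED,
    architecture=Architecture.DIRECT,
    output_strategy=OutputStrategy.INCREMENTAL,
)
print(example.label())  # unbounded / direct / incremental
```

Tagging each system along all three axes would make it explicit whether a reported result applies to pre-segmented or continuous input.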
Concrete Recommendations
Another major gap in the field, according to the researchers, is how SimulST models are evaluated. Most studies rely on pre-segmented input, meaning that latency and quality assessments are not representative of real-time conditions. Popular evaluation tools like SimulEval are designed for sentence-based translation and are “not designed to compute scores for audio streams.”
The researchers stress the need for new evaluation frameworks that account for continuous input and computationally aware latency.
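To make the idea of computationally aware latency concrete, here is a toy Python sketch. It assumes one simple reading of the term: a token's delay is the amount of audio consumed before it was emitted, plus the processing time spent up to that point. The function name, inputs, and numbers are hypothetical; this is not the metric defined by the researchers or implemented in SimulEval.

```python
from typing import Optional, Sequence


def mean_delay_ms(
    audio_read_ms: Sequence[float],
    compute_ms: Optional[Sequence[float]] = None,
) -> float:
    """Average per-token delay in milliseconds.

    audio_read_ms[i]: audio consumed when token i was emitted.
    compute_ms[i]:    cumulative processing time when token i was emitted.
                      If omitted, computation is treated as free, which
                      gives the non-computation-aware figure.
    """
    if not audio_read_ms:
        raise ValueError("at least one emitted token is required")
    if compute_ms is None:
        compute_ms = [0.0] * len(audio_read_ms)
    if len(compute_ms) != len(audio_read_ms):
        raise ValueError("both sequences must have one entry per token")
    total = sum(a + c for a, c in zip(audio_read_ms, compute_ms))
    return total / len(audio_read_ms)


# Made-up numbers: three tokens emitted after 1, 2, and 3 seconds of audio.
audio = [1000.0, 2000.0, 3000.0]
compute = [100.0, 250.0, 400.0]

ideal = mean_delay_ms(audio)                    # ignores processing time
compute_aware = mean_delay_ms(audio, compute)   # includes processing time
print(ideal, compute_aware)  # 2000.0 2250.0
```

The gap between the two figures is the cost that goes unreported when latency is measured as if computation were instantaneous.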
Additionally, while direct models (those that bypass automatic speech recognition and translate speech directly) are becoming more common, they are rarely tested on continuous speech streams, raising questions about their practical effectiveness. The researchers note a marked increase in studies employing direct architectures, “almost tripling from 2021 to 2023,” but emphasize that their performance on long-form speech remains largely untested.
The researchers call for a shift in focus toward systems that can handle continuous audio input without relying on human-prepared segmentation. “Humans will not segment our audio,” they warn, emphasizing that the field needs to develop “holistic systems capable of effectively processing and translating continuous audio streams.”
They also stress the importance of context-aware processing, arguing that SimulST models should retain past information to improve translation accuracy over long speech streams.
With these concrete recommendations, the researchers aim to bridge the gaps in existing literature and advance the field “towards more realistic and effective SimulST solutions.”
The team will present their findings at SlatorCon Remote in March 2025.