How “Real” Is Real-Time Simultaneous Speech-to-Text Translation?

Despite advancements in AI speech translation, many so-called “real-time” systems may not be as real-time as they claim.

A new study finds that much of the research in simultaneous speech-to-text translation (SimulST) rests on assumptions that do not reflect real-world conditions, potentially limiting the industry’s ability to deploy truly live, low-latency translation solutions.

In a December 24, 2024, paper, Sara Papi from Fondazione Bruno Kessler and Peter Polák, Ondřej Bojar, and Dominik Macháček from Charles University reviewed 110 papers on SimulST and found that the majority focus on translating pre-segmented speech, where the input has been manually split into short utterances before translation, rather than continuous, unbounded speech streams.

The researchers argue that this “narrow focus” simplifies the problem by avoiding challenges such as latency, segmentation, and synchronization, ultimately hindering the development of systems that can work in real time without human intervention. 

“Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges,” the researchers said.

Terminological Chaos

They also highlight inconsistencies in how key terms like “simultaneous,” “streaming,” “online,” and “real-time” are used, which make it difficult to compare research findings. “Over 65% of the papers mix and match these terms,” they note, adding that such inconsistencies create “significant ambiguity and confusion in understanding and comparing research work.” The researchers describe this as a “real terminological chaos.”

To address this, they define SimulST as a six-step process that includes audio acquisition, segmentation, translation, and more — steps crucial for a comprehensive understanding of the SimulST task. Additionally, they propose a standardized terminology and taxonomy for SimulST, which categorizes models based on input type (bounded vs. unbounded speech), architecture (cascade vs. direct), and output strategy (incremental vs. re-translation). 
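The three axes of the proposed taxonomy can be pictured as a small data structure. The following is only an illustrative sketch: the class and field names (`SimulSTSystem`, `handles_live_stream`, and so on) are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class InputType(Enum):
    BOUNDED = "bounded"      # pre-segmented utterances
    UNBOUNDED = "unbounded"  # continuous audio stream


class Architecture(Enum):
    CASCADE = "cascade"  # speech recognition followed by text translation
    DIRECT = "direct"    # speech translated without a separate ASR step


class OutputStrategy(Enum):
    INCREMENTAL = "incremental"        # emitted text is never revised
    RE_TRANSLATION = "re-translation"  # earlier output may be rewritten


@dataclass(frozen=True)
class SimulSTSystem:
    """Hypothetical record describing one system along the three axes."""
    name: str
    input_type: InputType
    architecture: Architecture
    output_strategy: OutputStrategy

    def handles_live_stream(self) -> bool:
        # Only systems built for unbounded input match the live setting.
        return self.input_type is InputType.UNBOUNDED


example = SimulSTSystem(
    name="example-system",  # placeholder name, not a real system
    input_type=InputType.BOUNDED,
    architecture=Architecture.DIRECT,
    output_strategy=OutputStrategy.INCREMENTAL,
)
print(example.handles_live_stream())
```

Labeling systems this way would make it explicit when a “simultaneous” model has only been studied on bounded input.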

They argue that “a clear, consistent, and standardized task definition” is needed to ensure future research aligns with real-world requirements.

Concrete Recommendations

Another major gap in the field, according to the researchers, is how SimulST models are evaluated. Most studies rely on pre-segmented input, meaning that latency and quality assessments are not representative of real-time conditions. Popular evaluation tools like SimulEval are designed for sentence-based translation and are “not designed to compute scores for audio streams.”

The researchers stress the need for new evaluation frameworks that account for continuous input and computationally aware latency.
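A toy calculation shows why computation time matters more on a continuous stream than on short segments. This sketch is not the metric used by SimulEval or proposed in the paper. The function name `emission_times` and the step format are assumptions for illustration, and the model is deliberately simple: each step can start only once its audio has arrived and the previous step has finished.

```python
def emission_times(steps, computation_aware):
    """Return the time (ms from stream start) at which each step's output appears.

    steps: list of (audio_end_ms, compute_ms) pairs, one per processing step.
    """
    times = []
    clock = 0  # wall-clock time at which the system is free again
    for audio_end_ms, compute_ms in steps:
        if computation_aware:
            # Wait for the audio to arrive and for the previous step to finish.
            start = max(clock, audio_end_ms)
            clock = start + compute_ms
            times.append(clock)
        else:
            # Idealized view: output appears the instant the audio has been read.
            times.append(audio_end_ms)
    return times


# Hypothetical numbers: 1-second audio chunks, 1.2 seconds of compute each.
steps = [(1000, 1200), (2000, 1200), (3000, 1200)]
ideal = emission_times(steps, computation_aware=False)
real = emission_times(steps, computation_aware=True)
lag = [r - i for r, i in zip(real, ideal)]
print(lag)  # [1200, 1400, 1600]
```

In this example the gap between the idealized and the computation-aware view widens with every chunk, an effect that evaluation on short, pre-segmented utterances would hide.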

Additionally, while direct models (those that bypass automatic speech recognition and translate speech directly) are becoming more common, they are rarely tested on continuous speech streams, raising questions about their practical effectiveness. The researchers note a marked increase in studies employing direct architectures, “almost tripling from 2021 to 2023,” but emphasize that their performance on long-form speech remains largely untested.

The researchers call for a shift in focus toward systems that can handle continuous audio input without relying on human-prepared segmentation. “Humans will not segment our audio,” they warn, emphasizing that the field needs to develop “holistic systems capable of effectively processing and translating continuous audio streams.”

They also stress the importance of context-aware processing, arguing that SimulST models should retain past information to improve translation accuracy over long speech streams.
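One simple way to picture context retention on an unbounded stream is a loop that keeps the most recent outputs in a fixed-size buffer and hands them to the translation step. This is a minimal sketch under stated assumptions, not the authors' method: `translate_stream`, `translate_chunk`, and `max_context` are hypothetical names, and the translation function is supplied by the caller.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Sequence


def translate_stream(
    chunks: Iterable[bytes],
    translate_chunk: Callable[[bytes, Sequence[str]], str],
    max_context: int = 3,
) -> Iterator[str]:
    """Translate audio chunks one by one, passing recent outputs as context."""
    # Bounded buffer: the oldest entry is dropped once max_context is reached,
    # so memory stays constant however long the stream runs.
    context = deque(maxlen=max_context)
    for chunk in chunks:
        text = translate_chunk(chunk, tuple(context))
        if text:  # a chunk may produce no output yet
            context.append(text)
            yield text


def fake_translate(chunk: bytes, context: Sequence[str]) -> str:
    # Stand-in for a real model: reports how much context it received.
    return f"{chunk.decode()}|{len(context)}"


outputs = list(translate_stream([b"a", b"b", b"c", b"d"], fake_translate, max_context=2))
print(outputs)  # ['a|0', 'b|1', 'c|2', 'd|2']
```

Because the function consumes any iterable, the same loop works whether the chunks come from a file or from a live audio source that never ends.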

With these concrete recommendations, the researchers aim to bridge the gaps in existing literature and advance the field “towards more realistic and effective SimulST solutions.”

The team will present their findings at SlatorCon Remote in March 2025.