Cascades Still Outperform SpeechLLMs in Translation, Study Finds

Large language models (LLMs) are rapidly expanding beyond text, with speech emerging as the next major frontier. From OpenAI and Meta to Nvidia, Alibaba, and Mistral, vendors are racing to build SpeechLLMs — models that can “listen” and translate speech directly, without relying on traditional cascaded pipelines.

But does integrating speech natively into LLMs actually improve speech translation?

A new large-scale study suggests: not yet.

In a December 24, 2025 paper, researchers from Fondazione Bruno Kessler, Barcelona Supercomputing Center, University of Zurich, ETH Zurich, and several European universities introduced a comprehensive benchmark suite designed specifically to evaluate SpeechLLMs against established speech translation architectures.

In total, the researchers tested 21 systems, including five SpeechLLMs such as Voxtral, Qwen2-Audio, and Phi-4-Multimodal; four standalone Speech Foundation Models (SFMs) including Whisper, SeamlessM4T, and Canary; and 12 cascaded pipelines combining ASR models with text-based LLMs like Aya Expanse, Gemma, and Tower+.

These systems were evaluated across 16 benchmarks, spanning 13 language pairs — covering both into-English and out-of-English translation across major European languages and Chinese — and nine real-world conditions, such as noise, accents, disfluencies, code-switching, emotion, and long-form speech.

Cascades Still Dominate

Across most benchmarks, cascaded systems remain “the most reliable” and the highest-quality option, the researchers found.

Pipelines pairing strong automatic speech recognition (ASR) models (notably Whisper, Canary, or SeamlessM4T) with powerful text LLMs (like Aya, Gemma 3, or Tower+) consistently delivered the best results across languages and conditions.

Where SpeechLLMs Do Shine

The researchers noted that SpeechLLMs matched or outperformed cascades only in specific conditions, notably:

  • Noisy audio: cascaded systems often hallucinate when ASR performance degrades, and downstream LLMs, lacking access to the original audio, propagate or amplify those errors. SpeechLLMs, which retain access to the original audio signal, proved more robust.
  • Code-switching: SpeechLLMs handled mixed-language speech more effectively than traditional pipelines.
  • Disfluent speech: some SpeechLLMs showed stronger resilience to hesitations and repetitions.
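
The structural difference behind the noisy-audio finding can be sketched in a few lines of Python. This is a toy illustration only, not code from the study: every function name, the misheard sentence, and the tiny lexicon are hypothetical stand-ins for real ASR, text LLM, and SpeechLLM components. It shows that in a cascade the translation stage receives nothing but the transcript, so a recognition error is passed straight through.

```python
from typing import Callable

Audio = bytes  # placeholder type for raw audio; an assumption for this sketch


def cascade_translate(
    audio: Audio,
    asr: Callable[[Audio], str],
    translate_text: Callable[[str], str],
) -> str:
    """Two-stage pipeline: the translator sees only the ASR transcript."""
    transcript = asr(audio)
    return translate_text(transcript)


def direct_translate(audio: Audio, speech_llm: Callable[[Audio], str]) -> str:
    """Single-stage model: the translation is conditioned on the audio itself."""
    return speech_llm(audio)


# --- Hypothetical toy components, for illustration only ---

def noisy_asr(audio: Audio) -> str:
    # Pretend background noise made the recognizer mishear "ship" as "sheep".
    return "the sheep is leaving"


def toy_text_translator(text: str) -> str:
    # A faithful English-to-German translator: it translates whatever it is given.
    lexicon = {
        "the ship is leaving": "das Schiff legt ab",
        "the sheep is leaving": "das Schaf geht weg",
    }
    return lexicon[text]


result = cascade_translate(b"<noisy audio>", noisy_asr, toy_text_translator)
print(result)  # das Schaf geht weg
```

The translator here behaves correctly on its input, yet the output is wrong because the mistake was made upstream and the audio is no longer available to correct it. A model called through `direct_translate` has no such intermediate bottleneck, which is consistent with the robustness the researchers report, although the sketch itself says nothing about how well any real system performs.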

Among the SpeechLLMs evaluated, Voxtral stood apart. The researchers noted that it was the only SpeechLLM to consistently approach, or occasionally surpass, top cascaded systems, particularly for long-form and complex speech.

In conclusion, the researchers stress that “no single paradigm dominates universally.” While SpeechLLMs show growing promise in specific, challenging conditions, cascaded architectures remain the most reliable overall approach for speech translation today.

Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, and Maike Züfle