It also introduces a dedicated long-form transcription track, reflecting real-world use cases such as meetings and podcasts. “A separate long-form evaluation is necessary because some models employ chunking strategies to reduce inference time, which can in turn affect transcription quality,” the researchers explained.
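As a rough illustration of the chunking the researchers refer to, the sketch below splits a long recording into fixed-length, overlapping windows that a model could transcribe independently. The function name `chunk_spans`, the 30-second window, and the 5-second overlap are assumptions made for this example and are not taken from the leaderboard or from any specific model.

```python
from __future__ import annotations


def chunk_spans(
    total_seconds: float,
    chunk_seconds: float = 30.0,    # assumed window length, for illustration only
    overlap_seconds: float = 5.0,   # assumed overlap between neighbouring windows
) -> list[tuple[float, float]]:
    """Return (start, end) times, in seconds, of overlapping windows over a recording."""
    if chunk_seconds <= 0 or not (0 <= overlap_seconds < chunk_seconds):
        raise ValueError("need chunk_seconds > 0 and 0 <= overlap_seconds < chunk_seconds")

    step = chunk_seconds - overlap_seconds
    spans: list[tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the recording
        start += step
    return spans
```

Each window can be decoded on its own, which is what makes chunking fast, but words that straddle a boundary have to be reconciled when the partial transcripts are merged, and that is one place where accuracy can slip.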
Who’s Leading the Pack
NVIDIA’s NeMo Canary Qwen 2.5b tops the English leaderboard with a 5.63% word error rate (WER), followed by IBM’s Granite Speech 3.3, Microsoft’s Phi-4 Multimodal Instruct, and NVIDIA’s Parakeet.
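For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation; lowercasing and whitespace splitting stand in for text normalization here as a simplifying assumption, and the leaderboard's own normalization and scoring pipeline may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    # Simplified normalization (an assumption for this sketch): lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word present in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 5.63% WER corresponds to roughly five or six word errors per 100 reference words.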
In multilingual transcription, Microsoft’s Phi-4 Multimodal Instruct and NVIDIA’s Canary 1B v2 perform best, with average WERs of 3–5% across European languages. Yet the data reveals a familiar trade-off: models optimized for English tend to generalize less well, while multilingual systems slightly trail in English accuracy.
For long-form transcription, ElevenLabs leads with the most accurate results, while RevAI and Speechmatics follow closely. Among open-source models, OpenAI’s Whisper Large v3 ranks highest, with distilled versions offering faster inference.
Open Collaboration
The entire leaderboard infrastructure is open for contributions. Developers can submit new models or datasets through GitHub pull requests, and results update automatically on the Hugging Face Hub.
Authors: Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Koluguri, Piotr Zelasko, Somshubra Majumdar, Adel Moumen, and Sanchit Gandhi