NVIDIA, Microsoft, ElevenLabs Top New Automatic Speech Recognition Leaderboard

It also introduces a dedicated long-form transcription track, reflecting real-world use cases such as meetings and podcasts. “A separate long-form evaluation is necessary because some models employ chunking strategies to reduce inference time, which can in turn affect transcription quality,” the researchers explained.
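To make the chunking trade-off concrete, here is a minimal sketch of how a long recording might be split into overlapping windows before transcription. The function name `chunk_bounds` and the 30-second window and 5-second overlap are illustrative assumptions, not values taken from the leaderboard or from any model on it.

```python
def chunk_bounds(total_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds for overlapping chunks.

    Hypothetical helper: the default sizes are assumptions chosen for
    illustration, not settings used by any specific ASR system.
    """
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    bounds = []
    start = 0.0
    step = chunk_s - overlap_s  # how far each new chunk advances
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # the last chunk reached the end of the audio
            break
        start += step
    return bounds


if __name__ == "__main__":
    # A 70-second recording yields three overlapping windows.
    print(chunk_bounds(70))
```

Each chunk can be transcribed independently, which is what makes the approach fast. The overlapping regions must then be reconciled, and words that straddle a boundary are one place where quality can suffer.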

Who’s Leading the Pack

NVIDIA’s NeMo Canary Qwen 2.5b tops the English leaderboard with a 5.63% word error rate (WER), followed by IBM’s Granite Speech 3.3, Microsoft’s Phi-4 Multimodal Instruct, and NVIDIA’s Parakeet.
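For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name `word_error_rate` is hypothetical, and the sketch skips the text normalization (casing, punctuation) that evaluations typically apply before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length.

    Illustrative sketch only: splits on whitespace and applies no
    text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words.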

In multilingual transcription, Microsoft’s Phi-4 Multimodal Instruct and NVIDIA’s Canary 1B v2 perform best, with average WERs between 3% and 5% across European languages. Yet the data reveals a familiar trade-off: models optimized for English tend to generalize less well to other languages, while multilingual systems slightly trail in English accuracy.

For long-form transcription, ElevenLabs delivers the most accurate results, followed closely by RevAI and Speechmatics. Among open-source models, OpenAI’s Whisper Large v3 ranks highest, with distilled versions offering faster inference.

Open Collaboration

The entire leaderboard infrastructure is open to contributions. Developers can submit new models or datasets through GitHub pull requests, and results update automatically on the Hugging Face Hub.

Authors: Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Koluguri, Piotr Zelasko, Somshubra Majumdar, Adel Moumen, and Sanchit Gandhi
