Stanford and UC Santa Cruz Launch Benchmark for Audio-Language Models

AHELM aggregates a diverse collection of existing datasets, supplemented with new ones, to evaluate audio-language models (ALMs) across ten dimensions: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. According to the authors, this makes it the most “holistic evaluation” framework for ALMs to date.

To fill gaps left by existing resources, the team also created two new synthetic audio-text datasets. PARADE tests whether ALMs reproduce occupational or social stereotypes based on voice characteristics, while CoRe-Bench measures models’ ability to reason over multi-turn conversational audio that reflects diverse demographics and scenarios.

No Single Model Excels Across All Scenarios

The researchers evaluated 14 ALMs from Google (Gemini), OpenAI (GPT-4o Audio), and Alibaba (Qwen), along with three baseline systems that combine automatic speech recognition (ASR) with a text-based large language model (LLM).

They found that “there is no single model that excels across all scenarios.”

Among the ALMs, Gemini 2.5 Pro ranked highest overall, leading in five of the ten categories. It also displayed fairness issues in speech recognition, however, performing differently depending on speaker gender. Open-weight models such as Qwen lagged behind in instruction following, often failing to comply with strict prompts.
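To make the speech recognition fairness finding concrete, the sketch below shows one simple way a performance gap between speaker groups could be measured: compute the word error rate (WER) for each transcript and average it per group. This is an illustration only, not the procedure used in AHELM. The function names, the group labels, and the sample transcripts are all hypothetical.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average WER per group for (group, reference, hypothesis) triples."""
    scores = defaultdict(list)
    for group, reference, hypothesis in samples:
        scores[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


# Hypothetical transcripts, invented purely for illustration.
samples = [
    ("group_a", "the cat sat", "the cat sat"),
    ("group_b", "the cat sat", "the bat"),
]
per_group = mean_wer_by_group(samples)
gap = abs(per_group["group_a"] - per_group["group_b"])
```

A gap in the averages alone does not establish unfairness. In practice a statistical test over many samples would be needed before drawing a conclusion.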

Perhaps most surprising was the strength of the baseline systems. In particular, GPT-4o Transcribe + GPT-4o ranked sixth overall, outperforming nine ALMs despite being built on a relatively simple ASR-plus-LLM pipeline. The researchers attribute this to the superior robustness of dedicated ASR modules in noisy conditions, which remain a challenge for current ALMs.
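The shape of such an ASR-plus-LLM baseline can be sketched in a few lines: one component transcribes the audio, and a text-only model then answers from the transcript. The sketch below is a minimal illustration under stated assumptions, not the AHELM implementation. `make_cascade`, `fake_transcribe`, and `fake_answer` are hypothetical stand-ins, and no real speech or language model API is called.

```python
from typing import Callable


def make_cascade(
    transcribe: Callable[[bytes], str],
    answer: Callable[[str], str],
) -> Callable[[bytes, str], str]:
    """Chain a speech recognizer and a text-only model into one system."""

    def run(audio: bytes, instruction: str) -> str:
        # Step 1: convert the audio to text with the dedicated ASR component.
        transcript = transcribe(audio)
        # Step 2: hand the instruction and the transcript to the text model.
        prompt = f"{instruction}\n\nTranscript:\n{transcript}"
        return answer(prompt)

    return run


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    # Pretends the audio bytes are already the spoken words.
    return audio.decode("utf-8")


def fake_answer(prompt: str) -> str:
    # Pretends to be a text model by echoing the last prompt line in capitals.
    return prompt.splitlines()[-1].upper()


cascade = make_cascade(fake_transcribe, fake_answer)
result = cascade(b"hello world", "Repeat the transcript.")
```

Because the text model only ever sees the transcript, a pipeline like this cannot use non-verbal cues such as tone or background sound. Its strength depends on how well the ASR step copes with difficult audio.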

A ‘Living Benchmark’

The researchers note that “AHELM evaluates ALMs using a standardized set of prompts, scenarios, and metrics,” enabling researchers, developers, and decision-makers to better understand the strengths and weaknesses of different systems.

They also emphasize that AHELM is a “living benchmark,” with plans to expand to more models, scenarios, and metrics over time.

In line with their commitment to transparent and reproducible science, the researchers have released all prompts, raw outputs, and results, along with the code on GitHub.

Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, and Percy Liang
