German Gov-Backed AI Benchmark Tracks Large Language Models in 200 Languages

A new multilingual AI benchmarking initiative backed by the German government aims to advance equitable access to language technologies by highlighting where today’s large language models (LLMs) succeed and where they fall short.

Jointly developed by the German Federal Ministry for Economic Cooperation and Development (BMZ), development agency GIZ, and the German Research Center for Artificial Intelligence (DFKI), the AI Language Proficiency Monitor tracks the performance of LLMs across up to 200 languages, with a strong focus on low-resource languages.

Hosted on Hugging Face, the public dashboard and leaderboard update automatically every day and benchmark both open-source and commercial models.

It includes model- and language-level scores across a range of tasks — AI translation, question answering, math, classification, and factuality — using popular multilingual datasets such as FLORES+, MMLU, GSM8K, ARC, and TruthfulQA.
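As a rough illustration of how per-task results can roll up into model-level and language-level scores, here is a minimal Python sketch. It assumes a plain unweighted mean over scores in the range 0 to 1; the article does not say how the monitor actually aggregates its results, and the model names, language codes, and values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scores in [0, 1], keyed by (model, language, task).
# None of these values come from the actual leaderboard.
scores = {
    ("model-a", "swh", "translation"): 0.40,
    ("model-a", "swh", "question_answering"): 0.60,
    ("model-a", "deu", "translation"): 0.80,
    ("model-a", "deu", "question_answering"): 0.90,
    ("model-b", "swh", "translation"): 0.20,
    ("model-b", "swh", "question_answering"): 0.30,
    ("model-b", "deu", "translation"): 0.70,
    ("model-b", "deu", "question_answering"): 0.80,
}


def aggregate(scores, key_index):
    """Average all scores that share the key component at key_index.

    key_index 0 groups by model, 1 by language, 2 by task.
    An unweighted mean is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for key, value in scores.items():
        groups[key[key_index]].append(value)
    return {name: mean(values) for name, values in groups.items()}


model_scores = aggregate(scores, 0)      # one score per model
language_scores = aggregate(scores, 1)   # one score per language

for name, value in sorted(language_scores.items()):
    print(f"{name}: {value:.3f}")
```

In this toy data, the low-resource language code ends up with a visibly lower language-level score than the high-resource one, which is the kind of gap the dashboard is meant to surface.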

“We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance,” said researchers David Pomerenke (BMZ), Jonas Nothnagel (GIZ), and Simon Ostermann (DFKI).

The platform also features a global map of language proficiency, filters for cost-effectiveness, and comparisons by language family or model type. 

The goal: to benchmark systems, support model selection for specific languages and tasks, and guide public investment in digital language infrastructure.

“We aim to provide impact beyond academia by supporting both private- and public-sector decision-making, with a particular emphasis on underserved communities,” the researchers noted.

Still a Prototype, Community Contributions Welcome

The researchers acknowledge that the monitor is still a prototype under active development. While it currently covers major models and widely spoken languages, the analysis remains “far from comprehensive,” sampling just 10 examples per model-task-language combination.
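To see why 10 examples per combination is a thin basis for comparison, consider the statistical uncertainty around an accuracy estimate. The sketch below uses the Wilson score interval, a standard method for binomial proportions; it is an illustrative choice and not something the article attributes to the monitor. It also applies only to right-or-wrong tasks such as question answering, not to continuous translation metrics. The function name and the 7-out-of-10 example are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval for a binomial proportion.

    successes: number of correct answers
    n: number of examples evaluated
    z: normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed accuracy (70%), very different certainty.
small_low, small_high = wilson_interval(7, 10)
large_low, large_high = wilson_interval(700, 1000)

print(f"n=10:   {small_low:.2f} to {small_high:.2f}")
print(f"n=1000: {large_low:.2f} to {large_high:.2f}")
```

With 7 correct answers out of 10, the interval spans roughly 0.40 to 0.89, so two models whose scores differ by 10 or 20 points on a single language and task cannot be reliably ranked from such a sample. Averaging across many tasks and languages narrows the uncertainty for aggregate scores.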

Their initial goal was to demonstrate the feasibility of their approach before exhaustively benchmarking every model, with plans to scale up coverage in future iterations as more compute and community support become available.

To assess real-world usefulness, the team also gathered qualitative feedback from stakeholders across four sectors: industry, SMEs, the public sector, and NGOs. The stakeholders praised the monitor’s usability, broad language coverage, and focus on non-European languages. However, they noted that reliance on academic benchmarks may limit real-world relevance, and they called for more tasks and for insights into cross-lingual trends, both areas the team plans to expand on.

The benchmark is open source, and the researchers encourage developers to contribute. “We invite the broader community to contribute new models, tasks, and languages to further strengthen this shared resource,” they concluded.