It includes model- and language-level scores across a range of tasks — AI translation, question answering, math, classification, and factuality — using popular multilingual datasets such as FLORES+, MMLU, GSM8K, ARC, and TruthfulQA.
“We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance,” said researchers David Pomerenke (BMZ), Jonas Nothnagel (GIZ), and Simon Ostermann (DFKI).
The platform also features a global map of language proficiency, filters for cost-effectiveness, and comparisons by language family or model type.
The goal: to benchmark systems, support model selection for specific languages and tasks, and guide public investment in digital language infrastructure.
“We aim to provide impact beyond academia by supporting both private- and public-sector decision-making, with a particular emphasis on underserved communities,” the researchers noted.
They acknowledge that the monitor is still a prototype under active development. While it currently covers major models and widely spoken languages, the analysis remains “far from comprehensive,” sampling just 10 examples per model-task-language combination.
Their initial goal was to demonstrate the feasibility of their approach before exhaustively benchmarking every model, with plans to scale up coverage in future iterations as more compute and community support become available.
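To give a rough sense of why 10 examples per combination leaves the analysis “far from comprehensive,” the sketch below estimates the statistical noise on a single score and the number of evaluations a full grid would need. It is an illustration only, not the researchers' code: it assumes each example is scored as simply right or wrong, and the function names and the grid sizes in the usage lines are hypothetical.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent 0/1 scores.

    Assumes binary (right/wrong) scoring with true success rate p.
    """
    return math.sqrt(p * (1 - p) / n)


def evaluation_count(models: int, tasks: int, languages: int, samples: int = 10) -> int:
    """Total model calls needed to cover every model-task-language combination."""
    return models * tasks * languages * samples


# Noise on one score at the worst case p = 0.5, for 10 vs. 1,000 examples.
print(round(accuracy_standard_error(0.5, 10), 3))
print(round(accuracy_standard_error(0.5, 1000), 3))

# Hypothetical grid: 40 models, 5 tasks, 100 languages, 10 examples each.
print(evaluation_count(40, 5, 100))
```

Under these assumptions, a score based on 10 examples carries a standard error of roughly 16 percentage points, so small gaps between models in a given language may not be meaningful. Because cost grows with the product of models, tasks, languages, and samples, raising the sample size quickly multiplies the compute required.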
To assess real-world usefulness, the team also gathered qualitative feedback from stakeholders across four sectors: industry, SMEs, the public sector, and NGOs. Respondents praised the monitor's usability, broad language coverage, and focus on non-European languages. However, they noted that reliance on academic benchmarks may limit real-world relevance, and they called for more tasks and cross-lingual trend insights, areas the team plans to expand on.
The benchmark is open source, and the researchers encourage outside contributions. “We invite the broader community to contribute new models, tasks, and languages to further strengthen this shared resource,” they concluded.