The project brought together more than 80 contributors from academia, industry, and the public sector to create evaluation tasks directly in Italian, rather than translating existing English benchmarks. The researchers argue that translated benchmarks often miss issues around agreement, register, and context that matter in deployment.
CALAMITA spans 22 challenge areas and nearly 100 subtasks, designed to test different language abilities. The benchmark covers linguistic competence, commonsense reasoning, factual consistency, formal reasoning, fairness, bias, code generation, summarization, and AI translation.
Importantly, the researchers stress that CALAMITA is not about naming a single “best” model, but about showing what credible, language-specific evaluation should look like.
A key deliverable of the initiative is a “centralized evaluation pipeline,” built to support different dataset formats and task-specific metrics. “Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics,” the researchers said.
CALAMITA is also designed to evolve over time, with new tasks and models added as language needs and AI systems change. The researchers describe it as both a “resource” and “a framework for sustainable, community driven evaluation.”
“We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices,” they added.
Putting AI Translation to the Test
CALAMITA explicitly includes AI translation, in both the Italian–English and English–Italian directions, among its challenge areas.
The researchers put translation to the test in two ways: one set of tasks focuses on standard bidirectional AI translation, while another evaluates translation under gender-fair and inclusive language constraints, reflecting requirements that increasingly appear in real-world localization workflows.
The results confirm that “LLMs are the state-of-the-art approach” in AI translation and highlight the “superiority” of larger models.
The researchers note, however, that the models evaluated are not the most recent. They were selected for their Italian competence and “to demonstrate the benchmark’s structure and interpretative potential.” Future iterations of CALAMITA are expected to expand coverage to newer models, potentially including closed-source ones, and to support more fine-grained analyses of translation phenomena.
“As LLMs continue to improve, so too must our evaluation frameworks evolve toward methods capable of capturing both general translation performance and more nuanced linguistic and contextual behaviors,” the researchers said.
More broadly, the researchers note that “CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement.”
Following strong community interest in its first edition, they aim to position CALAMITA as a permanent fixture in the Italian NLP landscape, supporting long-term benchmarking and continued community involvement.
Authors: Malvina Nissim, Danilo Croce, Viviana Patti et al.