Italian Benchmark Evaluates Large Language Models, Includes AI Translation

Large language models (LLMs) have driven rapid progress in natural language processing (NLP), including AI translation. Yet most benchmarks used to evaluate these systems remain heavily English-focused. When other languages are tested, the benchmarks often rely on translated or synthetic data, and so fail to capture the real linguistic challenges models face beyond English.

A December 4, 2025, paper introduces CALAMITA (“Challenging the Abilities of Large Language Models in Italian: a Community Initiative”), a large-scale, community-driven initiative designed to address this gap by evaluating LLMs using Italian-native tasks. The initiative also aims to provide a replicable model for more linguistically grounded AI evaluation.

Coordinated under the Italian Association for Computational Linguistics (AILC), CALAMITA was conceived as a long-term evaluation effort, not a one-off leaderboard exercise.

The project brought together more than 80 contributors from academia, industry, and the public sector to create evaluation tasks directly in Italian, rather than translating existing English benchmarks. The researchers argue that translated benchmarks often miss issues around agreement, register, and context that matter in deployment.

CALAMITA spans 22 challenge areas and nearly 100 subtasks, designed to test different language abilities. The benchmark covers linguistic competence, commonsense reasoning, factual consistency, formal reasoning, fairness, bias, code generation, summarization, and AI translation.

Importantly, the researchers stress that CALAMITA is not about naming a single “best” model, but about showing what credible, language-specific evaluation should look like.

A key deliverable of the initiative is a “centralized evaluation pipeline,” built to support different dataset formats and task-specific metrics. “Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics,” the researchers said.
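To make the idea of a pipeline that handles heterogeneous tasks and metrics more concrete, here is a minimal Python sketch. It is not CALAMITA's actual code: the task names (`qa_it`, `translation_it_en`), the two metric functions, and the `evaluate` helper are all hypothetical, chosen only to illustrate how one entry point can route each task to its own metric.

```python
from collections import Counter
from typing import Callable, Dict, List

Metric = Callable[[List[str], List[str]], float]


def exact_match(preds: List[str], refs: List[str]) -> float:
    """Share of predictions equal to the reference, ignoring case and outer spaces."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / len(refs)


def token_f1(preds: List[str], refs: List[str]) -> float:
    """Mean F1 over whitespace tokens, a simple stand-in for a real translation metric."""
    scores = []
    for pred, ref in zip(preds, refs):
        pred_tokens = Counter(pred.lower().split())
        ref_tokens = Counter(ref.lower().split())
        overlap = sum((pred_tokens & ref_tokens).values())
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / sum(pred_tokens.values())
        recall = overlap / sum(ref_tokens.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


# Hypothetical registry: each task name maps to its task-specific metric.
METRICS: Dict[str, Metric] = {
    "qa_it": exact_match,
    "translation_it_en": token_f1,
}


def evaluate(task: str, preds: List[str], refs: List[str]) -> float:
    """Single entry point that scores any registered task with its own metric."""
    if task not in METRICS:
        raise KeyError(f"unknown task: {task}")
    if not refs or len(preds) != len(refs):
        raise ValueError("predictions and references must be non-empty and aligned")
    return METRICS[task](preds, refs)


if __name__ == "__main__":
    print(evaluate("qa_it", ["Roma", "milano"], ["roma", "Torino"]))
    print(evaluate("translation_it_en", ["the cat"], ["the cat sleeps"]))
```

In a design like this, adding a new task means registering one more metric function, which is one plausible way a shared pipeline could accommodate contributions from many groups.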

CALAMITA is also designed to evolve over time, with new tasks and models added as language needs and AI systems change. The researchers describe it as both a “resource” and “a framework for sustainable, community driven evaluation.”

“We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices,” they added.

Putting AI Translation to the Test

CALAMITA explicitly includes AI translation, in both the Italian–English and English–Italian directions, among its challenge areas.

The researchers put translation to the test in two ways: one set of tasks focuses on standard bidirectional AI translation, while another evaluates translation under gender-fair and inclusive language constraints, reflecting requirements that increasingly appear in real-world localization workflows.

The results confirm that “LLMs are the state-of-the-art approach” in AI translation and highlight the “superiority” of larger models.

The researchers note, however, that the models evaluated are not the most recent. They were selected for their Italian competence and “to demonstrate the benchmark’s structure and interpretative potential.” Future iterations of CALAMITA are expected to expand coverage to newer models, potentially including closed-source ones, and to support more fine-grained analyses of translation phenomena.

“As LLMs continue to improve, so too must our evaluation frameworks evolve toward methods capable of capturing both general translation performance and more nuanced linguistic and contextual behaviors,” the researchers said.

More broadly, the researchers note that “CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement.”

Following strong community interest in its first edition, they aim to position CALAMITA as a permanent fixture in the Italian NLP landscape, supporting long-term benchmarking and continued community involvement.

Authors: Malvina Nissim, Danilo Croce, Viviana Patti et al.
