The Problem with Evaluating State-of-the-Art AI Translation

In a paper titled “Estimating Machine Translation Difficulty” (August 13, 2025), researchers Lorenzo Proietti, Stefano Perrella, and Roberto Navigli from Sapienza University (Sapienza NLP Group), along with Vilém Zouhar from ETH Zurich and Tom Kocmi from Cohere, highlighted a problem in evaluating state-of-the-art AI translation systems.

They explained that leading AI translation systems receive “near-perfect scores” on widely used benchmarks such as the WMT shared tasks, performing “close to human level.” At WMT 2024, for example, top systems produced translations that human evaluators rated at 90 to 100 almost across the board.

According to the researchers, this is because current test sets are simply “too easy” for today’s models.

While impressive on paper, this ceiling effect creates a challenge: if all systems look equally good, it becomes increasingly difficult for researchers to track progress and for enterprise buyers to make informed choices between vendors.

The industry risks assuming AI translation is solved when, in fact, weaknesses remain hidden by overly easy test sets.

Creating More Challenging Benchmarks

To address this, the researchers propose creating more discriminative benchmarks by automatically selecting harder samples.

Instead of only judging translations after the fact, they suggest predicting in advance how difficult a text will be to translate.

They formalize this as a new task called translation difficulty estimation, defining a text’s difficulty by the expected quality of its translations (the lower the expected quality, the higher the difficulty). They pair the task with a dedicated metric, difficulty estimation correlation (DEC), which measures how well an estimator ranks texts by difficulty compared to human judgments.
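
To make the idea of ranking agreement concrete, here is a minimal sketch in Python. It is not the paper's DEC definition, which this article does not spell out. It assumes that agreement can be approximated with Kendall's tau between an estimator's predicted difficulty and human quality scores, negated so that higher means harder. The function name and all numbers are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    This is a stand-in for DEC, not the paper's exact formula.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: one predicted difficulty and one human quality
# score (0 to 100) per source text.
predicted_difficulty = [0.9, 0.2, 0.6, 0.4]
human_quality = [55.0, 95.0, 70.0, 88.0]

# Lower human quality means a harder text, so negate before comparing.
human_difficulty = [-q for q in human_quality]
agreement = kendall_tau(predicted_difficulty, human_difficulty)
print(agreement)
```

A value near 1 would mean the estimator orders texts much as human judgments do, a value near 0 would mean no relationship, and a negative value would mean the ordering is reversed.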

In practice, this means building test sets around texts that are genuinely challenging for AI translation systems, rather than confirming strengths on simpler cases. By identifying samples where AI translation models still struggle, it is possible to “expose their shortcomings and guide improvements in future iterations,” the researchers explained.
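
As a rough illustration of selecting harder samples, the sketch below keeps the texts with the lowest predicted translation quality. It assumes some estimator has already produced one expected-quality score per text. The function name, example sentences, and scores are all hypothetical and do not come from the paper.

```python
def select_hardest(texts, expected_quality, k):
    """Return the k texts with the lowest expected translation quality."""
    if len(texts) != len(expected_quality):
        raise ValueError("texts and expected_quality must have the same length")
    # Sort ascending by expected quality: lowest quality means hardest.
    ranked = sorted(zip(expected_quality, texts), key=lambda pair: pair[0])
    return [text for _, text in ranked[:k]]


# Hypothetical candidate pool with made-up expected-quality scores.
candidate_texts = [
    "Hello, how are you?",
    "He kicked the bucket after the deal went south.",
    "The meeting starts at noon.",
    "The party of the first part shall indemnify the party of the second part.",
]
predicted_quality = [96.0, 71.5, 88.0, 64.0]

hard_subset = select_hardest(candidate_texts, predicted_quality, k=2)
for text in hard_subset:
    print(text)
```

A real benchmark would likely also balance domains and lengths rather than take the bottom of a single ranking, but that is outside what this sketch shows.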

Which Difficulty Estimators Work Best?

They compared four types of difficulty estimators:

Heuristics: such as sentence length, word rarity, and syntactic complexity.
Learned models: trained directly to predict difficulty, including their own Sentinel-src series.
LLM-as-a-Judge methods: large language models (LLMs) like GPT-4o or Cohere’s Command A prompted to score difficulty.
Crowd-based approaches: which generate translations from several models and score them with reference-less metrics like XCOMET or MetricX. This approach is discussed in the paper “How to Select Datapoints for Efficient Human Evaluation of NLG Models?”
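
For a sense of what the simplest of these estimators look like, here is a sketch of two heuristics: sentence length and word rarity. It assumes whitespace tokenization and a tiny, hypothetical reference corpus for word frequencies. The paper's actual heuristic implementations may differ.

```python
import math
from collections import Counter


def length_difficulty(text):
    """Length heuristic: more whitespace-separated tokens means harder."""
    return len(text.split())


def rarity_difficulty(text, corpus_counts, corpus_size):
    """Word-rarity heuristic: mean negative log frequency of the words.

    Uses add-one smoothing so unseen words get a finite, high score.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    denominator = corpus_size + len(corpus_counts) + 1
    surprisal = [
        -math.log((corpus_counts[word] + 1) / denominator) for word in words
    ]
    return sum(surprisal) / len(words)


# Hypothetical reference corpus, far too small for real use.
reference_tokens = "the cat sat on the mat the dog sat".split()
counts = Counter(reference_tokens)
size = len(reference_tokens)

common = rarity_difficulty("the cat", counts, size)
rare = rarity_difficulty("the zebra", counts, size)
print(length_difficulty("the cat sat on the mat"), common, rare)
```

Neither score looks at meaning, idiom, or ambiguity, which is one plausible reason such heuristics can miss what actually makes a text hard to translate.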

The researchers found that LLMs such as OpenAI’s GPT-4o and Cohere’s Command A, when used as “judges,” performed poorly, in some cases even worse than simple length-based heuristics.

Traditional heuristics, including word rarity and syntactic complexity, also proved weak, failing to capture the nuances of translation difficulty.

Crowd methods delivered stronger results but are computationally expensive and not practical for everyday use.
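
The cost of crowd methods is easier to see in a sketch. The example below assumes each source text has already been translated by several systems and that each translation has already been scored by a reference-less metric. The system names and scores are hypothetical, and no real metric or model is called here.

```python
from statistics import mean


def crowd_expected_quality(scores_by_system):
    """Average the per-text quality scores across all systems.

    scores_by_system maps a system name to a list of scores, one per
    source text, in the same text order for every system.
    A lower average suggests a harder source text.
    """
    score_lists = list(scores_by_system.values())
    if not score_lists:
        raise ValueError("need scores from at least one system")
    if len({len(scores) for scores in score_lists}) != 1:
        raise ValueError("every system must score the same number of texts")
    # zip(*...) groups the scores that belong to the same source text.
    return [mean(per_text) for per_text in zip(*score_lists)]


# Hypothetical scores for three source texts from three systems.
scores = {
    "system_a": [92.0, 60.0, 85.0],
    "system_b": [95.0, 55.0, 80.0],
    "system_c": [89.0, 65.0, 90.0],
}
expected_quality = crowd_expected_quality(scores)
print(expected_quality)
```

Producing the inputs is the expensive part: every candidate text needs one translation and one metric call per system before a single difficulty estimate is available.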

By contrast, the learned models consistently delivered the best results. The Sentinel-src family, particularly the newly released Sentinel-src-24 and Sentinel-src-25, was able to identify difficult texts with accuracy comparable to crowd-based approaches, but without heavy resource requirements, making them more viable for widespread adoption.

Another notable finding is that humans and machines often disagree on what counts as “difficult” and do not struggle with the same texts, underscoring the importance of designing test sets that reflect both human and machine challenges.

The researchers have publicly released two models, Sentinel-src-24 and Sentinel-src-25, on Hugging Face, making them available to scan large corpora and identify the texts most likely to expose weaknesses in AI translation.
