Large Language Models Struggle to Evaluate Long AI Translations, Amazon Finds

A new study from Amazon has revealed a limitation in using large language models (LLMs) to evaluate AI translation quality: performance drops as input length increases.

While LLMs are increasingly used for high-quality sentence-level AI translation evaluation, the study finds that these models become “less reliable when evaluating long-form translation outputs.” 

Amazon researchers Tobias Domhan and Dawei Zhu define long-form as longer spans of text — such as paragraphs, full documents, or even batches of documents — as opposed to short-form, which refers to a single sentence or a few sentences.

“Ideally, evaluation should be invariant to text length,” they noted. “However, our analysis shows that text length significantly impacts evaluation.”

They found that when full documents or multi-document inputs were evaluated as a whole, LLMs detected fewer translation errors, missing many issues that were caught when segments were evaluated separately. Additionally, they observed a drop in the LLMs’ ability to accurately rank AI translation systems, undermining their reliability for benchmarking.

More Refined Approaches

To address this limitation and reliably assess long-form translation, “more refined approaches” are necessary, according to the researchers, who proposed several prompting and fine-tuning strategies:

  • Granularity-aligned prompting — selects examples for in-context learning that match the length of the input being evaluated.
  • Focus Sentence Prompting (FSP) — presents the LLM with the full source and translation, but instructs it to evaluate one sentence at a time.
  • Fine-tuning — adapts the LLM using custom training data aligned to evaluation tasks.

Domhan and Zhu found that “the latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.” 

While granularity-aligned prompting offers some improvement, it still falls short on longer inputs — suggesting that adjusting example size alone is not enough to overcome the length bias.
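As a rough illustration of the idea behind granularity-aligned prompting, the sketch below picks in-context examples whose length is closest to the input being evaluated. It is not the researchers' implementation: the function name `select_examples`, the example pool, and the use of whitespace word counts as the length measure are all assumptions made for illustration.

```python
def select_examples(pool, input_text, k=2):
    """Return the k pool examples whose source length is closest to the input's.

    Length is measured in whitespace-separated words, which is an assumption;
    a real setup might count tokens or sentences instead.
    """
    target = len(input_text.split())
    ranked = sorted(pool, key=lambda ex: abs(len(ex["source"].split()) - target))
    return ranked[:k]


# Hypothetical example pool at three granularities.
pool = [
    {"id": "short", "source": "Good morning."},
    {"id": "sentence",
     "source": "The meeting was moved to Friday because the room was unavailable."},
    {"id": "paragraph",
     "source": " ".join(["This is sentence number one."] * 10)},
]

chosen = select_examples(pool, "We will ship the order next week if stock arrives.", k=1)
print(chosen[0]["id"])
```

A sentence-length input pulls in the sentence-length example, while a document-length input would pull in the paragraph-length one.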

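In contrast, FSP yields better results. By keeping the full text in view while directing the model to one sentence at a time, it preserves context and encourages consistent, fine-grained error detection. It improves both error span detection and system ranking accuracy, though it comes with increased inference costs, since the model sees the full source and translation for every sentence it evaluates.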
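To make the mechanics of FSP concrete, here is a minimal sketch of how such prompts might be assembled. The prompt wording, the function name `build_fsp_prompts`, and the assumption that source and translation are already split into one-to-one aligned sentences are all hypothetical; the researchers' actual prompt is not reproduced here.

```python
def build_fsp_prompts(source_sentences, translation_sentences):
    """Build one prompt per sentence; each prompt carries the full text as context.

    Assumes source and translation sentences are aligned one-to-one.
    """
    if len(source_sentences) != len(translation_sentences):
        raise ValueError("source and translation must have the same number of sentences")

    full_source = " ".join(source_sentences)
    full_translation = " ".join(translation_sentences)
    total = len(source_sentences)

    prompts = []
    for i, (src, tgt) in enumerate(zip(source_sentences, translation_sentences), start=1):
        prompts.append(
            f"Full source:\n{full_source}\n\n"
            f"Full translation:\n{full_translation}\n\n"
            f"Focus sentence {i} of {total}.\n"
            f"Source sentence: {src}\n"
            f"Translated sentence: {tgt}\n"
            "List translation errors in the focus sentence only."
        )
    return prompts


prompts = build_fsp_prompts(
    ["Das Paket ist angekommen.", "Es war beschädigt."],
    ["The package has arrived.", "It was damaged."],
)
print(len(prompts))
```

Under this sketch, a document with N sentences produces N prompts, each containing the whole document, so the number of tokens processed grows much faster than with a single document-level prompt.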

Fine-tuning proves to be the most effective approach. It significantly improves performance across all input text lengths — even with a small amount of training data.

Domhan and Zhu recommend using FSP as a strong baseline for long-form evaluation when using off-the-shelf models and advocate fine-tuning where possible for more robust, production-ready evaluation workflows.