Large Language Models Struggle to Evaluate Long AI Translations, Amazon Finds

“Ideally, evaluation should be invariant to text length,” the researchers noted. “However, our analysis shows that text length significantly impacts evaluation.”

They found that when full documents or multi-document inputs were evaluated as a whole, LLMs detected fewer translation errors, missing many issues that were caught when segments were evaluated separately. Additionally, they observed a drop in the LLMs’ ability to accurately rank AI translation systems, undermining their reliability for benchmarking.

More Refined Approaches

To address this limitation and reliably assess long-form translation, “more refined approaches” are necessary, according to the researchers, who proposed several prompting and fine-tuning strategies:

Granularity-aligned prompting — selects examples for in-context learning that match the length of the input being evaluated.
Focus Sentence Prompting (FSP) — presents the LLM with the full source and translation, but instructs it to evaluate one sentence at a time.
Fine-tuning — adapts the LLM using custom training data aligned to evaluation tasks.

Domhan and Zhu found that “the latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.”

While granularity-aligned prompting offers some improvement, it still falls short on longer inputs — suggesting that adjusting example size alone is not enough to overcome the length bias.
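
As a rough illustration of granularity-aligned prompting, the sketch below picks in-context examples whose source length is closest to that of the input being evaluated. The `Example` tuple layout, the function name `select_length_matched_examples`, and the use of whitespace-separated word counts as the length measure are all assumptions made for illustration; the article does not say how the researchers measured length or chose examples.

```python
from typing import List, Tuple

# Hypothetical example record: (source text, translation, annotated errors).
Example = Tuple[str, str, str]


def word_count(text: str) -> int:
    """Crude length measure (an assumption): whitespace-separated tokens."""
    return len(text.split())


def select_length_matched_examples(pool: List[Example],
                                   input_source: str,
                                   k: int = 2) -> List[Example]:
    """Return the k examples whose source length is closest to the input's."""
    target_len = word_count(input_source)
    # sorted() is stable, so ties keep their original pool order.
    ranked = sorted(pool, key=lambda ex: abs(word_count(ex[0]) - target_len))
    return ranked[:k]
```

With a pool that mixes sentence-level and document-level examples, a single-sentence input would be paired with the short examples and a full document with the long ones.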

In contrast, FSP yields better results. Because the model still sees the full source and translation, context is preserved, while the sentence-level focus encourages consistent, fine-grained error detection. FSP improves both error span detection and system ranking accuracy, though at a higher inference cost.
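
To make the idea concrete, here is a minimal sketch of how FSP-style prompts might be assembled: one prompt per sentence, each carrying the full source and translation. The function name `build_fsp_prompts` and the prompt wording are hypothetical, and the sketch assumes the document is already split into sentences that align one-to-one; the researchers' actual prompt template is not given in the article.

```python
from typing import List


def build_fsp_prompts(source_sentences: List[str],
                      translated_sentences: List[str]) -> List[str]:
    """Build one evaluation prompt per sentence, each with the full document.

    Assumes source and translation are split into aligned sentences
    (a simplification; real documents may not align one-to-one).
    """
    if len(source_sentences) != len(translated_sentences):
        raise ValueError("source and translation must have the same number of sentences")

    full_source = " ".join(source_sentences)
    full_translation = " ".join(translated_sentences)

    prompts = []
    for src, tgt in zip(source_sentences, translated_sentences):
        # Hypothetical wording: full context first, then the focus sentence.
        prompts.append(
            f"Source document:\n{full_source}\n\n"
            f"Translated document:\n{full_translation}\n\n"
            "Evaluate ONLY the following sentence pair, using the documents "
            "above as context, and list any translation errors.\n"
            f"Focus source sentence: {src}\n"
            f"Focus translated sentence: {tgt}\n"
        )
    return prompts
```

Each sentence gets its own prompt and every prompt repeats the whole document, so the total prompt size grows much faster than the document itself. That is one way to see where the extra inference cost comes from.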

Fine-tuning proves to be the most effective approach. It significantly improves performance across all input text lengths — even with a small amount of training data.

Domhan and Zhu recommend using FSP as a strong baseline for long-form evaluation when using off-the-shelf models and advocate fine-tuning where possible for more robust, production-ready evaluation workflows.
