Overestimation in AI Translation Evaluation

A study by researchers from Google and Boston University, presented in July at the 42nd International Conference on Machine Learning (ICML) in Vancouver, has found that even small amounts of contamination in training data can lead to significant overestimation of AI translation performance in large language models (LLMs).

Data contamination refers to the accidental inclusion of evaluation examples — either partially or fully — in pre-training data. This affects the evaluation results on widely used benchmarks and undermines their validity, as models are no longer tested on truly unseen data.
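To make the distinction between partial and full contamination concrete, here is a minimal Python sketch that labels a single test pair by which of its sides appears in a set of training documents. It is an illustration only, not the researchers' procedure: the function names, the exact-substring matching, and the label strings are all assumptions made for this example.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contamination_type(source, target, training_docs):
    """Label a test pair by which of its sides occur verbatim in the training documents.

    Hypothetical helper: returns "both", "source_only", "target_only", or "none".
    """
    src, tgt = normalize(source), normalize(target)
    docs = [normalize(doc) for doc in training_docs]
    src_seen = any(src in doc for doc in docs)
    tgt_seen = any(tgt in doc for doc in docs)
    if src_seen and tgt_seen:
        return "both"
    if src_seen:
        return "source_only"
    if tgt_seen:
        return "target_only"
    return "none"


if __name__ == "__main__":
    docs = [
        "The cat sat on the mat. Die Katze sitzt auf der Matte.",
        "Some unrelated web text.",
    ]
    print(contamination_type("The cat sat on the mat.", "Die Katze sitzt auf der Matte.", docs))
    print(contamination_type("The cat sat on the mat.", "Le chat est sur le tapis.", docs))
```

Exact matching after normalization is the simplest possible check. Real contamination audits often use n-gram overlap instead, since test examples can appear with small edits.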

Using 1B and 8B parameter models trained on multilingual data, the researchers found that when both source and target sides of test examples were included in the training data, BLEU scores could be inflated by up to 30 points in larger (8B) models. Smaller models showed more modest gains, with performance overestimation roughly 2.5 times lower. “Larger models exhibit increased sensitivity to even a single copy of contamination,” they noted.

Partial contamination, such as including only the source or only the target text, had inconsistent and generally limited impact. “Contamination involving only one side of the parallel data appears to be less critical,” the researchers said.

They also emphasized that the timing of contamination matters. When contaminated examples appeared early in training, models showed a sharp performance spike that faded over time. In contrast, contamination introduced later had a more lasting impact. Most notably, when contamination was spread evenly throughout training, a scenario that reflects how contamination typically occurs in practice, the inflation in BLEU scores was both stronger and more persistent.
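The three timing scenarios can be pictured as different schedules for when copies of a contaminated example enter the training stream. The Python sketch below is a hypothetical illustration under simple assumptions (a fixed number of training steps, one copy per step); the function name, parameters, and schedule labels are invented for this example and do not describe the study's actual setup.

```python
def contamination_steps(total_steps, n_copies, schedule):
    """Return the training steps at which a contaminated copy is injected.

    Hypothetical helper. schedule is one of:
      "early"   - all copies at the start of training
      "late"    - all copies at the end of training
      "uniform" - copies spread evenly across training
    """
    if n_copies <= 0 or n_copies > total_steps:
        raise ValueError("n_copies must be between 1 and total_steps")
    if schedule == "early":
        return list(range(n_copies))
    if schedule == "late":
        return list(range(total_steps - n_copies, total_steps))
    if schedule == "uniform":
        stride = total_steps / n_copies
        return [int(i * stride) for i in range(n_copies)]
    raise ValueError(f"unknown schedule: {schedule}")


if __name__ == "__main__":
    for name in ("early", "late", "uniform"):
        print(name, contamination_steps(100, 4, name))
```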

The impact of contamination was also uneven across languages. The researchers found no significant performance boost for languages absent from the pretraining data, suggesting that some level of language representation is necessary for contamination to have an effect. Additionally, contamination had a greater impact on the En→X translation direction than on X→En.

These findings add to a growing body of research calling into question the reliability of AI translation benchmarks. As previously reported by Slator, Google has flagged data quality issues in multilingual speech datasets, highlighted the limitations of single-metric evaluation, and called for better multilingual LLM evaluation strategies.

“This work sheds light on the nuanced ways in which data contamination affects model performance, and underscores the need for more reliable evaluation practices in large language model development,” the researchers concluded.