A study by researchers from Google and Boston University, presented in July at the 42nd International Conference on Machine Learning (ICML) in Vancouver, has found that even small amounts of contamination in training data can lead to significant overestimation of AI translation performance in large language models (LLMs).
Data contamination refers to the accidental inclusion of evaluation examples, either partially or fully, in pre-training data. It distorts results on widely used benchmarks and undermines their validity, because models are no longer tested on truly unseen data.
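To make the definition concrete, here is a minimal sketch of how one might flag a contaminated test pair. It is not the researchers' method. The function names, the substring-matching rule, and the reading of "partial" as "only one side of the pair appears in the training text" are all assumptions made for illustration.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contamination_level(source: str, target: str, corpus: Sequence[str]) -> str:
    """Classify one test pair as 'full', 'partial', or 'none' (hypothetical labels).

    Assumption: 'full' means both the source and the target sentence occur
    verbatim somewhere in the training documents; 'partial' means only one does.
    """
    src, tgt = normalize(source), normalize(target)
    src_seen = any(src in normalize(doc) for doc in corpus)
    tgt_seen = any(tgt in normalize(doc) for doc in corpus)
    if src_seen and tgt_seen:
        return "full"
    if src_seen or tgt_seen:
        return "partial"
    return "none"
```

Exact substring matching only catches verbatim copies; a real audit would likely also need fuzzy or n-gram matching, which this sketch leaves out.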
Using 1B- and 8B-parameter models trained on multilingual data, the researchers found that when both the source and target sides of test examples were included in the training data, BLEU scores could be inflated by up to 30 points in the larger (8B) models. Smaller models showed more modest gains, with performance overestimation roughly 2.5 times smaller. “Larger models exhibit increased sensitivity to even a single copy of contamination,” they noted.

