MetricX-24: The Google Submission to WMT24

In an October 4, 2024 paper, Google introduced MetricX-24, its submission to the WMT 2024 Metrics Shared Task. WMT is an annual conference that focuses on machine translation (MT) research and evaluation.

Building on last year’s MetricX-23, the new metric features a hybrid approach that allows it to perform both traditional reference-based and reference-free evaluations (i.e., quality estimation). This design offers greater flexibility for MT evaluation, making it more adaptable to real-world scenarios where reference translations may be incomplete or unavailable.

According to Google, MetricX-24 is “a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both.”

Dan Deutsch, a Research Scientist at Google Translate, highlighted the importance of developing state-of-the-art evaluation metrics like MetricX-24 in a post on X. He mentioned that this is just one of the many challenges the Google Translate research team is working on, including advancements in both MT and large language models (LLMs). Deutsch also announced he’s hiring for a Research Scientist role at Google Translate.

MetricX-24 is trained using a combination of Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) ratings from previous WMT datasets. The researchers noted, “it is possible to effectively train a metric on a mixture of MQM and DA ratings.”

Additionally, the training process includes examples in three different formats: (1) the source text alongside its translated version, (2) the translated version together with a reference translation, and (3) the source text combined with both the translated version and the reference translation. This mixed approach exposes the metric to inputs with and without a reference, allowing it to handle situations where references are of low quality or missing. As the researchers explained, this design “allows the model an opportunity to learn how much weight to put on the source and the reference in different scenarios, or possibly to completely ignore the reference when it is of low quality.”
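The three input formats can be pictured with a minimal Python sketch. The function names and the "source:", "candidate:", and "reference:" field labels are assumptions made for illustration; the article does not specify how the actual MetricX-24 inputs are serialized.

```python
from typing import List, Optional


def build_input(candidate: str,
                source: Optional[str] = None,
                reference: Optional[str] = None) -> str:
    """Serialize one example (hypothetical format, not confirmed by the paper).

    At least one of `source` or `reference` must be given, mirroring the
    three formats described above.
    """
    if source is None and reference is None:
        raise ValueError("need a source, a reference, or both")
    parts = []
    if source is not None:
        parts.append(f"source: {source}")
    parts.append(f"candidate: {candidate}")
    if reference is not None:
        parts.append(f"reference: {reference}")
    return " ".join(parts)


def training_variants(source: str, candidate: str, reference: str) -> List[str]:
    """Return the three views of a single rated translation."""
    return [
        build_input(candidate, source=source),                       # (1) reference-free
        build_input(candidate, reference=reference),                 # (2) reference-only
        build_input(candidate, source=source, reference=reference),  # (3) both
    ]


if __name__ == "__main__":
    for text in training_variants("Hello world", "Hallo", "Hallo Welt"):
        print(text)
```

Because every variant shares one serialization, a single model can be trained on all three and then scored in whichever mode the available inputs allow.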

Enhanced Robustness with Synthetic Data

A key feature of MetricX-24 is its use of synthetic training data, which improves its ability to handle common translation errors like undertranslations, fluent but unrelated translations, and missing punctuation. 

“After seeing the initial benefits from the simple synthetic data used for training MetricX-23, we decided to construct a more comprehensive collection of synthetic training examples,” the researchers noted.

They further emphasized that “adding a relatively small amount of synthetic data to the training set can boost the metric’s performance, especially on lower-quality translations.”

The synthetic training data, generated from DA datasets spanning WMT15 to WMT21 and covering 43 language pairs, targets failure modes such as gibberish translations and missing punctuation, which are typically not well represented in regular datasets. These synthetic examples train the model to recognize a broader range of translation errors and improve its accuracy on challenging, lower-quality outputs. The researchers noted that MetricX-24 is “significantly more robust to various types of bad translations” compared to its predecessor.
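To make the idea concrete, the sketch below derives a few degraded candidates from a clean reference translation. It is an illustration only: the function names, the truncation ratio, and the perturbation rules are assumptions, not the procedure the researchers used, and the quality score assigned to each synthetic example is left out.

```python
import string
from typing import List


def undertranslate(reference: str, keep_fraction: float = 0.5) -> str:
    """Simulate an undertranslation by keeping only the leading words."""
    words = reference.split()
    keep = max(1, int(len(words) * keep_fraction))
    return " ".join(words[:keep])


def drop_final_punctuation(reference: str) -> str:
    """Simulate missing punctuation by stripping trailing punctuation marks."""
    return reference.rstrip().rstrip(string.punctuation)


def unrelated_fluent(reference: str, pool: List[str]) -> str:
    """Simulate a fluent but unrelated translation: return the first
    sentence from `pool` that differs from the reference."""
    for sentence in pool:
        if sentence != reference:
            return sentence
    raise ValueError("pool contains no sentence different from the reference")


if __name__ == "__main__":
    ref = "The committee approved the budget on Tuesday."
    print(undertranslate(ref))
    print(drop_final_punctuation(ref))
    print(unrelated_fluent(ref, [ref, "It rained all night."]))
```

Each perturbed candidate would be paired with the original source and reference and given a label reflecting the injected error, so the metric sees error types that human-rated data rarely contains.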

In testing on the WMT 2023 MQM ratings, MetricX-24 outperformed MetricX-23 across the board, with the researchers reporting a “significant performance increase.”

To support the research community, Google will release MetricX-24’s code and models on GitHub, allowing researchers to explore and adapt the metric further.

Authors: Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag