Specifically, they used MQM annotations from the WMT 2022 Chat Shared Task, which included real-life bilingual customer support conversations translated by machine translation (MT) systems submitted by participants.
Expert linguists and translators at Unbabel, specifically trained in evaluating customer support content using the MQM framework, assessed the translations and considered the complete conversational context while doing so.
Room for Improvement
They discovered that reference-based metrics, such as COMET-22 and METRICX-23-XL, outperformed reference-free metrics, such as METRICX-23-QE-XL and COMET-20-QE, especially for translations in languages other than English, suggesting that there is “room for improvement for reference-free evaluation for assessing translations in languages other than English.”
Incorporating contextual information improved the correlation with human judgments, particularly for the reference-free COMET-20-QE on non-English translations. However, adding context had a negative impact on the evaluation of translations into English.
The researchers explored two types of contextual information for evaluating translation quality: within participants and across participants. A typical chat conversation has two participants, a customer and an agent. When the text being evaluated was written by the customer, the context can consist of previous turns by the same participant only (within) or of previous turns by both the customer and the agent (across).
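The difference between the two context types can be illustrated with a minimal Python sketch. It is not the researchers' code: the `Turn` class, the `select_context` function, and the `window` parameter are hypothetical names introduced here purely for illustration, and the sample chat is invented.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "customer" or "agent"
    text: str


def select_context(turns, index, mode, window=2):
    """Return up to `window` turns preceding turns[index] as context.

    mode="within": only earlier turns by the same speaker as the target.
    mode="across": earlier turns by both speakers.
    (Hypothetical helper; the window size is an assumption.)
    """
    if mode not in ("within", "across"):
        raise ValueError("mode must be 'within' or 'across'")
    if window <= 0:
        return []
    target = turns[index]
    previous = turns[:index]
    if mode == "within":
        previous = [t for t in previous if t.speaker == target.speaker]
    return previous[-window:]


# Invented example conversation.
chat = [
    Turn("customer", "My order has not arrived."),
    Turn("agent", "Could you share the order number?"),
    Turn("customer", "It is 12345."),
    Turn("agent", "Thanks, I am checking now."),
    Turn("customer", "Any update?"),
]

within = select_context(chat, 4, "within")  # the two earlier customer turns
across = select_context(chat, 4, "across")  # the two immediately preceding turns
```

For the last customer message, the within setting yields only the customer's own earlier messages, while the across setting yields the most recent exchange between customer and agent.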
Bilingual Context Improves Evaluation
They also investigated the use of large language models (LLMs) for assessing chat translation quality and introduced CONTEXT-MQM, an LLM-based metric that utilizes context to enhance evaluation. Initial experiments showed promising results in enhancing the quality assessment of machine-translated chats.
“Our preliminary experiments with CONTEXT-MQM show that adding bilingual context to the evaluation prompt indeed helps improve the quality assessment of machine-translated chats,” they said.
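As a rough illustration of what adding bilingual context to an evaluation prompt could look like, here is a minimal Python sketch. The article does not reproduce the actual CONTEXT-MQM prompt, so the wording, the `build_eval_prompt` function, and its parameters are all assumptions made for this example, not the researchers' implementation.

```python
def build_eval_prompt(context, source, translation, src_lang, tgt_lang):
    """Assemble a plain-text evaluation prompt (hypothetical wording).

    `context` is a list of (source_text, translated_text) pairs for the
    preceding turns, so the prompt carries both language sides.
    """
    lines = [
        f"Identify translation errors in the {src_lang} to {tgt_lang} "
        "translation below and label each one by severity."
    ]
    if context:
        lines.append("Context (previous turns, original and translation):")
        for src_turn, tgt_turn in context:
            lines.append(f"  {src_lang}: {src_turn}")
            lines.append(f"  {tgt_lang}: {tgt_turn}")
    lines.append(f"{src_lang} source: {source}")
    lines.append(f"{tgt_lang} translation: {translation}")
    return "\n".join(lines)


# Invented example: one preceding agent turn as bilingual context.
prompt = build_eval_prompt(
    context=[("Could you share the order number?",
              "Könnten Sie mir die Bestellnummer nennen?")],
    source="It is 12345.",
    translation="Sie lautet 12345.",
    src_lang="English",
    tgt_lang="German",
)
```

The resulting string would then be sent to an LLM of choice; which model and which prompt format work best is exactly the open question the researchers point to.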
The researchers highlighted the potential of using LLMs to evaluate the quality of chat translations with contextual information. They added that future research should explore alternative prompting strategies for including context across various language pairs and LLMs.
Authors: Sweta Agrawal, Amin Farajian, Patrick Fernandes, Ricardo Rei, André F.T. Martins