Independent expert annotators then assessed the translations using the Multidimensional Quality Metrics (MQM) schema, which yielded a detailed analysis of translation accuracy and fluency.
The researchers noted that their goal was to better understand large language model (LLM) translators by integrating them into the translation industry, where professionals could evaluate their quality against human translators at various levels. This approach exposed the systematic differences between LLM-generated and human translations, giving a fuller view of LLM translation quality.
As the researchers noted, they are “the first to evaluate LLMs against various levels of professional human translators and analyze the systematic differences between LLMs and human translators.”
On Par With Juniors, Behind Seniors
The researchers found that GPT-4 matches junior-level translators in accuracy but lags behind senior professionals in fluency and stylistic adaptation. Although GPT-4 consistently delivered accurate translations without omissions, additions, or hallucinations, its literal translation style often led to unnatural phrasing, particularly in specialized domains such as Technology.
While previous studies raised concerns about hallucinations in large language models, the researchers observed that GPT-4 made almost no hallucination errors across all evaluated directions.
In addition to overly literal translation, the researchers noted that GPT-4 exhibited weaknesses in grammar and named entity recognition, as well as lexical inconsistency. “We observe that GPT-4 exhibits two primary limitations: adherence to overly literal translations and lexical inconsistency,” they said.
Despite these challenges, GPT-4 was noted for maintaining “consistent translation quality across all evaluated language directions,” including in low-resource language pairs. This is a notable strength compared to traditional NMT systems like SeamlessM4T, which often struggle in such contexts. The researchers pointed out that “GPT-4 mitigates traditional machine translators’ drawback of significant performance gaps from resource-rich to resource-poor directions.”
The researchers concluded that “GPT-4 represents a significant milestone in neural machine translation” and emphasized that “LLMs have the potential to replace human translators, especially junior and medium ones, feasibly.”
Authors: Jianhao Yan, Pingchuan Yan, Yulong Chen, Jing Li, Xianchao Zhu, Yue Zhang