GPT-4 Vs. Human Translators Cont’d

The already extensive body of literature comparing human translators with AI continues to grow. A recent study led by researchers from China-based Westlake University, University College London, the University of Cambridge, and China-based language service provider LanBridge Group benchmarked GPT-4’s machine translation (MT) performance against human translators of varying expertise levels.

According to the researchers, GPT-4 delivers translation quality comparable to junior- and mid-level human translators but falls short when compared to senior professionals, “indicating machine translation is yet [sic] a solved problem.”

The evaluation covered three language pairs — English↔Chinese, English↔Russian, and Chinese↔Hindi — and three domains: News, Technology, and Biomedicine. The researchers asked junior, medium, and senior human translators — ranked based on their educational background, translation experience, and practical proficiency — to translate source sentences into the target language, alongside GPT-4 and SeamlessM4T.

Independent expert annotators then assessed the translations using the Multidimensional Quality Metrics (MQM) schema, providing a comprehensive analysis of translation accuracy and fluency.

The researchers noted that their goal was to better understand large language model (LLM) translators by integrating them into the translation industry, allowing professionals to evaluate their quality against human translators at various levels. This approach provided deeper insights into the systematic differences between LLM-generated translations and human translations, offering a more comprehensive view of LLM translation quality.

As the researchers noted, they are “the first to evaluate LLMs against various levels of professional human translators and analyze the systematic differences between LLMs and human translators.”

On Par With Juniors, Behind Seniors

The researchers found that GPT-4 matches junior-level translators in terms of accuracy, but it lags in fluency and stylistic adaptation when compared to senior professionals. Although GPT-4 consistently delivered accurate translations without omissions, additions, or hallucinations, its literal translation style often led to unnatural phrasing, particularly in specialized domains such as Technology.

While previous studies raised concerns about hallucinations in large language models, the researchers observed that GPT-4 made almost no hallucination errors across all evaluated directions.

In addition to overly literal translation, the researchers noted that GPT-4 showed weaknesses in grammar and named entity recognition, as well as lexical inconsistency. “We observe that GPT-4 exhibits two primary limitations: adherence to overly literal translations and lexical inconsistency,” they said.

Despite these challenges, GPT-4 was noted for maintaining “consistent translation quality across all evaluated language directions,” including in low-resource language pairs. This is a notable strength compared to traditional neural machine translation (NMT) systems like SeamlessM4T, which often struggle in such contexts. The researchers pointed out that “GPT-4 mitigates traditional machine translators’ drawback of significant performance gaps from resource-rich to resource-poor directions.”

The researchers concluded that “GPT-4 represents a significant milestone in neural machine translation” and emphasized that “LLMs have the potential to replace human translators, especially junior and medium ones, feasibly.”

Authors: Jianhao Yan, Pingchuan Yan, Yulong Chen, Jing Li, Xianchao Zhu, Yue Zhang