How Good Are LLMs for Literary Translation?

In an October 24, 2024 paper, researchers Ran Zhang and Steffen Eger from the Natural Language Learning and Generation (NLLG) Lab, along with Wei Zhao from the University of Aberdeen, claimed they demonstrated that in the era of large language models (LLMs), literary translation remains “an exclusive domain of human translators.”

Despite LLM advancements, they found a “substantial gap” between the quality of human and machine-generated literary translations, as “LLMs tend to produce more literal and less diverse translations.” The researchers did note, however, that newer LLMs, such as GPT-4o, perform substantially better than older ones.

The study involved four student annotators and four professional translators (one for each language pair tested), who expressed a clear preference for human translations over even the best-performing LLMs. Specifically, the annotators rated human outputs as superior to all machine-generated translations, with GPT-4o ranking a close second and “closely approaching human literary translation,” followed by Google Translate and DeepL.

The researchers acknowledged that the limited number of annotators and translators, along with the complexity of the task for student annotators, were significant limitations of the study.

They warned that overestimating LLM quality in literary translation could lead companies to misjudge LLM capabilities, potentially resulting in the displacement of human translators, reduced salaries, and a decline in the quality and aesthetic value of translated works.

They underscored the need for evaluation metrics that capture the cultural, stylistic, and creative nuances of literary translation — elements that fall beyond traditional sentence-level machine translation (MT) evaluation.

Gaps in Literary Evaluation

Their research highlights major gaps in current literary MT evaluation methods. Existing automatic metrics generally fail to distinguish between human and machine translations in this domain, while recent LLM-based metrics remain largely untested on literary texts. Moreover, human evaluation frameworks lack standardization for literary texts. They noted that “current studies apply these approaches arbitrarily, resulting in findings that are not directly comparable and potentially leading to unreliable conclusions.” 

Evaluation datasets are also small, limited in scope, and often lack verified human translations, risking an overestimation of LLM capabilities.

To address these gaps, they developed LITEVAL-CORPUS, a dataset of over 2,000 paragraphs and 13,000 sentences from classic and contemporary literary works across four language pairs: English-German, German-English, English-Chinese, and German-Chinese.

This dataset allowed the researchers to examine the effectiveness of annotation schemes used in human evaluation — like Multidimensional Quality Metrics (MQM), Scalar Quality Metric (SQM), and Best-Worst Scaling (BWS) — and assess recent LLM-based metrics, such as XCOMET and GEMBA-MQM.

“Our work provides the first systematic comparison of these evaluation methods for literary texts,” they said.

MQM “Inadequate”

They found MQM to be “inadequate for literary translation,” as it struggled to differentiate human translations from high-quality LLM outputs, potentially leading to “false conclusions regarding the quality of LLMs for literary translation.” 

This shortfall may stem from the MQM framework’s inability to account for intentional stylistic choices in literary translation and its limitations in evaluating the nuanced, high-quality outputs of today’s top models, the researchers suggested.

SQM, while effective, depends on the annotator’s expertise. BWS proved best at distinguishing high-quality human translations from machine translations, though it does not provide the detailed error insights of MQM.
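To make the contrast between the schemes concrete, here is a minimal sketch of how Best-Worst Scaling judgments are commonly turned into scores: each item's score is the number of times it was picked as best, minus the number of times it was picked as worst, divided by the number of times it was shown. This is the standard counting approach, not necessarily the exact procedure used in the study. The function name `bws_scores`, the translation IDs, and the judgments are all invented for illustration.

```python
from collections import defaultdict


def bws_scores(judgments):
    """Compute counting-based Best-Worst Scaling scores.

    judgments: iterable of (candidates, best, worst) tuples, where
    `candidates` is the set of items shown to the annotator and
    `best` / `worst` are the items the annotator picked.
    Returns a dict mapping each item to a score in [-1, 1].
    """
    best_counts = defaultdict(int)
    worst_counts = defaultdict(int)
    shown_counts = defaultdict(int)

    for candidates, best, worst in judgments:
        for item in candidates:
            shown_counts[item] += 1
        best_counts[best] += 1
        worst_counts[worst] += 1

    # Score = (times best - times worst) / times shown
    return {
        item: (best_counts[item] - worst_counts[item]) / shown_counts[item]
        for item in shown_counts
    }


# Invented toy data: four candidate translations of the same paragraph,
# judged four times. These are NOT results from the study.
candidates = ("t1", "t2", "t3", "t4")
judgments = [
    (candidates, "t1", "t4"),
    (candidates, "t1", "t3"),
    (candidates, "t2", "t4"),
    (candidates, "t1", "t3"),
]

scores = bws_scores(judgments)
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {score:+.2f}")
```

The sketch also shows the trade-off described above: the output is a single relative ranking score per translation, with no record of which errors led an annotator to prefer one version over another.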

LLM Metrics Biased Towards Own Outputs

Beyond annotation schemes, the researchers evaluated the effectiveness of recent LLM-based metrics for literary translation, particularly focusing on reference-free metrics, as reference translations are scarce for most literary works.

They examined four state-of-the-art (SOTA) metrics: Prometheus 2; XCOMET-XL; XCOMET-XXL, considered “the strongest open-source MT metric for standard MT” and fine-tuned to assess translation quality by generating scores and marking error spans with severity labels; and GEMBA-MQM, a leading prompting-based metric that detects translation quality errors using the MQM framework adapted for LLMs.

While GEMBA-MQM performed best overall, it faced challenges in distinguishing human translations from LLM outputs and was driven primarily by accuracy, struggling with fluency, style, and terminology. XCOMET-XL and XCOMET-XXL followed in rank, showing moderate to poor correlation with human annotations, while Prometheus 2 ranked lowest.
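As an illustration of what "correlation with human annotations" measures, the sketch below computes Kendall's tau-a between a set of human scores and a set of metric scores for the same translations. The article does not say which correlation statistic the study used; Kendall's tau is a common choice in MT metric evaluation and is assumed here. The function name and all scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant, so this
    simple variant understates agreement when there are many ties.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")

    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both rankings order this pair the same way
        elif direction < 0:
            discordant += 1  # the rankings disagree on this pair

    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Invented toy data: five translations scored by a human annotator
# and by an automatic metric. These are NOT results from the study.
human_scores = [4, 2, 5, 1, 3]
metric_scores = [0.8, 0.4, 0.9, 0.5, 0.6]

tau = kendall_tau_a(human_scores, metric_scores)
print(f"Kendall tau-a: {tau:.2f}")
```

A value near 1 means the metric ranks translations much as the human annotator does, a value near 0 means the two rankings are largely unrelated, and a negative value means the metric tends to reverse the human ordering.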

Interestingly, they found that LLMs, when used as evaluators, showed a preference for more literal translations over human ones and displayed a bias towards their own outputs. 

To support further research, the LITEVAL-CORPUS dataset and code have been made publicly accessible on GitHub.