Each question is answered with a simple “Yes,” “No,” or “Maybe,” which the system maps to scores of 1, 0, or 0.5, respectively. LITRANSPROQA computes an overall quality score as an average — with the option to weight results based on translator input.
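The scoring described above can be illustrated with a short sketch. This is not the researchers' implementation: the function name `overall_score`, the `ANSWER_SCORES` table, and the form of the weights (one non-negative number per question) are assumptions made for illustration. Only the mapping of answers to 1, 0, and 0.5 and the idea of an optionally weighted average come from the description.

```python
from typing import Optional, Sequence

# Mapping of answers to scores, as described in the article.
ANSWER_SCORES = {"yes": 1.0, "no": 0.0, "maybe": 0.5}


def overall_score(
    answers: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Return the (optionally weighted) average score of a list of answers.

    `weights` is a hypothetical per-question weighting, e.g. derived from
    translator input; when omitted, every question counts equally.
    """
    if not answers:
        raise ValueError("at least one answer is required")

    # Normalize case and whitespace before looking up each score.
    scores = [ANSWER_SCORES[a.strip().lower()] for a in answers]

    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("weights must match the number of answers")

    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted mean: sum(score * weight) / sum(weight).
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


if __name__ == "__main__":
    print(overall_score(["Yes", "No", "Maybe"]))
    print(overall_score(["Yes", "No"], weights=[3, 1]))
```

With equal weights the result is a plain mean of the per-question scores. Passing weights lets questions that translators consider more important count for more in the final score.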
“Unlike existing QA-based MT metrics, LITRANSPROQA focuses on core elements in literary translation proposed and verified by researchers and experienced professional literary translators,” the researchers noted.
Rather than rely on LLMs to generate evaluation questions, the researchers sourced them from literary translation theory, translator interviews, and training materials, explaining that “LLMs are not yet fully trustworthy for automatic question generation […] in the literary domain.”
The questions were refined in collaboration with experienced literary translators. From an initial list of 45 questions, they selected 25 that experts found relevant and LLMs could meaningfully respond to. The researchers highlighted that “LITRANSPROQA reflects professional translators’ quality control and assessment process.”
Human-Level Evaluation Capabilities
The researchers benchmarked LITRANSPROQA against widely used metrics including XCOMET-XL, COMET-KIWI, and GEMBA-MQM.
They found that LITRANSPROQA correlated more strongly with human judgments than these metrics and was particularly effective at distinguishing professional human translations from AI translations.
“LITRANSPROQA demonstrates substantial progress toward human-level evaluation capabilities,” the researchers said, noting that its performance approached that of trained student annotators.
Well-Suited for Copyrighted Texts
A key advantage of LITRANSPROQA is its compatibility with open-source LLMs, including Meta’s LLaMA3.3-70B and Alibaba’s Qwen2.5-32B. These models matched — and in some cases outperformed — proprietary options such as GPT-4o-mini.
This compatibility reduces the framework's reliance on proprietary technology and makes it well-suited to use cases involving sensitive or copyrighted content, where local processing and data control are essential.
This demonstrates the framework’s “broad applicability” and “value as an accessible, training-free metric for evaluating literary texts — particularly those requiring local processing for copyright or ethical reasons,” according to the researchers.
The code and datasets are available on GitHub.
Authors: Ran Zhang, Wei Zhao, Lieve Macken, and Steffen Eger