Welocalize and Duke University Release Benchmark for AI Translation Post-Editing

Welocalize, in collaboration with Duke University, introduced LangMark, a new multilingual, human-annotated dataset built to evaluate the translation post-editing capabilities of large language models (LLMs).

The dataset comprises 206,983 triplets of English source text, machine-translated output, and human post-edited translation across seven target languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish.
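To make the triplet structure concrete, here is a minimal Python sketch of how one such record might be represented. The class name, field names, and the `edit_rate` helper are illustrative assumptions, not LangMark's actual schema or tooling. Treating any difference in the text as an edit is also a simplification.

```python
from dataclasses import dataclass

# The seven target languages named in the article.
TARGET_LANGUAGES = (
    "Brazilian Portuguese", "French", "German", "Italian",
    "Japanese", "Russian", "Spanish",
)


@dataclass(frozen=True)
class Triplet:
    """One record. Field names are hypothetical, not LangMark's schema."""
    source: str       # English source text
    mt_output: str    # machine-translated output
    post_edit: str    # human post-edited translation
    target_lang: str  # one of TARGET_LANGUAGES

    def was_edited(self) -> bool:
        # Assumption: any difference after trimming whitespace counts as an edit.
        return self.mt_output.strip() != self.post_edit.strip()


def edit_rate(triplets):
    """Fraction of triplets that the human post-editor changed."""
    if not triplets:
        return 0.0
    return sum(t.was_edited() for t in triplets) / len(triplets)
```

A per-language edit rate computed this way would show which target languages needed the most human intervention.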

What sets LangMark apart, according to Welocalize, is its focus on real-world, domain-specific content drawn from actual marketing materials. All translations were refined by linguists with at least five years of industry experience and three or more years specializing in marketing translation and post-editing.

“These aren’t artificial examples — they reflect the kind of post-edits linguists make every day to ensure clarity, cultural fit, and brand consistency,” Welocalize noted. “To the best of our knowledge, LangMark is the largest multilingual, human-annotated dataset,” they added.

Benchmarking LLMs

Using LangMark, the researchers evaluated various closed- and open-source LLMs, including GPT-4o, Claude 3.5 (Sonnet and Haiku), Gemini 1.5, Llama 3, and Qwen 2.5.

They found that GPT-4o consistently outperformed all other models, both closed- and open-source, particularly in Japanese and Russian, where more edits were required. Among closed models, Gemini 1.5 Flash and Claude 3.5 Haiku followed closely. The open-source Qwen 2.5-72B rivaled closed models, achieving top results in Russian.

They also found that high-performing models like GPT-4o exhibited “conservative” behavior, editing only when necessary and generally aligning well with human judgments. In contrast, more “aggressive” models like Claude 3.5 Sonnet, which flagged more segments as needing edits, often introduced unnecessary changes, lowering overall quality.

The researchers emphasized that a key challenge in AI post-editing is knowing when not to intervene. While some segments require corrections, others are best left untouched.
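One way to picture the "when not to intervene" problem is to compare a model's edit decisions with the human post-editor's, segment by segment. The Python sketch below is an illustration under stated assumptions, not the researchers' evaluation method. The function name and return keys are hypothetical, and a segment counts as edited whenever its text differs from the machine translation.

```python
def edit_decision_stats(mt, model_out, human_out):
    """Compare a model's edit/no-edit decisions with the human post-editor's.

    Only the decision to edit is compared, not whether the model's edit
    matches the human's. All names here are illustrative.
    """
    tp = fp = fn = tn = 0
    for m, o, h in zip(mt, model_out, human_out):
        model_edited = o != m
        human_edited = h != m
        if model_edited and human_edited:
            tp += 1   # both changed the segment
        elif model_edited:
            fp += 1   # unnecessary edit: the human left it alone
        elif human_edited:
            fn += 1   # missed edit: the human changed it, the model did not
        else:
            tn += 1   # both left the segment untouched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "unnecessary_edits": fp,
        "missed_edits": fn,
        "untouched_by_both": tn,
        "precision": precision,
        "recall": recall,
    }
```

In these terms, an "aggressive" model would tend to show more unnecessary edits and lower precision, while an overly cautious one would show more missed edits and lower recall.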

Ongoing Value of Human Expertise

While this work showed that some LLMs with few-shot prompting can effectively perform AI post-editing, Welocalize underlined the “ongoing value of human expertise.”

“Expert post-editors remain essential for catching subtle mistakes, aligning with client expectations, and delivering content that resonates across markets,” they said.

Rethinking Evaluation

The researchers also identified a gap in current evaluation practices. Traditional metrics are “insufficient” for assessing AI post-editing, since they cannot capture the key judgment of whether to edit at all.

They argue that “an ideal evaluation metric should consider both the quality of the final output and the number of edits performed, accounting for the balance between unnecessary conservatism and excessive intervention.”

Although this work does not propose such a metric, the researchers hope it will encourage the development of more comprehensive evaluation frameworks and guide the design of AI post-editing systems that better align with human post-editing strategies.

Authors: Diego Velazquez, Mikaela Grace, Konstantinos Karageorgos, Lawrence Carin, Aaron Schliem, Dimitrios Zaikis, and Roger Wechsler
