“These aren’t artificial examples — they reflect the kind of post-edits linguists make every day to ensure clarity, cultural fit, and brand consistency,” Welocalize noted. “To the best of our knowledge, LangMark is the largest multilingual, human-annotated dataset,” they added.
Benchmarking LLMs
Using LangMark, the researchers evaluated various closed- and open-source LLMs, including GPT-4o, Claude 3.5 (Sonnet and Haiku), Gemini 1.5, Llama 3, and Qwen 2.5.
They found that GPT-4o consistently outperformed all other models, both closed- and open-source, particularly in Japanese and Russian, where more edits were required. Among closed models, Gemini 1.5 Flash and Claude 3.5 Haiku followed closely. The open-source Qwen 2.5-72B rivaled closed models, achieving top results in Russian.
They also found that high-performing models like GPT-4o exhibited “conservative” behavior, editing only when necessary and generally aligning well with human judgments. In contrast, more “aggressive” models like Claude 3.5 Sonnet, which flagged more segments as needing edits, often introduced unnecessary changes, lowering overall quality.
The researchers emphasized that a key challenge in AI post-editing is knowing when not to intervene. While some segments require corrections, others are best left untouched.
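To make the "when not to intervene" idea concrete, here is a minimal sketch of how a model's edit decisions could be compared against human ones. It is not taken from the LangMark paper: the function name, the boolean per-segment flags, and the toy data are all assumptions made for illustration.

```python
def edit_decision_scores(human_edited, model_edited):
    """Precision and recall of a model's 'edit' decisions against human ones.

    Both arguments are hypothetical per-segment flags:
    True means the segment was changed, False means it was left as-is.
    """
    if len(human_edited) != len(model_edited):
        raise ValueError("both lists must cover the same segments")
    pairs = list(zip(human_edited, model_edited))
    # Segments both the human and the model chose to edit.
    tp = sum(1 for h, m in pairs if h and m)
    # Segments the model edited although the human left them untouched.
    fp = sum(1 for h, m in pairs if not h and m)
    # Segments the human edited but the model left untouched.
    fn = sum(1 for h, m in pairs if h and not m)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented): the human post-editor changed segments 0 and 3 only.
human = [True, False, False, True, False]
conservative = [True, False, False, False, False]
aggressive = [True, True, True, True, False]

print(edit_decision_scores(human, conservative))
print(edit_decision_scores(human, aggressive))
```

On this toy data, the conservative model never edits a segment the human left alone but misses one needed edit, while the aggressive model catches every needed edit at the cost of two unnecessary ones.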
Ongoing Value of Human Expertise
While this work showed that some LLMs with few-shot prompting can effectively perform AI post-editing, Welocalize underlined the “ongoing value of human expertise.”
“Expert post-editors remain essential for catching subtle mistakes, aligning with client expectations, and delivering content that resonates across markets,” they said.
Rethinking Evaluation
The researchers also identified a gap in current evaluation practices. Traditional metrics are “insufficient” for assessing AI post-editing, since they cannot capture the key judgment of whether to edit at all.
They argue that “an ideal evaluation metric should consider both the quality of the final output and the number of edits performed, accounting for the balance between unnecessary conservatism and excessive intervention.”
Although this work does not propose such a metric, the researchers hope it will encourage the development of more comprehensive evaluation frameworks and guide the design of AI post-editing systems that better align with human post-editing strategies.
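Purely as an illustration of what a metric balancing output quality against edit volume might look like, the sketch below combines the two. It is an assumption-laden toy, not a metric from the paper: word-level similarity from Python's difflib stands in for a real quality measure, and the function names and the `weight` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Word-level similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def edit_aware_score(mt, model_pe, human_pe, weight=0.5):
    """Toy score: closeness to the human post-edit, minus a penalty for
    editing much more or much less than the human did.

    mt       -- raw machine translation
    model_pe -- the model's post-edited version
    human_pe -- the human post-edited reference
    weight   -- hypothetical penalty strength, chosen arbitrarily here
    """
    # Quality proxy: how close the model output is to the human reference.
    quality = similarity(model_pe, human_pe)
    # Edit effort: how far each post-edit moved away from the raw MT.
    model_effort = 1.0 - similarity(mt, model_pe)
    human_effort = 1.0 - similarity(mt, human_pe)
    # Penalize both over-editing and under-editing relative to the human.
    effort_gap = abs(model_effort - human_effort)
    return quality - weight * effort_gap


# Invented example: the human left the segment untouched.
mt = "the cat sat"
human_pe = "the cat sat"
print(edit_aware_score(mt, "the cat sat", human_pe))
print(edit_aware_score(mt, "a feline was sitting", human_pe))
```

In this example the model that leaves the segment alone scores higher than the one that rewrites it, which is the behavior the researchers describe as desirable when no correction is needed.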
Authors: Diego Velazquez, Mikaela Grace, Konstantinos Karageorgos, Lawrence Carin, Aaron Schliem, Dimitrios Zaikis, and Roger Wechsler