The 2025 Appen study evaluated anonymised LLM outputs across more than 20 languages and dialects, identifying idioms, humor, and culturally specific expressions as common failure points.
The new paper takes a different approach. The researchers evaluate seven named models, both open- and closed-weight (GPT-5, gpt-oss 120B, Claude Sonnet 3.7, Mistral Medium 3.1, Llama 4, DeepSeek V3.1, and Aya Expanse 8B), on the task of translating English marketing emails into 15 language–locale combinations.
The benchmark separates evaluation into two layers: full-text quality and segment-level performance on culturally sensitive elements such as idioms, puns, holiday references, and broader cultural concepts.
Each translation is scored by multiple native speakers, using a structured rubric covering content fidelity, style, audience appropriateness, and overall quality. This allows the researchers to quantify not just whether translations are acceptable, but where and how they fail.
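As a rough illustration of how rubric scores from several raters might be rolled up, the sketch below averages the four rubric dimensions per rater and then averages across raters for each model and content category. The 1-5 scale, the field names, the sample values, and the equal weighting of dimensions are all assumptions made for this example; the article does not describe the paper's actual aggregation method.

```python
from collections import defaultdict
from statistics import mean

# Rubric dimensions named in the article.
DIMENSIONS = ("content_fidelity", "style", "audience_appropriateness", "overall_quality")
MAX_SCORE = 5  # hypothetical scale maximum; the real scale is not stated


def aggregate(ratings):
    """Return the mean normalised score (0-1) per (model, category).

    Each rating is one native speaker's scores for one translation.
    'category' is either 'full_text' or a segment type such as 'idiom'.
    """
    buckets = defaultdict(list)
    for r in ratings:
        # Equal weighting of dimensions is an assumption of this sketch.
        per_rater = mean(r["scores"][d] for d in DIMENSIONS)
        buckets[(r["model"], r["category"])].append(per_rater / MAX_SCORE)
    return {key: mean(values) for key, values in buckets.items()}


def uniform(score):
    """Helper for the sample data: the same score on every dimension."""
    return {d: score for d in DIMENSIONS}


# Invented sample data for a placeholder model, two raters per item.
sample_ratings = [
    {"model": "model_a", "category": "full_text",
     "scores": {"content_fidelity": 4, "style": 3,
                "audience_appropriateness": 4, "overall_quality": 3}},
    {"model": "model_a", "category": "full_text", "scores": uniform(3)},
    {"model": "model_a", "category": "idiom", "scores": uniform(2)},
    {"model": "model_a", "category": "idiom", "scores": uniform(3)},
]

result = aggregate(sample_ratings)
for (model, category), score in sorted(result.items()):
    print(f"{model:10s} {category:10s} {score:.2f}")
```

Keeping the content category in the grouping key is what would let a comparison between, say, idioms and full-text scores fall out of the same table, rather than requiring a separate pass over the data.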
Idioms and Puns Remain the Hardest Problems
The results reinforce earlier warnings from Appen, but with clearer numerical evidence. Across all models, average full-text quality scores remain modest, with top systems achieving just over two-thirds of the maximum possible score. GPT-5 scores highest, followed by Claude Sonnet 3.7 and Mistral Medium 3.1. Cohere’s Aya Expanse 8B sits at the lower end of the scale.
More pronounced than the differences between models is the gap between content types. Holiday references and general cultural concepts are handled relatively well, while idioms and puns consistently receive the lowest scores.
Idioms, in particular, are frequently left untranslated altogether, suggesting models may choose to omit difficult expressions rather than risk an incorrect adaptation. Overall, the findings indicate that figurative language remains difficult for LLMs to localize reliably, even in top-performing systems.
What Comes Next
Appen plans to release the dataset and evaluation framework as a public benchmark, allowing reproducible research on cultural localization in AI translation and multilingual LLM evaluation.
Future work will extend the benchmark beyond text, including an audio-based version to assess spoken localization, where humour, tone, and emphasis play an even larger role. The researchers also plan to expand the benchmark to additional domains and languages.
“To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation,” an Appen researcher said.
The findings point to ongoing limitations in current state-of-the-art models and highlight the need for training data and evaluation methods that move beyond surface-level correctness toward real-world communicative competence.
Authors: Madison Van Doren, Casey Ford, Jennifer Barajas, and Cory Holland