Cultural Localization Remains a Weak Spot for AI Translation, Appen Says

Appen has published a new paper showing that even the most advanced large language models (LLMs) continue to struggle with culturally nuanced translation, particularly when handling idioms, puns, and figurative language in marketing content.

Released on February 4, 2026, this study builds on an October 2025 Appen paper that found multilingual AI systems often produce fluent but culturally inappropriate translations.

While the earlier work offered a high-level snapshot of the problem, the new paper introduces a more structured, statistically grounded evaluation and confirms that cultural localization remains a weak spot for AI translation.

The 2025 Appen study evaluated anonymized LLM outputs across more than 20 languages and dialects, identifying idioms, humor, and culturally specific expressions as common failure points.

The new paper takes a different approach. The researchers evaluate seven named models, both open- and closed-weight (GPT-5, gpt-oss 120B, Claude Sonnet 3.7, Mistral Medium 3.1, Llama 4, DeepSeek V3.1, and Aya Expanse 8B), on the task of translating English marketing emails into 15 language–locale combinations.

The benchmark separates evaluation into two layers: full-text quality and segment-level performance on culturally sensitive elements such as idioms, puns, holiday references, and broader cultural concepts.

Each translation is scored by multiple native speakers, using a structured rubric covering content fidelity, style, audience appropriateness, and overall quality. This allows the researchers to quantify not just whether translations are acceptable, but where and how they fail.
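To make the two-layer setup concrete, here is a minimal Python sketch of how such ratings might be aggregated. It is not Appen's implementation: the 1-to-5 scale, the dimension keys, the function names, and the sample ratings are all assumptions for illustration, since the article does not specify the scoring range or the aggregation method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keys for the four rubric criteria named in the article.
DIMENSIONS = ("fidelity", "style", "audience", "overall")
MAX_SCORE = 5  # assumed scale; the actual range is not stated in the article


def full_text_score(ratings):
    """Average each rater's rubric mean, then normalize to the 0..1 range."""
    per_rater = [mean(r[d] for d in DIMENSIONS) for r in ratings]
    return mean(per_rater) / MAX_SCORE


def segment_scores_by_type(segments):
    """Return the mean normalized score for each content type."""
    buckets = defaultdict(list)
    for seg in segments:
        buckets[seg["type"]].extend(seg["scores"])
    return {t: mean(s) / MAX_SCORE for t, s in buckets.items()}


# Made-up ratings purely for illustration; not data from the paper.
ratings = [
    {"fidelity": 4, "style": 3, "audience": 3, "overall": 3},
    {"fidelity": 4, "style": 4, "audience": 3, "overall": 4},
]
segments = [
    {"type": "idiom", "scores": [2, 1]},
    {"type": "holiday", "scores": [4, 5]},
    {"type": "idiom", "scores": [2, 3]},
]

print(round(full_text_score(ratings), 2))
print(segment_scores_by_type(segments))
```

Keeping the segment-level scores grouped by content type is what lets a benchmark of this kind report where translations fail, rather than only whether the full text is acceptable.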

Idioms and Puns Remain the Hardest Problems

The results reinforce earlier warnings from Appen, this time with clearer numerical evidence. Across all models, average full-text quality scores remain modest, with top systems achieving just over two-thirds of the maximum possible score. GPT-5 scores highest, followed by Claude Sonnet 3.7 and Mistral Medium 3.1. At the lower end of the scale is Cohere’s Aya Expanse 8B.

More pronounced than the differences between models is the gap between content types. Holiday references and general cultural concepts are handled relatively well, while idioms and puns consistently receive the lowest scores.

Idioms, in particular, are frequently left untranslated altogether, suggesting models may choose to omit difficult expressions rather than risk an incorrect adaptation. Overall, the findings indicate that figurative language remains difficult for LLMs to localize reliably, even in top-performing systems.

What Comes Next

Appen plans to release the dataset and evaluation framework as a public benchmark, allowing reproducible research on cultural localization in AI translation and multilingual LLM evaluation.

Future work will also extend the benchmark beyond text, including an audio-based version to assess spoken localization, where humor, tone, and emphasis play an even larger role. The researchers also plan to expand the benchmark to additional domains and languages.

“To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation,” an Appen researcher said.

The findings point to ongoing limitations in current state-of-the-art models and highlight the need for training data and evaluation methods that move beyond surface-level correctness toward real-world communicative competence.

Authors: Madison Van Doren, Casey Ford, Jennifer Barajas, and Cory Holland