Appen Finds LLMs Struggle with Idioms and Culture in Multilingual AI Translations

A new study from Appen has found that large language models (LLMs) consistently stumble on idioms, puns, and cultural nuance when translating marketing content.

“As multilingual LLMs are increasingly integrated into global content workflows, understanding their ability to produce culturally appropriate translations is critical for effective localisation,” said Madison Van Doren, AI Research & Strategy Manager at Appen, and Cory Holland, Senior Linguist at Appen.

Van Doren and Holland tested three LLMs. Models were anonymised as the goal was not benchmarking per se, but capturing a “state-of-use” snapshot across 24 dialects and 20 languages. The source material was marketing emails with figurative expressions such as “Will you brie mine?” and “cat’s meow.” A total of 22 human evaluators reviewed the translations for content accuracy, tone, and cultural fit.

They found that none of the translations were ready to publish without edits. Even when grammatically correct, outputs often lost humor, tone, cultural relevance, or marketing appeal.

Where LLMs Fall Short

Figurative and playful language was the biggest weakness. Puns and idioms were frequently translated literally, resulting in clunky or confusing text.

Cultural resonance also proved difficult, with LLMs often missing the right tone or failing to adapt references to local traditions. In several cases, evaluators had to rewrite the text to anchor it in culturally relevant holidays or idiomatic expressions.

Van Doren and Holland noted that overall performance was uneven across languages. Interestingly, they found that high-resource languages did not necessarily perform better than low-resource ones. Nor was linguistic proximity to English a guarantee of higher quality, which challenges the assumption that it boosts accuracy.

Script type, however, appeared to play a role. Korean and Japanese scored relatively well despite their distance from English, possibly because their writing systems align better with model tokenization. Mandarin fared worse, a result the authors link to challenges in handling logographic writing systems.

Although Van Doren and Holland call the work a “small pilot,” it adds empirical evidence to a growing consensus: LLMs can speed up multilingual workflows, but cultural nuance and idiomatic expression remain key weaknesses and “a clear area for improvement.” For now, that part of the work stays firmly in human hands.

“Cultural appropriateness and overall localisation quality” are “critical factors for real-world applications like marketing and e-commerce” and “this pilot highlights limitations of current multilingual AI systems for real-world localisation use cases,” they concluded.
