In other words, as one observer on Twitter quipped, “Potential alternative headline / interpretation: ‘ChatGPT was trained for translation on common publicly available parallel corpora.’”
For this “preliminary study,” Tencent AI Lab researchers Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu evaluated translation prompts, multilingual translation, and translation robustness.
The experiment started with a “meta” moment, when the team asked ChatGPT itself for prompts or templates that would trigger its MT ability. The prompt that produced the best Chinese–English translations was then used for the rest of the study — 12 directions total between Chinese, English, German, and Romanian.
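The figure of 12 directions follows from the four languages: each ordered source–target pair of distinct languages counts as one direction, so 4 × 3 = 12. A minimal Python sketch of that count is shown here; the variable names are illustrative and are not taken from the study.

```python
from itertools import permutations

# The four languages named in the study.
LANGUAGES = ["Chinese", "English", "German", "Romanian"]

# Each ordered (source, target) pair of distinct languages is one
# translation direction, so Chinese -> English and English -> Chinese
# are counted separately.
directions = list(permutations(LANGUAGES, 2))

for source, target in directions:
    print(f"{source} -> {target}")

print(f"Total directions: {len(directions)}")
```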
The researchers also wanted to know how ChatGPT’s performance varies by language pair. While ChatGPT performed “competitively” with Google Translate and DeepL for English–German translation, its BLEU score for English–Romanian translation was 46.4% lower than that of Google Translate.
The team attributed the poor performance to the pronounced difference in the amount of monolingual data available for English and Romanian, which “limits the language modeling capability of Romanian.”
Romanian–English translation, on the other hand, “can benefit from the strong language modeling capability of English such that the resource gap of parallel data can be somewhat compensated,” for a BLEU score just 10.3% below Google Translate.
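Percentages such as “46.4% lower” and “10.3% below” read most naturally as relative gaps, that is, the difference in BLEU divided by Google Translate’s score. That reading is an assumption here, as is the helper name `relative_bleu_gap`. The scores in this sketch are made-up placeholders, not figures from the study.

```python
def relative_bleu_gap(system_bleu: float, baseline_bleu: float) -> float:
    """Return how far system_bleu falls below baseline_bleu,
    expressed as a percentage of the baseline score."""
    if baseline_bleu <= 0:
        raise ValueError("baseline_bleu must be positive")
    return (baseline_bleu - system_bleu) / baseline_bleu * 100.0


# Hypothetical scores for illustration only (NOT taken from the study).
baseline = 50.0  # placeholder BLEU for the baseline system
system = 40.0    # placeholder BLEU for the system being compared

gap = relative_bleu_gap(system, baseline)
print(f"System is {gap:.1f}% below the baseline")
```

Under this reading, a 46.4% gap would mean the system scored a little over half of the baseline’s BLEU, not 46.4 BLEU points fewer.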
Beyond the Family
Beyond resource differences, the authors wrote, translating between language families is considered more difficult than translating within language families. The difference in the quality of ChatGPT’s output for German–English versus Chinese–English translation seems to bear this out.
Researchers observed an even greater performance gap between ChatGPT and commercial MT systems for low-resource language pairs from different families, such as Romanian–Chinese.
“Since ChatGPT handles different tasks in one model, low-resource translation tasks not only compete with high-resource translation tasks, but also with other NLP tasks for the model capacity, which explains their poor performance,” they wrote.
Google Translate and DeepL both surpassed ChatGPT in translation robustness on two out of three test sets: WMT19 Bio (Medline abstracts) and WMT20 Rob2 (Reddit comments), likely thanks to their continuous improvement as real-world applications fed by domain-specific and noisy sentences.
However, ChatGPT outperformed Google Translate and DeepL “significantly” on the WMT20 Rob3 test set, which was drawn from a crowdsourced speech recognition corpus. The authors believe this finding suggests that ChatGPT is “capable of generating more natural spoken languages than these commercial translation systems,” hinting at a possible future area of study.