Tencent Pits ChatGPT Translation Quality Against DeepL and Google Translate

Since OpenAI launched ChatGPT in November 2022, headlines have asked whether workers in a range of fields should worry about being replaced by the advanced AI chatbot. Now, a January 2023 paper from Chinese tech company Tencent asks the question on behalf of the language industry: Is ChatGPT A Good Translator?

The Tencent team went about answering the question by reviewing, shall we say, a limited set of data. The team wrote that “obtaining the translation results from ChatGPT is time-consuming since it can only be interacted with manually and can not respond to large batches. Thus, we randomly sample 50 sentences from each set for evaluation.” So, let’s see what insights the team gathered by evaluating 50 sentences per test set.
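For readers wondering what "randomly sample 50 sentences from each set" looks like in practice, here is a minimal sketch. The function name, the fixed seed, and the placeholder test set are assumptions for illustration; the paper does not say how the sampling was implemented.

```python
import random


def sample_sentences(sentences, k=50, seed=0):
    """Draw k sentences uniformly at random, without replacement.

    The seed is an assumption added so the draw is repeatable;
    the paper does not mention one.
    """
    rng = random.Random(seed)
    return rng.sample(sentences, k)


# Hypothetical stand-in for a test set; a real run would load actual source sentences.
test_set = [f"sentence {i}" for i in range(1000)]
subset = sample_sentences(test_set)
print(len(subset))
```

A sample this small keeps manual querying manageable, but scores computed on 50 sentences can swing noticeably from one draw to the next.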

According to the paper, ChatGPT performs “competitively” with commercial machine translation (MT) products, such as Google Translate, DeepL and Tencent’s own system, on high-resource European languages, but struggles with low-resource or unrelated language pairs.

In other words, as one observer on Twitter quipped, “Potential alternative headline / interpretation: ‘ChatGPT was trained for translation on common publicly available parallel corpora.’”

For this “preliminary study,” Tencent AI Lab researchers Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu evaluated three aspects of ChatGPT: translation prompts, multilingual translation, and translation robustness.

Meta Moment

The experiment started with a “meta” moment, when the team asked ChatGPT itself for prompts or templates that would trigger its MT ability. The prompt that produced the best Chinese–English translations was then used for the rest of the study — 12 directions total between Chinese, English, German, and Romanian.
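The figure of 12 directions follows from pairing each of the four languages with every other language as source and target. A short sketch of that count (the variable names are illustrative, not from the paper):

```python
from itertools import permutations

languages = ["Chinese", "English", "German", "Romanian"]

# Each ordered (source, target) pair is one translation direction,
# so English -> German and German -> English count separately.
directions = list(permutations(languages, 2))

for source, target in directions:
    print(f"{source} -> {target}")

print(len(directions))  # 4 languages x 3 possible targets each = 12
```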

Researchers were curious as to how ChatGPT’s performance might vary by language pair. While ChatGPT performed “competitively” with Google Translate and DeepL for English–German translation, its BLEU score for English–Romanian translation was 46.4% lower than that of Google Translate.

The team attributed the poor performance to the pronounced difference in monolingual data for English and Romanian, which “limits the language modeling capability of Romanian.”

Romanian–English translation, on the other hand, “can benefit from the strong language modeling capability of English such that the resource gap of parallel data can be somewhat compensated,” for a BLEU score just 10.3% below Google Translate.
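Percentages such as "46.4% lower" and "10.3% below" are relative gaps, measured against Google Translate's BLEU score rather than as absolute BLEU points. A minimal sketch of that arithmetic follows; the function name and the example scores are hypothetical and are not figures from the paper.

```python
def relative_gap(system_bleu, baseline_bleu):
    """Return the system's BLEU gap as a percentage of the baseline score.

    Negative values mean the system scored below the baseline.
    """
    return (system_bleu - baseline_bleu) / baseline_bleu * 100


# Hypothetical scores chosen only to illustrate the calculation.
baseline = 40.0  # assumed baseline BLEU
system = 30.0    # assumed system BLEU
print(f"{relative_gap(system, baseline):.1f}%")  # -25.0%
```

Reporting the gap this way makes results comparable across language pairs whose absolute BLEU scores differ widely.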

Beyond the Family

Beyond resource differences, the authors wrote, translating between language families is considered more difficult than translating within language families. The difference in the quality of ChatGPT’s output for German–English versus Chinese–English translation seems to bear this out.

Researchers observed an even greater performance gap between ChatGPT and commercial MT systems for low-resource language pairs from different families, such as Romanian–Chinese.

“Since ChatGPT handles different tasks in one model, low-resource translation tasks not only compete with high-resource translation tasks, but also with other NLP tasks for the model capacity, which explains their poor performance,” they wrote.

They only sampled 50 sentences, since they don’t know how to automate ChatGPT translation
— Ofer Rahat (@OferRahat) January 25, 2023

Google Translate and DeepL both surpassed ChatGPT in translation robustness on two out of three test sets: WMT19 Bio (Medline abstracts) and WMT20 Rob2 (Reddit comments), likely thanks to their continuous improvement as real-world applications fed by domain-specific and noisy sentences.

However, ChatGPT outperformed Google Translate and DeepL “significantly” on the WMT20 Rob3 test set, which contained a crowdsourced speech recognition corpus. The authors believe this finding suggests that ChatGPT is “capable of generating more natural spoken languages than these commercial translation systems,” hinting at a possible future area of study.
