The researchers compared Gemini Pro, GPT-3.5 Turbo, and GPT-4 Turbo against established systems like Google Translate and benchmarked them against NLLB-MoE, an open-source machine translation (MT) model known for its extensive language coverage.
The models were evaluated across 20 languages with varying levels of resource availability and translation difficulty, with a particular focus on translation from English into other languages (ENG→X). To evaluate the outputs, the researchers used standard metrics such as BLEU and chrF2++.
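To give a sense of what a character-level metric such as chrF measures, here is a minimal sketch of a character n-gram F-score. It is a simplified illustration, not the implementation the researchers used. The function names `char_ngrams` and `simple_chrf` are hypothetical, whitespace is simply dropped, and the word n-gram component that the "++" variant adds is omitted.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score on a 0-100 scale.

    beta=2 weights recall more heavily than precision.
    Word n-grams are NOT included in this sketch.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A translation identical to its reference scores 100, one that shares no characters with it scores 0, and near misses fall in between. This makes character-level metrics more forgiving of small spelling or inflection differences than word-level metrics.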
Google Translate outperformed the other systems overall, achieving the best results in 10 languages. The language models were competitive but fell short when translating from English into other languages.
GPT-4 Turbo’s performance diverged from that of GPT-3.5 Turbo and Gemini Pro. Notably, GPT-4 Turbo showed larger improvements for low-resource languages, whereas the large language models (LLMs) performed similarly on high-resource languages.
Gemini Pro outperformed both GPT-3.5 Turbo and GPT-4 Turbo in five out of 20 languages, achieving top performance in three of them. However, it tended to block responses when its confidence was lower, which affected approximately 10 language pairs. The researchers attributed Gemini Pro’s lower performance in some languages to this tendency.
A closer examination revealed that Gemini Pro marginally outperformed GPT-3.5 Turbo and GPT-4 Turbo on unblocked samples, where it demonstrated higher confidence. Specifically, it surpassed GPT-4 Turbo by 1.6 chrF points in the 5-shot setting and 2.6 chrF points in the 0-shot setting, and exceeded GPT-3.5 Turbo by 2.7 and 2 chrF points in the 5-shot and 0-shot settings, respectively.
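The effect of restricting the comparison to unblocked samples can be sketched as follows. This is an illustrative example only: the record layout, the `mean_gap` helper, and all scores are hypothetical values made up for demonstration, not data from the study. Averaging per-sample scores is also a simplification, since chrF is normally reported at the corpus level.

```python
from statistics import mean

# Hypothetical per-sample scores (made-up illustration values).
# A blocked response is scored as 0 for the system that refused to answer.
samples = [
    {"gemini": 52.0, "gpt4": 50.0, "blocked": False},
    {"gemini": 0.0, "gpt4": 41.0, "blocked": True},
    {"gemini": 47.0, "gpt4": 46.0, "blocked": False},
]


def mean_gap(records, system_a, system_b, unblocked_only):
    """Average score difference (system_a minus system_b).

    When unblocked_only is True, samples flagged as blocked are excluded.
    """
    rows = [r for r in records if not (unblocked_only and r["blocked"])]
    return mean(r[system_a] - r[system_b] for r in rows)


gap_all = mean_gap(samples, "gemini", "gpt4", unblocked_only=False)
gap_unblocked = mean_gap(samples, "gemini", "gpt4", unblocked_only=True)
```

In this toy data the gap is negative across all samples but positive once blocked samples are excluded, which mirrors the kind of reversal described above: a system can trail on the full test set yet lead on the subset it actually answered.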
Despite the observed challenges in translating certain samples, the authors emphasized Gemini Pro’s competitive performance relative to other models on Cyrillic scripts, in contrast to its underperformance on other scripts. GPT-4 Turbo stood out, outperforming both Gemini Pro and GPT-3.5 Turbo across various scripts, and it was particularly effective in languages using the Devanagari script.
The authors concluded with a recommendation for researchers and practitioners to consider Gemini Pro as a “valuable tool in their toolkit, comparable to GPT-3.5 Turbo.”
Despite acknowledged limitations, the study provided a transparent and reproducible analysis, inviting the community to explore and scrutinize the findings. For those interested in reproducing the results, the code and data can be found at https://github.com/neulab/gemini-benchmark.