Their primary objective was to determine whether pre-translation remains necessary for optimal LLM performance, with a specific focus on PaLM2, which is known for its high performance in multilingual tasks.
“By revealing PaLM2’s superiority with direct inference and offering robust evaluation tools, we aim to inspire further LLM development that transcends pre-translation, paving the way for seamless multilingual communication,” they said.
The team conducted a comprehensive comparative analysis of direct inference and pre-translation using PaLM2 models across a variety of discriminative and generative tasks in multiple languages. The analysis covered 108 languages and six diverse benchmarks, encompassing both close-ended tasks, such as multiple-choice question answering and reasoning, and open-ended tasks, such as text generation for attributive question answering and summarization.
Limited Exploration
In close-ended tasks, the model selects the correct answer from predefined options, focusing on specific information retrieval or confirmation. Open-ended tasks assess the model’s generative abilities by requiring it to generate text. Attributive question answering evaluates the model’s accuracy in responding to natural language questions, while text summarization condenses lengthy texts into concise pieces conveying essential information.
The researchers noted that recent studies have explored the impact of pre-translation on discriminative tasks, but there has been limited exploration into the impact on the generative capabilities of LLMs.
The pre-translation pipeline involved translating the input question from the source language to English, processing it, and then translating the generated answer back to the source language. In contrast, the direct inference pipeline processed the input directly in the source language, without any translation, and generated the answer in that same language.
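The difference between the two pipelines described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' code: the `translate` and `generate` callables, their signatures, and the stub implementations are hypothetical stand-ins for a translation service and an LLM call.

```python
from typing import Callable

# Hypothetical signatures (assumptions, not a real API):
#   translate(text, source_lang, target_lang) -> translated text
#   generate(prompt) -> model answer
Translate = Callable[[str, str, str], str]
Generate = Callable[[str], str]


def pre_translation_answer(
    question: str, lang: str, translate: Translate, generate: Generate
) -> str:
    """Translate to English, run the model, translate the answer back."""
    english_question = translate(question, lang, "en")
    english_answer = generate(english_question)
    return translate(english_answer, "en", lang)


def direct_inference_answer(question: str, generate: Generate) -> str:
    """Run the model on the source-language input with no translation step."""
    return generate(question)


# Stub implementations so the sketch runs without any external service.
def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"


def fake_generate(prompt: str) -> str:
    return f"answer({prompt})"


if __name__ == "__main__":
    print(pre_translation_answer("swali", "sw", fake_translate, fake_generate))
    print(direct_inference_answer("swali", fake_generate))
```

The stubs make the structural point visible: pre-translation wraps the model call in two translation steps, each of which can introduce errors, while direct inference calls the model once.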
Linguistic Authenticity
The researchers evaluated two PaLM2 variants: PaLM2-S (Bison) and PaLM2-L (Unicorn). For pre-translation, they employed the Google Translate API. The results revealed that direct inference with the PaLM2 models outperformed the pre-translation approach in 94 out of 108 languages.
However, pre-translation consistently showed superiority in seven languages: Bambara, Cusco-Collao Quechua, Lingala, Oromo, Punjabi, Tigrinya, and Tsonga. All of them are low-resource languages (LRLs), and four of the seven are African languages, suggesting a need for special attention when creating multilingual training sets, particularly for African languages.
Further analysis focusing on LRLs indicated that, although direct inference with PaLM2 might be expected to face challenges in these languages, over 85% of them actually benefited from direct inference, with significant improvements observed in the majority. This suggests that the observed performance differences may have regional origins, which underscores the need for further investigation and for customized approaches to enhance model performance in multilingual tasks, particularly for specific language families and regions.
The researchers concluded that “these findings pave the way for more efficient and effective multilingual applications, alleviating the limitations associated with pre-translation and unlocking linguistic authenticity.”
Authors: Yotam Intrator, Matan Halfon, Roman Goldenberg, Reut Tsarfaty, Matan Eyal, Ehud Rivlin, Yossi Matias, Natalia Aizenberg