In addition, the authors emphasized that “the majority of LRLs are largely neglected in language technologies” in general, with current MT systems either performing poorly on them or not including them at all. “Some commercial systems like Google Translate support a number of LRLs, but many systems do not support any,” they said.
The authors pointed out that their work differs from existing studies in its focus on end users. The study covers 204 languages, 168 of them LRLs, with the aim of addressing the needs of LRL communities that are frequently overlooked in discussions of language technology. “We include more languages than any existing work […] to address the needs of various LRL communities,” they explained.
To conduct their research, the team used data from FLORES-200 (an evaluation benchmark) and queried the OpenAI API to translate their test set from English into the target languages.
They evaluated ChatGPT’s MT performance across the entire language set and compared it with NLLB-MOE as their baseline, as it is the current state-of-the-art open-source MT model with wide language coverage. Comparative evaluations were also carried out against results from subsets of selected languages using Google Translate and GPT-4.
In their exploration of MT prompts, they employed both zero- and five-shot approaches for ChatGPT MT. Outputs were scored with two automatic metrics: spBLEU, which computes BLEU over SentencePiece subword tokens, and chrF2++, a character n-gram F-score that also counts word n-grams.
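To make the zero- versus five-shot distinction concrete, here is a minimal sketch of how such prompts might be assembled. The template wording and the function name `build_mt_prompt` are illustrative assumptions; the article does not reproduce the prompts the authors actually used.

```python
from typing import Sequence, Tuple


def build_mt_prompt(
    source_sentence: str,
    target_language: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a translation prompt (hypothetical template, not the authors').

    With no examples this is a zero-shot prompt; passing five
    (English, translation) pairs makes it a five-shot prompt.
    """
    lines = [f"Translate the following English text into {target_language}."]
    # Each in-context example adds two lines: the source and its translation.
    for english, translation in examples:
        lines.append(f"English: {english}")
        lines.append(f"{target_language}: {translation}")
    # The sentence to translate goes last, with the target side left open.
    lines.append(f"English: {source_sentence}")
    lines.append(f"{target_language}:")
    return "\n".join(lines)


zero_shot = build_mt_prompt("Good morning.", "French")
one_shot = build_mt_prompt("Good morning.", "French", [("Thank you.", "Merci.")])
```

The only difference between the two settings is the number of example pairs placed before the sentence to translate, which is also why few-shot prompts are longer.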
The results suggest that while ChatGPT models approach or even surpass the performance of traditional MT models for some high-resource languages, they consistently lag for LRLs. Overall, ChatGPT underperformed traditional MT for 84.1% of the languages studied, and African languages emerged as a particular challenge.
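A figure such as the share of languages on which one system trails another can be derived from per-language scores. The sketch below shows the arithmetic only; the scores and language names are placeholders invented for illustration, not results from the paper, and `share_underperforming` is a hypothetical helper.

```python
from typing import Mapping


def share_underperforming(
    system: Mapping[str, float], baseline: Mapping[str, float]
) -> float:
    """Return the percentage of shared languages where `system` scores below `baseline`."""
    shared = system.keys() & baseline.keys()
    if not shared:
        raise ValueError("no languages in common")
    worse = sum(1 for lang in shared if system[lang] < baseline[lang])
    return 100.0 * worse / len(shared)


# Placeholder scores (NOT from the paper); higher is better.
llm_scores = {"lang_a": 30.0, "lang_b": 12.5, "lang_c": 8.0, "lang_d": 41.0}
baseline_scores = {"lang_a": 28.0, "lang_b": 20.0, "lang_c": 15.5, "lang_d": 43.0}

print(share_underperforming(llm_scores, baseline_scores))
```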
Language Resources and Costs
The researchers also examined language features, including language resources, language family, and script, to assess the effectiveness of LLMs.
This analysis aimed to uncover trends that could guide end users in selecting the most appropriate MT system for their specific language. “Analyzing this may reveal trends helpful to end users deciding which MT system to use, especially if their language is not represented here but shares some of the features we consider,” they said.
According to the authors, a language’s resource level is the most important feature in predicting ChatGPT’s MT effectiveness, while script is the least important.
The authors also stressed financial aspects, particularly as they pertain to LLM users. “We evaluate monetary costs, since they are a concern for LLM users,” the authors said. Few-shot prompts can bring modest improvements in translation quality, but they cost more: the in-prompt examples lengthen the input, and charges apply to both input and output tokens.
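The cost difference follows from simple arithmetic on token counts. The sketch below assumes per-1,000-token pricing with separate input and output rates; the prices and token counts are hypothetical placeholders, not current OpenAI rates or figures from the paper.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate the cost of one request when input and output tokens are both billed."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


# Hypothetical prices in USD per 1,000 tokens (assumptions for illustration).
INPUT_PRICE = 0.001
OUTPUT_PRICE = 0.002

# Hypothetical token counts: the five-shot prompt carries five extra example
# pairs, so its input is much longer while the output length is unchanged.
zero_shot_cost = estimate_cost_usd(50, 60, INPUT_PRICE, OUTPUT_PRICE)
five_shot_cost = estimate_cost_usd(350, 60, INPUT_PRICE, OUTPUT_PRICE)

print(f"zero-shot: ${zero_shot_cost:.5f}, five-shot: ${five_shot_cost:.5f}")
```

Under these assumed numbers the output charge is identical in both settings, so the whole difference comes from the longer input.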
The authors emphasized that they want to help end users of various language communities know how and when to use LLM MT. “We expect that our contributions may benefit both direct end users, such as LRL speakers in need of translation, and indirect users, such as researchers of LRL translation considering ChatGPT to enhance specialized MT systems,” they concluded.