How Large Language Models Can Get Better at Machine Translation of Rare Words

A paper published on February 15, 2023, by Meta AI natural language processing (NLP) researchers Marjan Ghazvininejad, Hila Gonen, and Luke Zettlemoyer explores ways to improve the machine translation (MT) capabilities of large language models (LLMs) through prompting with bilingual dictionaries. Prompting means wording an input so that it gives an AI model, in this case an LLM, a precise instruction.

The Meta AI researchers point out that (still) relatively few studies exist on prompting language models for machine translation. The problem the researchers specifically wanted to address is that “LLMs can struggle to translate inputs with rare words, which are common in low resource or domain transfer scenarios.”

Since general dictionaries are easily accessible, including those for low-resource languages, they are frequently used to train or improve supervised machine translation models. Here, the researchers set out to show that using bilingual dictionaries to insert control inputs into the prompt (multiple translations for a subset of the input words) produces better MT results than the baseline.

New paper: Dictionary-based Phrase-level Prompting of Large Language Models for Machine Translation

TLDR: We show that LLM prompting can provide an effective solution for rare words, by using bilingual dictionaries to provide control hints.

With @hila_gonen and @LukeZettlemoyer pic.twitter.com/sd37bjVcR3
— Marjan Ghazvininejad (@gh_marjan) February 17, 2023

Applying Word-Level Dictionary Data Directly

The methodology they used, dubbed “Dictionary-based Prompting for Machine Translation,” or DIPMT, goes a step further than previous research into low-resource and domain-transfer scenarios because it requires no model training. Instead, DIPMT uses prompting-based translation, in which word-level dictionary data is inserted directly into the prompt.

The researchers based their experiments on two large language models, OPT for English and BLOOM for the multilingual set. For the out-of-domain evaluation, they used data from Aharoni and Goldberg (medical, law, IT, and Koran). From the training set, they also removed sentences longer than 250 tokens and sentence pairs with a source/target length ratio of more than 1.5.
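The two filtering rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function names are hypothetical, tokens are approximated by whitespace splitting (the article does not say which tokenizer was used), and the length ratio is assumed to be checked in both directions.

```python
def keep_pair(source: str, target: str,
              max_tokens: int = 250, max_ratio: float = 1.5) -> bool:
    """Return True if a sentence pair passes both filters.

    Assumption: whitespace tokens stand in for the real tokenizer.
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False  # empty sides cannot form a usable pair
    if src_len > max_tokens or tgt_len > max_tokens:
        return False  # drop sentences longer than 250 tokens
    # Assumption: the ratio is symmetric (longer side / shorter side).
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio


def filter_pairs(pairs):
    """Keep only the (source, target) pairs that pass keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t)]
```

A pair with three tokens on each side is kept, while a pair with one source token and three target tokens is dropped because its ratio exceeds 1.5.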

Each DIPMT prompt had three parts: (a) the source sentence (along with the translation instruction and the target language); (b) the dictionary-based word-level translations; and (c) the translation into the target language, which the model is expected to generate.

For the baseline, researchers used a prompt format without dictionary-based word-level translations. The baseline had two parts: (a) the source sentence (along with the translation instruction and the target language); and (b) the translation into the target language.
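To make the difference between the two prompt formats concrete, here is a minimal Python sketch that builds either one. The function name and the exact template wording are illustrative assumptions, not the researchers' verbatim prompt. The only structure taken from the description above is the sequence of parts: instruction plus source sentence, optional dictionary hints, and a final slot the model is expected to complete.

```python
from typing import Dict, List, Optional


def build_prompt(source: str, target_lang: str,
                 hints: Optional[Dict[str, List[str]]] = None) -> str:
    """Build a translation prompt; without hints it is the baseline format.

    Assumption: the template wording below is hypothetical.
    """
    # Part (a): translation instruction, target language, source sentence.
    parts = [f"Translate the following sentence to {target_lang}: {source}"]
    # Part (b): word-level dictionary translations (omitted for the baseline).
    if hints:
        clauses = []
        for word, translations in hints.items():
            options = ", ".join(f'"{t}"' for t in translations)
            clauses.append(f'the word "{word}" means {options}')
        parts.append("In this context, " + "; ".join(clauses) + ".")
    # Final part: left open for the model to generate the translation.
    parts.append(f"The full translation to {target_lang} is:")
    return "\n".join(parts)


baseline = build_prompt("Der Hund schläft.", "English")
with_hints = build_prompt("Der Hund schläft.", "English",
                          hints={"Hund": ["dog", "hound"]})
```

Offering several candidate translations for a word, rather than a single one, leaves the model free to pick the sense that fits the context.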

Further Improvement with Domain-Specific Dictionaries

The researchers experimented with translation to and from English using multiple languages and the above-mentioned language models, as well as out-of-domain translation. Out-of-domain data is known to impact translation quality precisely because it would not typically be included in model pre-training.

Domain-specific bilingual dictionaries are not as easy to find as general dictionaries, if they even exist. Likewise, not all source word types have a dictionary equivalent. To account for these factors, the researchers employed parallel data available for each domain and created domain-specific dictionaries.
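The article does not describe how the dictionaries were extracted from the parallel data. As a rough illustration of the general idea only, the Python sketch below scores word pairs by how often they co-occur in aligned sentences, using the Dice coefficient. This is a deliberately crude stand-in, not the researchers' method: production pipelines typically rely on a dedicated word aligner, and the function name, threshold, and `top_k` parameter are all hypothetical.

```python
from collections import Counter
from itertools import product


def build_dictionary(pairs, min_score=0.5, top_k=2):
    """Map each source word to its best-scoring target words.

    Assumption: Dice co-occurrence is a simple substitute for word alignment.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every source/target word pair seen in the same sentence pair.
        joint.update(product(src_words, tgt_words))

    candidates = {}
    for (s, t), c in joint.items():
        dice = 2 * c / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            candidates.setdefault(s, []).append((dice, t))

    # Highest score first; ties are broken alphabetically for determinism.
    return {
        s: [t for _, t in sorted(scored, key=lambda x: (-x[0], x[1]))[:top_k]]
        for s, scored in candidates.items()
    }
```

On a small corpus this approach also pairs content words with frequent function words, which is one reason real systems use stronger alignment models and additional filtering.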

The experiments showed that the dictionary-based methodology, which adds possible word-level equivalents via prompting, outperformed the baseline by an average of 9.4 BLEU points.

The DIPMT approach appears to be promising as a way to improve MT quality, especially in domain-specific content (compared to other methodologies, such as domain-specific data augmentation through back translation).

The next step would be to include human experts in the evaluation phase. This would be a “data full circle” of sorts since the dictionaries and the parallel data used for prompting were all produced by humans.
