They proposed a two-step process that leverages human feedback (i.e., error markings) to enhance the capabilities of LLMs.
During the first step, translators identify and mark errors in the machine-generated translations. Error markings are enclosed in <bad> </bad> tags and carry no information about the type or severity of the errors.
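To make the marking format concrete, here is a minimal Python sketch of how such tags could be read back out of a marked translation. The function names and the German example sentence are illustrative assumptions, not taken from the study; only the <bad> </bad> tag format comes from the description above.

```python
import re
from typing import List

# Tag format as described in the article: <bad> ... </bad>
BAD_SPAN = re.compile(r"<bad>(.*?)</bad>", re.DOTALL)


def marked_errors(hypothesis: str) -> List[str]:
    """Return the text of every span a translator marked as erroneous."""
    return [span.strip() for span in BAD_SPAN.findall(hypothesis)]


def strip_markings(hypothesis: str) -> str:
    """Remove the tags but keep the marked words, recovering the raw MT output."""
    return BAD_SPAN.sub(lambda m: m.group(1), hypothesis)


# Hypothetical marked hypothesis (invented for illustration).
example = "Klicken Sie auf <bad>die Knopf</bad>, um die Datei zu <bad>sparen</bad>."

print(marked_errors(example))
print(strip_markings(example))
```

Note that the markings only say where an error is, not what kind of error it is, so the extracted spans carry no type or severity label.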
These error-marked segments are then used to prompt the LLMs, guiding them to focus on correcting the marked errors by referencing similar examples from a post-editing translation memory (PE-TM).
A PE-TM consists of source segments, machine translations, and reference translations, enriched by lightweight human error markings on machine translation. By providing the LLM with instances where errors have been correctly identified and corrected, the LLM can learn from these examples and apply similar corrections to its own translations.
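As a rough illustration of what a PE-TM entry and the lookup of similar examples might look like, consider the Python sketch below. The class name, field names, sample sentences, and the character-level similarity measure from the standard library are all assumptions made for this example; the article does not say how the researchers retrieved similar segments.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class PETMEntry:
    source: str      # English source segment
    marked_mt: str   # machine translation with <bad> </bad> error markings
    reference: str   # reference translation


def most_similar(query: str, memory: List[PETMEntry], k: int = 2) -> List[PETMEntry]:
    """Rank entries by string similarity of their source to the query.

    Assumption: a simple difflib ratio stands in for whatever retrieval
    method the study actually used.
    """
    def score(entry: PETMEntry) -> float:
        return SequenceMatcher(None, query.lower(), entry.source.lower()).ratio()

    return sorted(memory, key=score, reverse=True)[:k]


# Invented entries, for illustration only.
memory = [
    PETMEntry(
        "Click the button to save the file.",
        "Klicken Sie auf <bad>die Knopf</bad>, um die Datei zu <bad>sparen</bad>.",
        "Klicken Sie auf die Schaltfläche, um die Datei zu speichern.",
    ),
    PETMEntry(
        "Restart the server after installation.",
        "Starten Sie den <bad>Kellner</bad> nach der Installation neu.",
        "Starten Sie den Server nach der Installation neu.",
    ),
    PETMEntry(
        "Click the button to open the file.",
        "Klicken Sie auf die Schaltfläche, um die Datei zu <bad>eröffnen</bad>.",
        "Klicken Sie auf die Schaltfläche, um die Datei zu öffnen.",
    ),
]

hits = most_similar("Click the button to delete the file.", memory, k=2)
for hit in hits:
    print(hit.source)
```

The retrieved entries would then serve as few-shot examples that show the model a marked hypothesis alongside its corrected version.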
To test the effectiveness of this process, they conducted a pilot study in the IT domain for the English-German language pair. First, to create a PE-TM, they used data from open-source software documentation that had been annotated by professional translators. Then, they employed Llama 13B and GPT-3.5 to generate and correct translations.
They considered three machine translation tasks: machine translation from scratch, automatic post-editing, and post-editing with error markings.
In the first scenario, models were prompted to simply translate the text. In the second, models were prompted to read the original text and the translation hypothesis and then correct the output. In the third, models were prompted to do the same, but to use the provided error markings when making their corrections.
Prompt: Read the English text and the German translation hypothesis and then correct the output. Incorrect words are inside of tags ’<bad> </bad>’. Please use this feedback in your correction. If the hypothesis is already correct, do not make any changes.
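A prompt for the third scenario could be assembled along the following lines. This is a hedged sketch: the instruction text is the one quoted above (with plain quotation marks), but the function name, the "English:", "Hypothesis:", and "Corrected:" field labels, and the layout of the few-shot examples are assumptions, not the researchers' actual template.

```python
from typing import List, Tuple

# Instruction text as quoted in the article.
INSTRUCTION = (
    "Read the English text and the German translation hypothesis and then "
    "correct the output. Incorrect words are inside of tags '<bad> </bad>'. "
    "Please use this feedback in your correction. If the hypothesis is "
    "already correct, do not make any changes."
)


def build_prompt(
    source: str,
    marked_hypothesis: str,
    examples: List[Tuple[str, str, str]],
) -> str:
    """Build a few-shot post-editing prompt.

    `examples` holds (source, marked hypothesis, reference) triples, e.g.
    drawn from a PE-TM. The field labels are hypothetical.
    """
    blocks = [INSTRUCTION]
    for ex_source, ex_marked, ex_reference in examples:
        blocks.append(
            f"English: {ex_source}\n"
            f"Hypothesis: {ex_marked}\n"
            f"Corrected: {ex_reference}"
        )
    # The final block leaves "Corrected:" open for the model to complete.
    blocks.append(
        f"English: {source}\nHypothesis: {marked_hypothesis}\nCorrected:"
    )
    return "\n\n".join(blocks)


# Invented example data, for illustration only.
prompt = build_prompt(
    source="Click the button to delete the file.",
    marked_hypothesis="Klicken Sie auf <bad>den Knopf</bad>, um die Datei zu löschen.",
    examples=[
        (
            "Click the button to save the file.",
            "Klicken Sie auf <bad>die Knopf</bad>, um die Datei zu <bad>sparen</bad>.",
            "Klicken Sie auf die Schaltfläche, um die Datei zu speichern.",
        )
    ],
)
print(prompt)
```

Keeping the markings in-line, directly around the affected words, means the model sees each error flag right next to the tokens it refers to.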
The researchers noted that giving the error markings as in-line tags would be easier for the model to parse and integrate into its output than including another line where errors would be indicated further away from the corresponding tokens.
They found that providing error markings significantly improved the LLM’s ability to correct translations. The approach outperformed translation from scratch and automatic post-editing. “Overall translation quality is improved over few-shot prompt-based translation and over automatic post-editing,” they said.
Additionally, they found that the LLM that produced the translation hypotheses identified its own translations as correct, and therefore did not act on the instructions to correct errors. However, when prompted with error markings, the LLM learned to act on them, with 68% of the edits being correct according to human evaluation, compared to 32% during automatic post-editing.