The crux of the issue, according to the paper, is achieving more natural-sounding translation. Current evaluation metrics such as BLEU, however, often reward monotonic and simplistic output. In practice, this means that MT systems can “cheat” by producing the simplest possible translation, which is not necessarily the highest-quality one, to yield the highest BLEU score.
Lead researcher Markus Freitag told Slator that rather than fix the metric, as many researchers have attempted before, the team at Google decided to look at the problem from a different angle and to try to remove the translation bias.
“Before we can actually improve the MT system, we need to get a better evaluation of translation quality, both automated and human,” Freitag said. “Once we’re happy with the automated evaluation, we want to improve the underlying MT system, which is actually really difficult to improve.”
In Other Words…
Research has shown that when humans paraphrase standard references for use in automated MT evaluation, the automated evaluation correlates better with human judgment. Most notably, paraphrasing sidesteps the system’s preference for monotonic translations that contain the same words as the reference, resulting in a fairer assessment of alternative, equally good translations.
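To make the overlap bias concrete, here is a minimal sketch in Python of the clipped n-gram precision that sits at the core of BLEU. It is not the full metric (there is no brevity penalty and no averaging over n-gram orders), and it is not the researchers' code. The example sentences and the names `clipped_precision`, `standard_ref`, `paraphrased_ref`, `literal_hyp` and `free_hyp` are invented purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=2):
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram's match count is clipped to its count in the reference,
    as in BLEU's modified precision. Tokenization is a plain whitespace
    split, which is a simplifying assumption.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / total


# Invented toy data: one standard reference, one paraphrase of it,
# one literal system output and one freer system output.
standard_ref = "the committee has approved the new budget"
paraphrased_ref = "the new budget got the green light from the committee"
literal_hyp = "the committee has approved the new budget plan"
free_hyp = "the new budget got approval from the committee"

for name, ref in [("standard", standard_ref), ("paraphrased", paraphrased_ref)]:
    lit = clipped_precision(literal_hyp, ref)
    free = clipped_precision(free_hyp, ref)
    print(f"{name:12s} literal={lit:.2f} free={free:.2f}")
```

With these toy sentences, the literal output wins against the standard reference (6 of 7 bigrams matched versus 3 of 7), while the freer output wins against the paraphrased reference (5 of 7 versus 3 of 7). Real results depend on the data; the sketch only shows how the choice of reference shifts which output gets rewarded.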
Without providing the source sentences, researchers asked professional linguists to paraphrase reference translations as much as possible, which included using different wording and sentence structures, while keeping the reference a natural instance of the target language.
A second group of professional human translators was asked to rate the reference translations, both paraphrased and not, in side-by-side evaluations, again without the source sentences. The vast majority of human translators preferred the paraphrased reference translations, indicating that they were of a higher quality than the MT output.
The researchers then evaluated, step by step, the design choices behind the best-performing English–German system from WMT 2019, with the goal of determining their impact on standard reference BLEU versus their impact on paraphrased BLEU (referred to in the paper as BLEUP). Those steps included data cleaning, fine-tuning, and back-translation.
Their findings showed that engines optimized for BLEUP made gains in adequacy and fluency when evaluated by humans, and produced noticeably less literal translations. Moreover, as BLEUP scores increased, standard BLEU scores tended to decrease.
“Paraphrased automatic evaluation therefore seems to be a promising proxy for human evaluation when making design choices for MT systems,” the researchers concluded.
BLEU-Colored Glasses
There are still use cases where clients might purposely want a simpler, more direct translation, such as content intended for language learners, pharmaceutical translations, and dubbing or speech translation that needs to match a speaker’s actions as closely as possible.
That said, Freitag considers BLEUP a promising replacement for BLEU, and expects it would be helpful for any content containing longer, well-formed sentences, and across the board for high-resource language pairs.
“A lot of people are actually asking me, ‘Can you provide it for other language pairs?’” Freitag said. “Of course that would be nice, but costly. Even Google is very open to releasing the data and doing a lot of work, but it’s definitely not feasible.”
Ultimately, of course, the goal of industry-backed research is to integrate findings into consumer-facing products. Freitag confirmed that the research behind the paper was basically a test batch for Google’s production system.
“The next step is to incorporate everything into Google Translate and hopefully get a better translation experience,” Freitag said, adding that there are several improvements for high-resource languages in the pipeline. Rather than update the model every two or three months, Google wants to combine the updates and launch them together. “Hopefully, by the beginning of next year we’ll have something in the works.”
Outside of Google’s offerings, Freitag sees an opportunity for this research to spark a reevaluation of past design decisions that ended up favoring simple, monotonic translations.
“A lot of decisions we made in past years were actually driven by BLEU,” Freitag said. “I think this is an investment for the future and a lot of other researchers could benefit. It could actually show research done a few years ago, because of BLEU, could be better.”