Their experiments involved data filtering, training through reinforcement learning (RL), inference-time reranking techniques, and a combination of these methods.
The researchers underlined the novelty of their work, stating that “none of the previous work has systematically compared the effect of integrating metrics at different stages of the MT pipeline or has attempted to combine these techniques in a unified approach.”
Feedback at an Early Stage
A significant contribution of the study is the proposal of an alternative data filtering method using COMET-QE. The researchers explained that COMET-QE is an ideal preference model for data filtering, being a multilingual reference-free neural-based metric trained on human annotations of translation quality, “accurate” in QE, and with a “superior alignment with human judgments.”
The proposed method aims to curate high-quality datasets, effectively minimizing RL training instability. The researchers said that such quality-aware data filtering could “significantly increase the performance of MT systems by introducing feedback in an early stage of the pipeline.”
However, the effectiveness of this process depends on the selected metric. Using metrics not closely aligned with human judgments can result in poorly correlated and misaligned sentences, making the training process more unstable. Therefore, the use of robust QE models — such as COMET-QE or the more recent COMETKIWI model — is important.
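To make the filtering step concrete, here is a minimal sketch in Python of quality-aware data filtering with a reference-free scorer. It is an illustration under stated assumptions, not the researchers' implementation: the names filter_by_quality and toy_qe_score and the threshold value are hypothetical, and the toy scorer (a crude length-ratio heuristic) merely stands in for a real QE model such as COMET-QE.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, candidate translation)


def filter_by_quality(
    pairs: List[Pair],
    qe_score: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Keep only pairs whose reference-free quality score meets the threshold."""
    return [(src, tgt) for src, tgt in pairs if qe_score(src, tgt) >= threshold]


def toy_qe_score(src: str, tgt: str) -> float:
    """Hypothetical stand-in scorer (NOT COMET-QE): token-length ratio in [0, 1]."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return 0.0
    return min(s, t) / max(s, t)


# Illustrative corpus: one plausible pair, one empty target, one noisy target.
corpus = [
    ("the cat sat on the mat", "le chat est assis sur le tapis"),
    ("good morning", ""),
    ("see you tomorrow",
     "nous vous verrons demain et voici du texte sans rapport avec la source"),
]

# The threshold of 0.5 is an assumption chosen for this toy scorer only.
filtered = filter_by_quality(corpus, toy_qe_score, threshold=0.5)
```

Because the scorer is passed in as a plain function, swapping the heuristic for a learned QE model would not change the filtering logic. As the article notes, the choice of scorer is what determines whether the filtered data is actually better.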
Additionally, the researchers noted that quality metrics can have a “pivotal role” in classic RL training by providing rewards to optimize the MT model’s performance.
The RL-based training process involves a neural machine translation (NMT) model that generates translations, which are in turn scored by the reward model; each reward indicates the quality of a translation. These rewards are then used by the policy gradient algorithm to refine the NMT model’s policy.
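The loop described above can be sketched with a toy policy gradient (REINFORCE) update. This is a simplified illustration, not the setup used in the study. The "policy" is a softmax over a fixed set of three candidate translations rather than a full NMT model, the reward values are made up for the example, and the names reinforce and toy_rewards are hypothetical. For simplicity the baseline is the exact expected reward, which is only computable here because the candidate set is tiny.

```python
import math
import random
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(
    candidates: List[str],
    reward_fn: Callable[[str], float],
    steps: int = 500,
    lr: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Toy REINFORCE: returns the final probability of each candidate."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # start from a uniform policy
    rewards = [reward_fn(c) for c in candidates]
    for _ in range(steps):
        probs = softmax(logits)
        # "Generate" a translation by sampling from the current policy.
        a = rng.choices(range(len(candidates)), weights=probs)[0]
        # Baseline: expected reward under the current policy (variance reduction).
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[a] - baseline
        # For a softmax policy, d log pi(a) / d logit_j = 1[j == a] - probs[j].
        for j in range(len(logits)):
            grad_log_prob = (1.0 if j == a else 0.0) - probs[j]
            logits[j] += lr * advantage * grad_log_prob
    return softmax(logits)


# Made-up reward values standing in for a reward model's scores.
toy_rewards = {"hyp_a": 0.2, "hyp_b": 0.9, "hyp_c": 0.5}
candidates = list(toy_rewards)
final_probs = reinforce(candidates, toy_rewards.__getitem__)
```

Under these assumptions the policy shifts probability mass toward the candidate with the highest reward, which is the mechanism by which a reward model steers the translation model during training.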
In contrast to previous works that predominantly used BLEU as the reward function, this study once again highlighted the limitations of BLEU. This prompted the researchers to leverage robust preference models during RL training, such as the reference-based COMET and the reference-free COMET-QE. The researchers explained that by incorporating these pre-trained preference models, the RL systems can better capture nuanced user preferences by receiving human-like feedback as rewards.
The results revealed that, in some cases, RL-based training alone did not yield significant improvements, but when combined with high-quality training datasets, it resulted in substantial enhancements.
The performance gains with COMET-QE, used as both data filter and reward model, emphasized the potential of RL-based NMT models trained with a QE reward model to outperform other RL-trained models. According to the researchers, this suggests promising opportunities for unsupervised NMT training with monolingual data, especially for low-resource languages, by eliminating the need for reference translations in evaluation and reward signal generation.
Prioritizing Human-Aligned Translations
Finally, the researchers recommended using quality metrics as rerankers during the decoding phase to prioritize and select translations aligned with human judgments, thereby minimizing the risk of generating inaccuracies.
“By doing this, the MT system will prioritize translations that are more aligned with human judgments, therefore reducing the chances of generating severely incorrect translations,” they said.
The researchers suggested that even if the underlying model has already undergone RL training using the same or a different preference model, incorporating preference models during the decoding stage can further improve translation quality.
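As a rough illustration of reranking at decoding time, the sketch below scores an N-best list of candidate translations with a reference-free scorer and returns them best-first. It is a sketch under assumptions rather than the researchers' method. The names rerank and toy_qe_score are hypothetical, and the toy scorer is a simple length-ratio heuristic standing in for a preference model such as COMET-QE.

```python
from typing import Callable, List, Tuple


def rerank(
    source: str,
    candidates: List[str],
    qe_score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each candidate against the source and sort best-first."""
    scored = [(cand, qe_score(source, cand)) for cand in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_qe_score(src: str, tgt: str) -> float:
    """Hypothetical stand-in scorer (NOT COMET-QE): token-length ratio in [0, 1]."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return 0.0
    return min(s, t) / max(s, t)


# An illustrative N-best list, as a decoder might produce via beam search or sampling.
source = "the cat sat on the mat"
n_best = [
    "le chat",
    "le chat est assis sur le tapis",
    "le chat est assis",
]

ranked = rerank(source, n_best, toy_qe_score)
best_translation = ranked[0][0]
```

The reranker only needs the source and the candidates, so it can sit on top of any decoder, including one whose model was already trained with RL, which matches the researchers' point that the two techniques can be combined.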