The researchers explained that each version has strengths and weaknesses. Sentence-level translations tend to be more fluent and accurate within each sentence but often lack consistency across the document. For example, the same term might be translated differently from one sentence to the next. Document-level translations, by contrast, are more consistent and context-aware, but may omit details or entire phrases.
To address this, they combine the two outputs and let the LLM refine them into a single, better translation. “We propose finetuning LLMs for translation refinement using two intermediate translations, combining the strengths of both Sent2Sent and Doc2Doc,” the researchers said.
Better Quality
To test the effectiveness of their approach, the researchers fine-tuned two open-source models — LLaMA-3-8B-Instruct and Mistral-Nemo-Instruct.
To train the model, they used a dataset of source documents, the two intermediate translations (sentence-level and document-level), and a human reference translation. The LLM is trained to compare the two inputs and generate a better output.
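The training setup described above can be pictured as a set of prompt/target pairs. The following is a minimal sketch of how one such example might be assembled; the class name, field names, and prompt wording are hypothetical illustrations and are not taken from the researchers' code.

```python
from dataclasses import dataclass


@dataclass
class RefinementExample:
    """One training record (hypothetical field names)."""
    source: str      # source-language document
    sent2sent: str   # intermediate sentence-level translation
    doc2doc: str     # intermediate document-level translation
    reference: str   # human reference translation (training target)


# Assumed prompt layout; the actual prompt used by the researchers may differ.
PROMPT_TEMPLATE = (
    "Source document:\n{source}\n\n"
    "Translation A (sentence-level):\n{sent2sent}\n\n"
    "Translation B (document-level):\n{doc2doc}\n\n"
    "Refined translation:\n"
)


def to_training_pair(example: RefinementExample) -> dict:
    """Turn a record into a prompt/completion pair for supervised fine-tuning."""
    prompt = PROMPT_TEMPLATE.format(
        source=example.source,
        sent2sent=example.sent2sent,
        doc2doc=example.doc2doc,
    )
    return {"prompt": prompt, "completion": example.reference}


example = RefinementExample(
    source="Der Vertrag wurde gestern unterzeichnet.",
    sent2sent="The contract was signed yesterday.",
    doc2doc="The agreement was signed.",
    reference="The agreement was signed yesterday.",
)
pair = to_training_pair(example)
```

In this sketch the model sees the source and both intermediate translations in the prompt, and the human reference serves as the target it learns to produce.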
To help the model focus on the parts that need the most improvement, the researchers introduced a quality-aware training method. Translations that are already close to the final version are given less importance during training, while more difficult or error-prone segments are given more weight — helping the model learn where improvements really matter.
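One way to picture this weighting is as a per-example multiplier on the training loss. The sketch below is an illustration under stated assumptions, not the researchers' formula: it uses Python's difflib string similarity as a stand-in for a real quality metric, and the function names and the floor parameter are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(hypothesis: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real quality metric."""
    return SequenceMatcher(None, hypothesis, reference).ratio()


def quality_aware_weight(sent2sent: str, doc2doc: str, reference: str,
                         floor: float = 0.1) -> float:
    """Give a low weight when an intermediate translation is already close to the reference.

    `floor` (an assumed parameter) keeps easy examples from being ignored entirely.
    """
    best = max(similarity(sent2sent, reference), similarity(doc2doc, reference))
    return max(floor, 1.0 - best)


def weighted_loss(losses: list[float], weights: list[float]) -> float:
    """Weighted average of per-example losses."""
    total = sum(weights)
    return sum(loss * w for loss, w in zip(losses, weights)) / total


# An example whose intermediate translation already matches the reference...
easy = quality_aware_weight("The cat sleeps.", "The cat sleeps.", "The cat sleeps.")
# ...versus one where both intermediate translations share nothing with it.
hard = quality_aware_weight("abc", "abd", "xyz")

# With equal per-example losses swapped for unequal weights, the harder example dominates.
loss = weighted_loss([2.0, 4.0], [easy, hard])
```

Here the easy example receives the floor weight and the hard one receives the full weight, so the averaged loss is pulled toward the segment that still needs improvement.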
The method was tested on ten language directions, including English to and from German, French, Chinese, and Russian. Across all tasks, the dual-translation refinement approach outperformed models trained to refine only one version of the translation, according to the researchers.
“Our refinement approach, based on the two intermediate translations […], significantly improves translation performance across all language pairs,” they said.
For example, using this method, LLaMA-3-8B-Instruct gained up to +2.7 COMET points. Mistral-Nemo-Instruct showed similar improvements. This suggests that even relatively small LLMs (8B parameters in the case of LLaMA-3-8B-Instruct) can effectively refine translations when properly fine-tuned.
Moreover, the fine-tuned models also improved translations from other systems, including GPT-4o-mini and NLLB, showing that this approach can serve as a post-processing layer even for strong AI translation outputs.
The code is available on GitHub.
Authors: Yichen Dong, Xinglin Lyu, Junhui Li, Daimeng Wei, Min Zhang, Shimin Tao, and Hao Yang