Instruction-Tuned Large Language Models Outperform Baselines in Medical Translation

In an August 29, 2024 paper, Miguel Rios from the University of Vienna explored how instruction-tuned large language models (LLMs) can improve machine translation (MT) in specialized fields, particularly in the medical domain.

Rios noted that while state-of-the-art LLMs have shown promising results for high-resource language pairs and domains, they often struggle with accuracy and consistency in specialized, low-resource domains. “In specialized domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models,” Rios said.

He also explained that the limitations of LLMs in low-resource domains stem from their training data, which may not adequately cover the specific terminology and contextual nuances required for effective translation.

To address this challenge, Rios proposed improving LLMs’ performance by incorporating specialized terminology through instruction tuning, a technique in which models are fine-tuned on datasets from various tasks formatted as instructions. “Our goal is to incorporate terminology, syntax information, and document structure constraints into a LLM for the medical domain,” he said.

Specifically, Rios suggested including medical terms as part of the instructions given to the LLM. When translating a segment, the model is provided with relevant medical terms that should be used in the translation.

Additionally, the approach involves identifying pairs of source terms and their corresponding target terms that are relevant to the text being translated, ensuring the correct medical terminology is applied to these segments during translation.

If one or more candidate terms are successfully matched in a segment, they are incorporated into the instruction template provided to the LLM. This means the model receives a prompt that not only instructs it to translate the text but also specifies which medical terms to use.

If no matching candidate terms are found, the model is given a basic translation task prompt, instructing it to translate the text without any specific medical terminology guidance.
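The prompt-selection logic described above can be illustrated with a minimal Python sketch. It is not the code used in the paper: the glossary entries, the prompt wording, and the function names `match_terms` and `build_prompt` are all hypothetical, and simple whole-word matching stands in for whatever term-matching method Rios actually used.

```python
import re

# Hypothetical glossary of (source term -> target term) pairs; not from the paper.
GLOSSARY = {
    "myocardial infarction": "infarto de miocardio",
    "blood pressure": "presión arterial",
}

def match_terms(segment, glossary):
    """Return (source, target) pairs whose source term occurs in the segment."""
    matches = []
    for source, target in glossary.items():
        # Case-insensitive whole-word match; an assumed, simplified matcher.
        pattern = r"\b" + re.escape(source) + r"\b"
        if re.search(pattern, segment, flags=re.IGNORECASE):
            matches.append((source, target))
    return matches

def build_prompt(segment, glossary, src_lang="English", tgt_lang="Spanish"):
    """Build a terminology-aware prompt, or a basic one if no term matches."""
    terms = match_terms(segment, glossary)
    if terms:
        # Assumed template wording: list the term pairs the model should use.
        term_list = "; ".join(f"{s} -> {t}" for s, t in terms)
        instruction = (
            f"Translate the following {src_lang} text into {tgt_lang}, "
            f"using these medical terms: {term_list}."
        )
    else:
        # Fallback: plain translation instruction without terminology guidance.
        instruction = f"Translate the following {src_lang} text into {tgt_lang}."
    return f"{instruction}\n{src_lang}: {segment}\n{tgt_lang}:"

if __name__ == "__main__":
    print(build_prompt("The patient had a myocardial infarction.", GLOSSARY))
    print(build_prompt("The clinic opens at nine.", GLOSSARY))
```

In this sketch, the first call produces a prompt that lists the matched term pair, while the second falls back to the basic translation instruction because no glossary term appears in the segment.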

Unbabel’s Tower Takes the Lead

For the experiments, Rios used Google’s FLAN-T5, Meta’s LLaMA-3-8B, and Unbabel’s Tower-7B as baseline models, applying QLoRA for parameter-efficient fine-tuning, and tested them across the English-Spanish, English-German, and English-Romanian language pairs.

The results revealed that the instruction-tuned models “significantly” outperformed the baselines in terms of automatic metrics such as BLEU, chrF, and COMET scores. Specifically, the Tower-7B model showed the best performance in English-Spanish and English-German translations, followed by LLaMA-3-8B, which demonstrated strong performance in English-Romanian translations.

Talking to Slator, Rios expressed his intention to perform a manual evaluation with professional translators in the future, as automated metrics alone may not fully capture how well the models generate the correct medical terms in their translations.
