To address this challenge, Rios proposed improving LLMs’ performance by incorporating specialized terminology through instruction tuning — a technique where models are fine-tuned using datasets from various tasks formatted as instructions. “Our goal is to incorporate terminology, syntax information, and document structure constraints into a LLM for the medical domain,” he said.
Specifically, Rios suggested including medical terms as part of the instructions given to the LLM. When translating a segment, the model is provided with relevant medical terms that should be used in the translation.
The approach also identifies pairs of source terms and their corresponding target terms that are relevant to the text being translated, so that the correct medical terminology is applied to those segments during translation.
If one or more candidate terms are successfully matched in a segment, they are incorporated into the instruction template provided to the LLM. This means the model receives a prompt that not only instructs it to translate the text but also specifies which medical terms to use.
If no matching candidate terms are found, the model is given a basic translation task prompt, instructing it to translate the text without any specific medical terminology guidance.
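To make the two cases concrete, the prompt-building step might look roughly like the following Python sketch. It is an illustration only, not Rios's implementation: the glossary entries, the function names `match_terms` and `build_prompt`, the whole-word matching rule, and the wording of the instruction template are all assumptions made for the example.

```python
import re

# Hypothetical glossary of (source term -> target term) pairs; illustrative only.
GLOSSARY = {
    "myocardial infarction": "infarto de miocardio",
    "hypertension": "hipertensión",
}


def match_terms(segment, glossary):
    """Return the (source, target) pairs whose source term occurs in the segment."""
    matches = []
    for src, tgt in glossary.items():
        # Case-insensitive whole-word match; a real system may use a different rule.
        if re.search(r"\b" + re.escape(src) + r"\b", segment, flags=re.IGNORECASE):
            matches.append((src, tgt))
    return matches


def build_prompt(segment, glossary, src_lang="English", tgt_lang="Spanish"):
    """Build a translation prompt, adding term guidance only when terms match."""
    instruction = f"Translate the following text from {src_lang} to {tgt_lang}."
    terms = match_terms(segment, glossary)
    if terms:
        # Assumed template wording: list each matched term pair for the model.
        term_list = "; ".join(f'"{s}" -> "{t}"' for s, t in terms)
        instruction += f" Use these medical terms: {term_list}."
    # With no matches, the instruction stays a basic translation prompt.
    return f"{instruction}\n{segment}"


if __name__ == "__main__":
    print(build_prompt("The patient has hypertension.", GLOSSARY))
    print(build_prompt("The patient is resting.", GLOSSARY))
```

In this sketch, a segment containing a glossary term yields a prompt that names the term pair, while a segment with no match falls back to the plain translation instruction.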
Unbabel’s Tower Takes the Lead
For the experiments, Rios utilized Google’s FLAN-T5, Meta’s LLaMA-3-8B, and Unbabel’s Tower-7B as baseline models, applying QLoRA for parameter-efficient fine-tuning, and tested them across English-Spanish, English-German, and English-Romanian language pairs.
The results revealed that the instruction-tuned models “significantly” outperformed the baselines in terms of automatic metrics such as BLEU, chrF, and COMET scores. Specifically, the Tower-7B model showed the best performance in English-Spanish and English-German translations, followed by LLaMA-3-8B, which demonstrated strong performance in English-Romanian translations.
Talking to Slator, Rios expressed his intention to perform a manual evaluation with professional translators in the future, as automated metrics alone may not fully capture how well the models generate the correct medical terms in their translations.