They also cite Section 1557 of the US Affordable Care Act (ACA), which requires certain machine translations to undergo human review. The researchers claim this “high level” mandate is the only guidance on the subject at the federal level.
Common Concerns
The researchers discuss aspects of AI translation that “influence its adoption, implementation and sustainability” in the first of five sections of the roadmap. Some of the major concerns include patient privacy, operational costs, and the limitations of AI tools, consistent with what other research and institutions have voiced before.
“[Patients] face unsafe care because discharge instructions and other materials are rarely translated in time.” — Lopez et al.
The researchers make a specific recommendation for maintaining the confidentiality of protected health information (PHI), saying that institutions should use zero-data-retention (ZDR) endpoints, particularly when adopting a third-party closed-source large language model (LLM) for AI translation.
They also advise training translators to identify typical limitations of models, like AI hallucination, context loss, and bias. Furthermore, to keep translators from becoming overly reliant on AI and potentially failing to accurately evaluate a model’s output, the researchers highlight the need for safeguards in translation workflows.
A specific recommendation is to intermittently ask translators to justify their choice not to edit a segment translated by AI.
The authors also emphasize that healthcare institutions should be transparent, so that patients understand how AI translation is being used, and should use patient feedback to improve translation workflows.
When it comes to the role of individual clinicians, the authors acknowledge that poorly written source texts may lead to translation errors and advise encouraging clinicians to prioritize “well-structured” notes “written in plain language.”
At the same time, they offer a “two-prompt approach” as an alternative. In this scenario, clinicians’ notes are refined using a “preparation” prompt before being translated by the LLM. Notably, they suggest using “an LLM’s zero-shot capabilities” for the preparation step despite acknowledging the risk of AI hallucination elsewhere in the paper.
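As a rough illustration of the two-prompt approach described above, the following Python sketch chains a preparation prompt and a translation prompt. The prompt wording, the function name `two_prompt_translate`, and the `llm` callable are all assumptions made for this example; the article does not quote the paper's actual prompts or name a specific model API.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's actual prompts are not quoted here.
PREPARATION_PROMPT = (
    "Rewrite the following clinical note in plain language with clear structure. "
    "Do not add, remove, or change any clinical facts.\n\n{note}"
)
TRANSLATION_PROMPT = (
    "Translate the following text into {language}. "
    "Preserve all medication names, doses, and dates exactly.\n\n{text}"
)


def two_prompt_translate(note: str, language: str, llm: Callable[[str], str]) -> dict:
    """Run a preparation prompt, then a translation prompt, keeping both outputs.

    `llm` is any function that takes a prompt string and returns the model's
    text reply (an assumption; plug in whichever client an institution uses).
    """
    # Step 1: refine the clinician's note before translation.
    prepared = llm(PREPARATION_PROMPT.format(note=note))
    # Step 2: translate the refined note, not the raw one.
    translated = llm(TRANSLATION_PROMPT.format(language=language, text=prepared))
    # Keep the intermediate text so a human reviewer can check whether the
    # preparation step dropped or invented any content.
    return {"source": note, "prepared": prepared, "translation": translated}
```

Returning the intermediate "prepared" text alongside the translation is a design choice of this sketch, intended to give reviewers a way to spot hallucinated content introduced before translation.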
The authors also recognize that poor-quality AI translations could require a larger editing effort from translation staff than conventional translation.
To address this risk, they suggest that organizations first identify document types and language pairs that AI translation “consistently handles well,” and limit usage to those use cases in the early stages of implementation. For example, they suggest that AI translation can likely handle a Spanish translation of “routine discharge summaries,” while it may not be suitable for a Korean translation of surgery consent forms.
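One simple way to enforce such a restriction is an allowlist of approved combinations. The sketch below is illustrative only: the document type labels, language codes, and the `route` function are hypothetical, and the single approved entry mirrors the Spanish discharge summary example rather than any list published by the authors.

```python
# Hypothetical allowlist of (document type, target language) pairs that an
# institution has found AI translation to handle consistently well.
APPROVED_FOR_AI = {
    ("discharge_summary", "es"),
}


def route(document_type: str, target_language: str) -> str:
    """Send a request to AI translation only if its pair is on the allowlist."""
    if (document_type, target_language) in APPROVED_FOR_AI:
        # AI output still goes to a human reviewer before release.
        return "ai_with_human_review"
    # Everything else stays with conventional human translation.
    return "human_translation"
```

Under these assumptions, `route("discharge_summary", "es")` selects the AI path, while `route("surgery_consent", "ko")` falls back to human translation.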
Retrospective and Prospective Testing
The researchers posit that a “critical first step” before deploying any AI translation tool is “retrospective testing,” where the model’s ability to translate relevant medical documents is assessed using historical data in a secure environment.
This would be followed by “prospective testing,” where AI translation is used in small-scale pilot tests that would track translation quality and impact over a set period of time.
The researchers argue that healthcare systems could then leverage translator-approved AI translations as “gold-standard data” to iteratively fine-tune the model over time, improving its performance with different document types and language pairs.
To comprehensively verify the performance of AI translation in healthcare settings, the authors also propose a combination of translation quality evaluations and broader metrics for operational and clinical outcomes.
For linguistic evaluation, they propose that an institution’s translation team use systems like the Multidimensional Quality Metrics (MQM) framework to manually evaluate a representative sample of translations on a periodic basis, while using the automated chrF++ and COMET metrics for more frequent routine monitoring.
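To give a sense of what an automated metric of this kind measures, here is a simplified character n-gram F-score in plain Python. It is not the official chrF++ implementation (which also counts word n-grams and differs in other details), and it assumes a translator-approved reference translation is available for comparison; the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1]; higher means closer to the reference."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

In routine monitoring, scores like this could be logged per document and reviewed for drops over time, with the manual MQM review reserved for the periodic sample.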
To evaluate operational efficiency over time, the researchers suggest tracking translation turnaround times and the proportion of LEP patients who receive “language-concordant discharge instructions.” However, they do not propose methods for collecting these data.
They also argue that clinical outcomes like readmissions and mortality rates can be tracked to “gauge the real-world effects of MAT on care quality.”
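Assuming an institution already logs when a translation was requested and delivered for each discharge, the two operational metrics could be computed along the following lines. The record structure and field names are hypothetical, since the paper does not specify a data source.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional


@dataclass
class DischargeRecord:
    # Hypothetical fields; real systems would pull these from their own logs.
    lep_patient: bool
    requested_at: Optional[datetime]
    delivered_at: Optional[datetime]


def operational_metrics(records: List[DischargeRecord]) -> dict:
    """Median turnaround (hours) and share of LEP patients who got translated instructions."""
    lep = [r for r in records if r.lep_patient]
    # A patient counts as served only if a translation was both requested and delivered.
    served = [r for r in lep if r.requested_at and r.delivered_at]
    hours = [(r.delivered_at - r.requested_at).total_seconds() / 3600 for r in served]
    return {
        "median_turnaround_hours": median(hours) if hours else None,
        "share_language_concordant": len(served) / len(lep) if lep else None,
    }
```

Returning `None` when there are no qualifying records, rather than zero, keeps an empty reporting period from being mistaken for poor performance.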
Authors: Ivan Lopez, David E. Velasquez, Jonathan H. Chen & Jorge A. Rodriguez