Study Finds Generic Reasoning Can Hurt AI Translation

Recent advances in large language models (LLMs) have introduced systems that generate step-by-step reasoning before producing answers. This approach has been shown to improve performance in tasks such as mathematics, coding, and logical problem solving.

In AI translation, reasoning-enabled models are also performing well. At the WMT25 General Machine Translation Shared Task — one of the key AI translation benchmarks — Google’s Gemini 2.5 Pro ranked first in most language pairs. According to the organizers, it was also the only participating system with reasoning enabled.

However, a new study found that reasoning models did not improve translation quality without purpose-built workflows and additional structure. In a February 16, 2026 paper, researchers from the University of Amsterdam and Cohere tested whether prompting models to explain their reasoning before translating leads to better results. 

The researchers evaluated four reasoning-capable models across nine language pairs: Cohere’s Command-A-Reasoning, Anthropic’s Claude 4 Opus, DeepSeek-R1, and Google’s Gemini 2.5 Flash.

For each model, they compared two approaches:

  • Direct translation, where the model translates the sentence immediately
  • Reasoning-first translation, where the model first produces a step-by-step explanation of the translation process before generating the final translation
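
To make the two conditions concrete, here is a minimal Python sketch of how such prompts might be built. The prompt wording, the function names, and the "Translation:" marker are illustrative assumptions; the study's actual prompts are not given here.

```python
def direct_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    """Direct condition: ask for the translation and nothing else."""
    # Hypothetical wording, not the study's prompt.
    return (
        f"Translate the following {src_lang} text into {tgt_lang}. "
        "Output only the translation.\n\n"
        f"{src_lang}: {source}\n{tgt_lang}:"
    )


def reasoning_first_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    """Reasoning-first condition: ask for an explanation before the answer."""
    # Hypothetical wording, not the study's prompt.
    return (
        f"Translate the following {src_lang} text into {tgt_lang}. "
        "First explain your translation process step by step, then give "
        "the final translation on a new line starting with 'Translation:'.\n\n"
        f"{src_lang}: {source}"
    )


def extract_translation(output: str) -> str:
    """Keep only the text after the last 'Translation:' marker, if present."""
    marker = "Translation:"
    _, found, tail = output.rpartition(marker)
    # Without a marker (the direct condition), the whole output is the answer.
    return tail.strip() if found else output.strip()
```

In a comparison like the one described, both prompts would be sent to the same model and only the extracted translations would be scored, so the explanation text itself does not affect the metric.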

Across models and almost all language pairs, with Farsi the exception, the direct translation approach achieved better results than the reasoning-first approach. Translation quality was evaluated using the XCOMET-XL metric, and the researchers acknowledged that human evaluation could provide additional support for their findings.

Why Generic Reasoning Falls Short

To understand the performance difference, the researchers analyzed the reasoning traces generated by the models.

They found that the outputs were largely linear and descriptive. Instead of exploring alternative translations or revising earlier decisions, the models typically produced explanations that walked through the translation process step by step.

The findings suggest that the chain-of-thought reasoning that improves analytical tasks does not necessarily transfer to language generation tasks such as translation.

Task-Structured Reasoning Works Better

In a separate experiment, the researchers tested a reasoning process designed specifically for translation and found that the added structure did help.

Instead of asking models to reason freely, they designed a workflow resembling a translation revision process: generate a draft translation, improve adequacy by correcting meaning errors, refine fluency and grammar, and then produce the final version.
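
The draft-and-revise workflow described above can be sketched in Python as a short pipeline. This is an illustration under stated assumptions, not the researchers' implementation: the step instructions are invented, the `generate` callable stands in for any model call, and running each step as a separate call is a choice made for this sketch (the steps could equally live inside a single reasoning trace).

```python
from typing import Callable, Dict

# Hypothetical step instructions; the study's wording is not given here.
STEPS = [
    ("draft", "Translate the {src} text into {tgt}."),
    ("adequacy", "Correct any meaning errors in the {tgt} translation."),
    ("fluency", "Improve the fluency and grammar of the {tgt} translation "
                "without changing what it says."),
]


def structured_translate(
    source: str, src: str, tgt: str, generate: Callable[[str], str]
) -> Dict[str, str]:
    """Run draft -> adequacy -> fluency and return every intermediate stage."""
    trace: Dict[str, str] = {}
    current = ""
    for name, instruction in STEPS:
        prompt = instruction.format(src=src, tgt=tgt) + f"\n\nSource: {source}"
        if current:
            # Revision steps see the previous stage's output.
            prompt += f"\nCurrent translation: {current}"
        current = generate(prompt).strip()
        trace[name] = current
    trace["final"] = current  # the last revision is the final version
    return trace


def fake_model(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs offline."""
    if "Current translation:" not in prompt:
        return "Le chat assis sur tapis."
    if "meaning" in prompt:
        return "Le chat est assis sur tapis."
    return "Le chat est assis sur le tapis."


if __name__ == "__main__":
    stages = structured_translate(
        "The cat sat on the mat.", "English", "French", fake_model
    )
    for stage, text in stages.items():
        print(f"{stage}: {text}")
```

Keeping every intermediate stage in the returned dictionary makes it possible to check whether each revision step actually changed the draft, rather than only inspecting the final output.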

With this structured reasoning approach, translation quality improved beyond both the base reasoning model and a strong direct-translation baseline.

“Our findings demonstrate that reasoning must be task-structured to benefit MT,” they said, adding that reasoning should be “explicitly shaped around error revision, constraint satisfaction, and iterative refinement, rather than leaving it up on model for generic reasoning.”

The results suggest that translation may benefit less from generic reasoning and more from structured reasoning processes that are closer to how human translators draft and refine text. As the field moves toward multi-step and agent-based approaches, improvements in AI translation may come less from models that think longer and more from systems designed to iteratively draft and revise translations.

The results may also raise questions about how vendors position reasoning models for language tasks, particularly as reasoning modes often come with significantly higher inference costs.

Authors: Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, and Tom Kocmi