RWS’s TrainAI Team Benchmarks LLMs for Multilingual Synthetic Data

The study evaluates eight frontier models (GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, DeepSeek V3.1, Llama 4 Maverick, Mistral Medium 3.1, Mistral Small 3.2, and Qwen3 235B) across four tasks and eight languages. The tasks are domain-specific paragraph generation, conversation generation, text normalization (standardizing spoken-form text), and translation. The languages are English, French, Chinese, Arabic, Polish, Tagalog, Tamil, and Kinyarwanda.

According to TrainAI, the evaluation covered 25,600 samples, with outputs assessed by 120 linguists across more than 200,000 individual ratings.

No Single Best Model

One of the key findings is that no single model performs best across all tasks.

“The most important takeaway is that there is no single ‘best’ LLM for multilingual synthetic data generation,” said Tomáš Burkert, Head of Innovation at TrainAI by RWS. “The right model depends entirely on your tasks, languages and cost priorities,” he added.

Models that perform strongly in content generation — such as domain-specific paragraph or conversation creation — do not necessarily achieve the same results in text normalization or translation.

According to RWS’s TrainAI team, Gemini 2.5 Pro ranked highest overall, followed by Claude 4.5 Sonnet and DeepSeek V3.1.

At the task level, however, performance diverged. Claude 4.5 Sonnet and Gemini 2.5 Pro led in domain-specific paragraph generation, with Claude achieving the highest overall scores and Gemini showing more consistent performance across languages. Gemini 2.5 Pro also led conversation generation, with particularly strong multilingual results.

In text normalization, GPT-5 ranked slightly ahead of Gemini 2.5 Pro, while Gemini showed more stable performance across languages. For translation, Gemini 2.5 Pro emerged as the top model overall, with GPT-5 as a strong alternative.

The inclusion of lower-resource languages such as Kinyarwanda highlights how model performance varies beyond high-resource settings. TrainAI reports improvements compared to earlier benchmarks — with several frontier models now producing high-quality outputs in languages where previous generations struggled. However, results remain uneven, with differences between models becoming more pronounced depending on the task and language. In particular, Gemini 2.5 Pro showed more stable performance across languages, while other models — including Claude 4.5 Sonnet and GPT-5 — exhibited greater variability in long-tail settings.

“While challenges remain, the current generation of models signals that synthetic data generation and translation are becoming viable for a far broader range of languages than ever before,” TrainAI by RWS noted.

Operational Factors: Structure, Cost, and Reliability

Beyond output quality, the study highlights operational factors affecting the reliability and scalability of synthetic data generation.

Schema adherence — the ability to follow structured output formats — emerged as a key differentiator. While most models handled structured outputs reliably in tasks such as conversation generation and translation, DeepSeek V3.1 and Llama 4 Maverick showed repeated difficulties adhering to output formats, requiring additional prompt adjustments and post-processing. In text normalization, Mistral Medium 3.1 and Mistral Small 3.2 often failed to preserve the required output structure, limiting batch processing.
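As a rough illustration of what a schema adherence check can look like in a batch pipeline, the sketch below validates raw model outputs against a simple conversation format and reports the share that pass. The schema (a JSON list of turns with `speaker` and `text` fields) and the function names are hypothetical. The study does not describe its output formats or validation tooling.

```python
import json

# Hypothetical schema: each conversation is a JSON list of turns,
# and every turn is an object with string fields "speaker" and "text".
REQUIRED_KEYS = {"speaker", "text"}


def is_schema_valid(raw: str) -> bool:
    """Return True if the raw model output parses and matches the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Extra prose around the JSON, truncated output, etc.
        return False
    if not isinstance(data, list) or not data:
        return False
    for turn in data:
        if not isinstance(turn, dict) or not REQUIRED_KEYS.issubset(turn):
            return False
        if not all(isinstance(turn[key], str) for key in REQUIRED_KEYS):
            return False
    return True


def adherence_rate(outputs: list[str]) -> float:
    """Share of outputs that can be consumed without manual post-processing."""
    if not outputs:
        return 0.0
    return sum(is_schema_valid(o) for o in outputs) / len(outputs)
```

Outputs that fail a check like this have to be re-prompted or repaired before they can enter a dataset, which is the kind of overhead the study associates with weaker schema adherence.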

Tokenizer efficiency — measured as characters per token — also varied across models, with direct cost implications. Gemini 2.5 Pro ranked as the most efficient, followed by GPT-5, while Claude 4.5 Sonnet was the least efficient, particularly in non-Latin scripts. These differences are amplified in reasoning models, where longer outputs increase token consumption and overall cost, RWS’s TrainAI team notes.
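To make the cost implication concrete, the sketch below computes characters per token and a rough cost estimate for a fixed volume of text. The function names and all numbers used with them are illustrative assumptions, not figures from the study. Token counts would come from each provider's own tokenizer or usage reports.

```python
def chars_per_token(text: str, token_count: int) -> float:
    """Characters per token: a higher value means fewer tokens for the same text."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return len(text) / token_count


def estimated_cost(num_chars: int, cpt: float, price_per_million_tokens: float) -> float:
    """Rough cost of producing num_chars characters at a given characters-per-token rate."""
    if cpt <= 0:
        raise ValueError("cpt must be positive")
    tokens = num_chars / cpt
    return tokens / 1_000_000 * price_per_million_tokens


# Illustrative only: the same 4 million characters at the same token price
# cost twice as much when the tokenizer yields 2 characters per token instead of 4.
cost_efficient = estimated_cost(4_000_000, cpt=4.0, price_per_million_tokens=2.0)
cost_inefficient = estimated_cost(4_000_000, cpt=2.0, price_per_million_tokens=2.0)
```

Under these assumed values, halving characters per token doubles the bill for identical text, which is why tokenizer efficiency matters most in scripts that a tokenizer splits into many small pieces.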

Finally, TrainAI examined output variability. In content generation tasks, GPT-5 produced the most lexically diverse outputs, while Gemini 2.5 Pro showed the most consistent performance across languages.
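The article does not say how lexical diversity was measured. One common and simple proxy is the type-token ratio (unique words divided by total words), sketched below as an assumption rather than the study's method. Whitespace splitting is used for brevity and would not work for languages written without spaces, such as Chinese.

```python
def type_token_ratio(text: str) -> float:
    """Unique words divided by total words, using naive whitespace tokenization."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Five words, three of them distinct ("the", "cat", "saw").
ratio = type_token_ratio("the cat saw the cat")
```

Because the ratio falls as texts get longer, comparisons of this kind are only meaningful between outputs of similar length.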

Synthetic Data Expands, but Human Expertise Remains Central

RWS’s TrainAI study highlights synthetic data generation as a way to address persistent constraints in sourcing real-world multilingual datasets, particularly where data is scarce, sensitive, or costly to obtain. TrainAI points to uneven web representation, copyright restrictions, and privacy concerns as key drivers.

At the same time, the results illustrate how this shift affects workflows. While RWS’s TrainAI team found that top-performing models can deliver better outputs than humans under specific constrained conditions, it emphasizes that “this doesn’t render human expertise obsolete.”

Instead, synthetic data generation and model outputs depend on high-quality human data to guide, validate, and evaluate results. In practice, this creates a feedback loop in which models produce strong initial outputs, including synthetic datasets, and “humans add value through review, refinement, and specialized judgment,” the team said.

This aligns with Slator’s Data-for-AI Market Report, which finds that synthetic data expands coverage but continues to rely on human input for alignment, evaluation, and deployment — often increasing, rather than reducing, the importance of human expertise.

TrainAI by RWS notes that the findings reflect current model performance in a fast-moving market, where new releases can quickly change rankings and capabilities.

RWS’s TrainAI team also recommends evaluating models against specific use cases rather than relying on overall rankings, re-evaluating performance regularly, and treating synthetic data as one piece of a broader AI data strategy.

Competitive advantage is increasingly tied to data, particularly the ability to generate, evaluate, and adapt it for real-world use. The TrainAI study illustrates how this shift is playing out in practice, with LLMs becoming part of data production workflows that combine synthetic generation with human oversight.
