New Benchmark Tests AI Detection Across Languages and Translation

AI systems used to detect synthetic or manipulated text may perform unevenly across languages, and content transformations such as AI translation or hybrid human-AI editing can further complicate detection, according to a February 28, 2026 study.

Researchers from Penn State University, MIT Lincoln Laboratory, Trinity College Dublin, the Kempelen Institute of Intelligent Technologies, and Visa Research introduced BLUFF (Benchmarking in LowresoUrce Languages for detecting Falsehoods and Fake news), a multilingual dataset designed to evaluate how well AI systems detect synthetic or manipulated text across languages.

The dataset spans 79 languages — 20 high-resource and 59 low-resource ones — and more than 200,000 samples, combining human-written fact-checked articles with AI-generated content.

The benchmark also covers multiple types of text transformation, including AI translation and hybrid human-AI editing. The goal is to test how detection systems behave once text has gone through such transformations.

Performance Drops in Low-Resource Languages

Using the benchmark, the researchers evaluated several detection models and found significant differences in performance across languages.

The dataset categorizes languages into “big-head” languages — those with substantial training data (i.e., high-resource languages) — and “long-tail” languages, which have far fewer available resources (i.e., low-resource languages).

The researchers found that detection models performed substantially worse on low-resource languages than on widely used ones, with performance dropping significantly in some experiments.

The researchers note that multilingual AI systems continue to reflect underlying data imbalances: systems trained largely on English and other high-resource languages tend to degrade when applied to smaller languages.

Translation and Rewriting Affect Detection

The benchmark also evaluates models on different types of authorship and text transformations. BLUFF includes four content categories: human-written text, AI-generated text, AI-translated text, and hybrid human–AI edited text.

These transformations are designed to reflect how synthetic or manipulated content can appear across languages.

In addition to language differences, the study shows that once text has been AI-translated or passed through hybrid human-AI editing, it becomes harder for AI systems to determine whether it was written by a human, generated by AI, translated by AI, or jointly edited by a human and AI.

The findings highlight challenges for organizations deploying AI systems to analyze multilingual text — including tools used for brand monitoring, compliance checks, or customer feedback analysis.

Many AI detection systems are evaluated primarily in English or other high-resource languages. The study suggests that performance observed in those languages may not hold across languages or multilingual content pipelines.

For organizations operating across multiple languages, this study highlights the importance of testing AI systems in the languages they are expected to handle rather than assuming uniform multilingual performance.
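As a rough illustration of what per-language testing can look like in practice, the sketch below breaks a detector's accuracy down by language instead of reporting a single aggregate score. It is not taken from the study: the function name, the language codes, and the toy predictions are all hypothetical, and a real evaluation would use a held-out labeled set for each language the system is expected to handle.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per group for (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        if truth == predicted:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy predictions, invented for illustration only.
# Each record is (language code, true label, detector's predicted label).
records = [
    ("en", "human", "human"),
    ("en", "ai", "ai"),
    ("en", "ai", "ai"),
    ("en", "human", "ai"),
    ("sw", "human", "ai"),
    ("sw", "ai", "ai"),
    ("sw", "human", "ai"),
    ("sw", "ai", "human"),
]

scores = accuracy_by_group(records)
overall = sum(truth == predicted for _, truth, predicted in records) / len(records)

print(f"overall: {overall:.2f}")
for language, score in sorted(scores.items()):
    print(f"{language}: {score:.2f}")
```

In this made-up example the aggregate accuracy hides a wide gap between the two languages, which is the kind of disparity a single headline number would miss. The same breakdown could be applied to content categories, such as AI-translated or hybrid human-AI edited text, by using the category as the group key.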

The researchers have released the benchmark publicly to support further work on multilingual AI robustness and evaluation.

Authors: Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita, Dominik Macko, Ivan Srba, Robert Moro, and Dongwon Lee