The benchmark also covers several types of text transformation, such as AI translation and hybrid human-AI editing. The goal is to test how detection systems behave once text has passed through such transformations.
Using the benchmark, the researchers evaluated several detection models and found significant differences in performance across languages.
The dataset categorizes languages into “big-head” languages, those with substantial training data (high-resource languages), and “long-tail” languages, which have far fewer available resources (low-resource languages).
The researchers found that detection models performed substantially worse on low-resource languages than on widely used ones, with performance dropping significantly in some experiments.
The researchers note that multilingual AI systems continue to reflect underlying data imbalances: systems trained largely on English and other high-resource languages tend to degrade when applied to smaller languages.
Translation and Rewriting Affect Detection
The benchmark also evaluates models on different types of authorship and text transformations. BLUFF includes four content categories: human-written text, AI-generated text, AI-translated text, and hybrid human–AI edited text.
These transformations are designed to reflect how synthetic or manipulated content can appear across languages.
Beyond language differences, the study shows that once text has been AI-translated or jointly edited by humans and AI (hybrid human-AI text), detection systems find it harder to determine whether it was written by a human, generated by AI, translated by AI, or edited by both.
The findings highlight challenges for organizations deploying AI systems to analyze multilingual text — including tools used for brand monitoring, compliance checks, or customer feedback analysis.
Many AI detection systems are evaluated primarily in English or other high-resource languages. The study suggests that performance observed in those languages may not carry over to other languages or to multilingual content pipelines.
For organizations operating across multiple languages, this study highlights the importance of testing AI systems in the languages they are expected to handle rather than assuming uniform multilingual performance.
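As a rough illustration of what testing per language can look like, the sketch below breaks a detector's accuracy down by language and flags languages that trail a reference language by more than a chosen margin. It is not the researchers' evaluation code. The function names, the record layout, the language codes, the 0.10 margin, and the toy records are all assumptions made up for this example, not benchmark data or results.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Compute accuracy per language.

    `records` is an iterable of (language, true_label, predicted_label)
    tuples. This layout is a hypothetical convention for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, truth, pred in records:
        total[lang] += 1
        if truth == pred:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def flag_weak_languages(scores, reference_lang="en", max_gap=0.10):
    """Return languages whose accuracy trails the reference by more than max_gap.

    The reference language and the margin are illustrative assumptions.
    """
    ref = scores[reference_lang]
    return sorted(lang for lang, s in scores.items() if ref - s > max_gap)


# Toy records, invented for illustration only (not results from the study).
records = [
    ("en", "human", "human"),
    ("en", "ai", "ai"),
    ("en", "ai", "ai"),
    ("en", "human", "human"),
    ("yo", "human", "ai"),
    ("yo", "ai", "ai"),
    ("yo", "human", "human"),
    ("yo", "ai", "human"),
]

scores = accuracy_by_language(records)
weak = flag_weak_languages(scores)
print(scores)
print(weak)
```

Reporting one score per language, rather than a single pooled score, keeps a strong result in a high-resource language from masking a weak one elsewhere. The same grouping could be applied to the four content categories instead of to languages.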
The researchers have released the benchmark publicly to support further work on multilingual AI robustness and evaluation.
Authors: Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita, Dominik Macko, Ivan Srba, Robert Moro, and Dongwon Lee