They found that using training data from multiple source languages improved the accuracy of both monolingual and multilingual classifiers.
The researchers built a data set from WMT news shared tasks, pairing non-English source texts with their corresponding English human translations (HT) and machine translations (MT). Monolingual classifiers were trained on English-only data, while multilingual classifiers were trained on both source texts and their English translations.
Multilingual classifiers were more accurate than monolingual classifiers at identifying translations as human or machine, indicating that classifiers benefit from access to the source sentences.
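To make the difference between the two setups concrete, here is a minimal sketch of how the inputs to each kind of classifier might be assembled. The `Example` class, the function names, and the `[SEP]` separator are illustrative assumptions; the article does not describe how the researchers actually combined source and translation.

```python
from dataclasses import dataclass


@dataclass
class Example:
    source: str       # non-English source sentence
    translation: str  # English translation of that sentence
    label: str        # "HT" (human) or "MT" (machine)


def monolingual_input(ex: Example) -> str:
    """Monolingual setup: the classifier sees only the English translation."""
    return ex.translation


def multilingual_input(ex: Example, sep: str = " [SEP] ") -> str:
    """Multilingual setup: the classifier sees source and translation together.

    Joining the two with a separator string is an assumption made for
    illustration, not a detail taken from the paper.
    """
    return ex.source + sep + ex.translation


ex = Example(source="Das ist ein Test.", translation="This is a test.", label="HT")
print(monolingual_input(ex))
print(multilingual_input(ex))
```

In both cases the label to predict is the same; only the text the classifier is allowed to see changes.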
And experiments with German, Russian, and Chinese showed that training on multiple source languages improved classifier performance in other languages.
A Promising Direction
“There does seem to be a diminishing effect of incorporating training data from different source languages, though, as the best score is only once obtained by combining all three languages as training data,” the authors wrote. “Nevertheless, given the improved performance for even only small amounts of additional training data (Chinese has only 1,756 training instances), we see this as a promising direction for future work.”
The group also found that fine-tuning a sentence-level model on document-length text was effective, and worked better than training models directly on documents instead of sentences. Fine-tuning in this way led to the highest levels of accuracy and the lowest standard deviations, indicating more stable classifiers.
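As a rough illustration of why a lower standard deviation signals a more stable classifier, the sketch below summarizes accuracy across repeated training runs. The function name and the sample figures are hypothetical placeholders, not results from the paper, and the article does not say how many runs the researchers averaged over.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of accuracy over several runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation, needs >= 2 runs
    return mean, std


# Made-up accuracies for two hypothetical setups, for illustration only.
stable_runs = [0.70, 0.71, 0.69]
unstable_runs = [0.60, 0.75, 0.65]

for name, runs in [("stable", stable_runs), ("unstable", unstable_runs)]:
    mean, std = summarize_runs(runs)
    print(f"{name}: mean={mean:.3f} std={std:.3f}")
```

A setup whose runs cluster tightly around the mean depends less on random initialization, which is the sense in which the article calls such classifiers more stable.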
Looking ahead, as text generation continues to incorporate MT, the researchers wrote, it will likely become more difficult to distinguish original texts from translations. The next logical step in this line of research, then, will address classifiers that can identify text as original, HT, or MT.