The authors attribute these biases to gender stereotypes, underscoring the importance of defining and understanding them. Pikuliak, Hrckova, Oresko, and Šimko pointed out that many gender stereotypes exist worldwide and that they vary across cultures. “Many previous works do not consider this and they work with the concept of stereotype as if it were a singular entity,” they said.
To address this, the Kempelen team employed a more fine-grained approach to study “which specific stereotypes were learned by the models and how strong the stereotypes are,” and they released GEST, a new dataset for measuring gender-stereotypical reasoning in English-to-X MT systems.
Strong Male-As-Norm Behavior
Using GEST, they evaluated Amazon Translate, DeepL, Google Translate, and NLLB200, revealing strong “male-as-norm” behavior, with Amazon Translate identified as “the most masculine system,” followed by Google Translate.
They also observed similar tendencies for gender-stereotypical reasoning across these systems, suggesting they might have learned from “very similar poisoned sources.” According to the authors, these systems “think” that women are beautiful, neat, and diligent, while men are leaders, professional, rough, and tough.
With a better understanding of the MT systems’ behavior, the authors recommended a focused approach that addresses specific issues, such as preventing models from sexualizing women. “This might be more manageable compared to when gender bias is taken as one vast and nebulous problem,” they said.
Mitigating Gender Bias
The Microsoft researchers looked at measuring and mitigating gender bias in MT systems. They emphasized that gender bias in MT goes beyond sentences with ambiguous gender, extending to instances where gender can be inferred from the context, yet the MT output contradicts the gender information present in the source.
To address this, they proposed fine-tuning a base model using a gender-balanced in-domain dataset derived from the training corpus and introduced a novel domain-adaptation technique, leveraging counterfactual data generation methods.
The process involved selecting gendered sentences from the base model training corpus and generating counterfactuals by creating gender-swapped versions, with a specific focus on sentences containing masculine or feminine forms of animate nouns that denote professions.
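The selection and gender-swapping steps described above can be illustrated with a minimal sketch. This is not the researchers' implementation: the swap lexicon, the function names (`is_gendered`, `gender_swap`, `build_balanced`), and the word-level substitution on English text are all assumptions made for illustration. A real pipeline would also need to swap the corresponding words in the target-language translation and handle grammatical agreement.

```python
import re

# Hypothetical, deliberately tiny swap lexicon (illustration only, not from the paper).
# Ambiguous pronouns such as "her" or "his" are left out to keep the swap unambiguous.
SWAP_PAIRS = [
    ("he", "she"),
    ("man", "woman"),
    ("father", "mother"),
    ("actor", "actress"),
    ("waiter", "waitress"),
]

# Build a bidirectional lookup: masculine -> feminine and feminine -> masculine.
SWAP = {}
for masculine, feminine in SWAP_PAIRS:
    SWAP[masculine] = feminine
    SWAP[feminine] = masculine

TOKEN = re.compile(r"[A-Za-z]+")


def is_gendered(sentence):
    """Return True if the sentence contains at least one word from the lexicon."""
    return any(token.lower() in SWAP for token in TOKEN.findall(sentence))


def gender_swap(sentence):
    """Replace every lexicon word with its counterpart, keeping initial capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower())
        if swapped is None:
            return word
        return swapped.capitalize() if word[0].isupper() else swapped

    return TOKEN.sub(replace, sentence)


def build_balanced(corpus):
    """Keep only gendered sentences and pair each with its gender-swapped counterfactual."""
    balanced = []
    for sentence in corpus:
        if is_gendered(sentence):
            balanced.append(sentence)
            balanced.append(gender_swap(sentence))
    return balanced
```

Under these assumptions, calling `build_balanced` on a corpus returns each gendered sentence followed by its counterfactual, so masculine and feminine forms appear equally often in the resulting fine-tuning set.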
The authors highlighted three advantages of their approach. First, it generates the fine-tuning data from a subset of the in-domain training corpus, which avoids the catastrophic forgetting otherwise seen during domain adaptation. Second, it is purely data-centric, requiring no modifications to training objectives or additional decoding models. Third, it uses counterfactual data generation techniques, providing a dynamic and diverse dataset during model training.
Accuracy Improvements
The evaluation of their approach, conducted using the WinoMT test set tailored for profession words, demonstrated significant accuracy improvements for Italian, Spanish, and French.
“We achieve 19%, 23%, and 21.6% […] accuracy improvements over the baseline for Italian, Spanish, and French respectively, without significant loss in general translation quality,” they said.
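As a rough illustration of how a gender accuracy figure of this kind can be computed, the sketch below counts the share of test sentences in which the gender of the translated profession word matches the gold label. The function name `gender_accuracy` and the label lists are hypothetical toy data, not results from the paper, and the source does not specify exactly how the reported improvements were calculated.

```python
def gender_accuracy(predicted, gold):
    """Fraction of examples whose predicted gender label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy labels for illustration only (not taken from the paper).
gold = ["male", "female", "female", "male"]
baseline_output = ["male", "male", "male", "male"]
fine_tuned_output = ["male", "female", "male", "male"]

baseline_acc = gender_accuracy(baseline_output, gold)
fine_tuned_acc = gender_accuracy(fine_tuned_output, gold)
print(f"baseline: {baseline_acc:.2f}, fine-tuned: {fine_tuned_acc:.2f}")
```

Comparing the two scores on the same test set gives the improvement over the baseline.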
The authors concluded by highlighting potential directions for future work, including extending techniques to address non-binary gender and improving the handling of complex sentences involving multiple individuals, “where different entities get gender-swapped in the source and target.”