Moreover, “existing gender evaluation benchmarks have limited diversity in terms of gender phenomena (e.g., focusing on professions), sentence structure (e.g., using templates to construct sentences), or language coverage,” according to the research paper describing the benchmark. This makes it even more challenging to assess how MT systems perform in terms of both gender and quality at the same time.
The paper was presented at the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Realistic Dataset
MT-GenEval is a large, realistic evaluation set that contains translations from English into eight diverse and widely spoken languages: Arabic, French, German, Hindi, Italian, Portuguese, Russian, and Spanish.
Unlike commonly used gender bias test sets, which are artificially constructed, the MT-GenEval dataset is based on real-world data obtained from Wikipedia and includes professionally created reference translations in each of the languages.
Furthermore, it is fully balanced by including human-created gender counterfactuals. “This type of balancing ensures that differently gendered subsets do not have different meanings,” the Amazon researchers explained in the same blog post.
In addition to the 1,150 segments of evaluation data per language pair, the researchers released 2,400 parallel sentences for training and development.
Gender Translation Accuracy
Gender accuracy in translation is defined as “the extent to which a machine translation output accurately reflects the gender of the humans mentioned in the input, restricted to cases where the gender is explicitly and linguistically disambiguated in the context of the input.”
Therefore, in their benchmark, the researchers do not take into account the grammatical gender of inanimate objects or instances in which the input gender is ambiguous within the given context.
The benchmark has proven to be useful for evaluating both commercial and research systems, including contextual machine translation models and gender-balanced models, in terms of gender accuracy as well as quality.
“MT-GenEval is a step forward for the evaluation of gender accuracy in machine translation,” the Amazon researchers said. “We hope that this benchmark and development data will spur more research in the field of gender accuracy in translation on diverse languages,” they concluded.