Fondazione Bruno Kessler researchers Matteo Negri, Marco Turchi, and Marco Gaido (who is also affiliated with Italy’s University of Trento) explain in their May 2022 paper, “Who Are We Talking About? Handling Person Names in Speech Translation,” that personal name errors typically stem from two causes: names that appear infrequently in training data, and a lack of training data in the language of the “referent name.”
Automatic speech recognition (ASR) and speech translation (ST) models trained on English audio try to force every sound to match English words, which can distort personal names from other languages.
Generalize and Disambiguate
“Current solutions rely on predefined dictionaries to identify and translate the elements of interest,” the authors wrote, preventing these solutions from generalizing and disambiguating homophones or homonyms.
To be useful, an ST or ASR system would need to “reliably recognize and translate [named entities] and terms, without generating wrong suggestions,” the authors explained.
With the long-term goal of integrating ST models into assistant tools for live interpreting, the group created multilingual models, trained with audio in different languages, to produce transcripts and translations into Spanish, French, and Italian.
Even though 80% of the total training data was in English, adding audio from another language to the corpus improved the handling of personal names in that language by 48% on average, producing useful translations for interpreters in 66% of cases.
In addition to incorporating data from other languages, the researchers also added the referent names, finding that the more frequently a name appeared, the more likely the system would transcribe the name correctly.
The study noted, “On average, names occurring at least three times in the training set are correctly generated in slightly more than 50% of the cases, a much larger value compared to those with less than three occurrences.”
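To make the frequency comparison concrete, here is a minimal Python sketch of how name accuracy could be grouped by training-set frequency. It is not the authors' evaluation code: the function name, the simple substring match, and the toy data are all assumptions made for illustration. Only the threshold of three occurrences comes from the study.

```python
from collections import Counter, defaultdict


def name_accuracy_by_frequency(train_names, test_cases, threshold=3):
    """Compute name accuracy separately for frequent and rare names.

    train_names: iterable of name strings observed in the training transcripts.
    test_cases:  iterable of (reference_name, system_output) pairs.
    threshold:   minimum training count for a name to count as "frequent".
    """
    freq = Counter(name.lower() for name in train_names)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for name, output in test_cases:
        bucket = "frequent" if freq[name.lower()] >= threshold else "rare"
        totals[bucket] += 1
        # Assumption: a name counts as correct if it appears verbatim
        # (case-insensitively) anywhere in the system output.
        if name.lower() in output.lower():
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy data, invented purely for illustration.
train = ["Merkel"] * 3 + ["Macron"] * 4 + ["Draghi"]
tests = [
    ("Merkel", "la cancelliera Merkel ha detto"),
    ("Macron", "il presidente Macaron ha detto"),
    ("Macron", "Emmanuel Macron è arrivato"),
    ("Draghi", "il signor Dragon ha risposto"),
    ("Draghi", "Mario Draghi ha risposto"),
]
result = name_accuracy_by_frequency(train, tests)
print(result)
```

A substring match is a deliberately crude stand-in; a real evaluation would need to handle tokenization, inflected forms, and names that are substrings of other words.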
Still, confusing or distracting transcriptions of personal names accounted for 15% of the results, leaving room for future research to examine what level of accuracy would be required to help interpreters at work, and how to attain it.