In an ELITR demo video, Charles University Associate Professor Ondřej Bojar said the project also looks into the possibility of “going directly from the source speech into the target language with an end-to-end spoken language translation system.”
In short, speech-to-speech translation (S2ST). For ELITR, however, Bojar told Slator, “We stop at the target text. We are not including the final text-to-speech — although we definitely could.”
S2ST has become a sort of brass ring in research and big tech, pursued by the likes of Apple, Google (via the so-called “Translatotron”; SlatorPro), and prominent Japanese researchers, who uploaded a toolkit for it on GitHub. Chinese search giant Baidu even drew some flak for its claims around it; and, of course, there is a whole graveyard of translation gadgets from companies that tried to commercialize S2ST.
Admittedly, ELITR’s production pipeline currently relies on two independent steps: automatic speech recognition (ASR) and machine translation (MT). According to Bojar, “we are actually quite good in these two steps” (as evidenced by a paper published on June 17, 2021, and two others published in September and October 2020).
“We’re also investigating the possibilities of going directly from the source speech into the target language with an end-to-end spoken language translation system” (Ondřej Bojar, Associate Professor, Charles University)
End-to-end speech translation is part of the long-term vision, as outlined in a recent paper published on the Association for Computational Linguistics portal. “The goal of a practically usable simultaneous spoken language translation (SLT) system is getting closer,” wrote the authors from Charles University, Karlsruhe Institute of Technology, the University of Edinburgh, and Italy-based ASR provider PerVoice. SLT also encompasses offline spoken language systems, the authors said.
The authors (Bojar, among them) mentioned two problems of the current system that have yet to be solved.
- Intonation – which cannot be factored in because punctuation prediction has no access to sound; and
- Segmentation errors – that is, MT systems tending to “normalize word order,” thus reducing fluency in a stream of spoken sentences.
Hence, “for the future, we consider three approaches,” Bojar et al. added: “(1) training MT on sentence chunks, (2) including sound input in punctuation prediction, or (3) end-to-end neural SLT.”
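To make the two problems concrete, here is a minimal sketch of a cascaded pipeline of the kind described above. It is not ELITR code: every name (`cascaded_slt`, `toy_asr`, `toy_punctuate`, `toy_translate`) is a hypothetical placeholder, and the toy stages only stand in for real ASR, punctuation, and MT models. The point is structural: the audio is dropped after the first stage, so the punctuation step decides sentence boundaries from text alone, and MT translates whatever segments it is handed.

```python
from typing import Callable, List

# Hypothetical stage signatures; none of these are ELITR components.
Asr = Callable[[bytes], str]              # audio -> unpunctuated text
Punctuator = Callable[[str], List[str]]   # text only -> sentence segments
Translator = Callable[[str], str]         # one segment -> target-language text


def cascaded_slt(audio: bytes, asr: Asr, punctuate: Punctuator,
                 translate: Translator) -> List[str]:
    """Run the cascade; each stage sees only the previous stage's text."""
    transcript = asr(audio)             # the sound is discarded after this step
    segments = punctuate(transcript)    # boundaries are guessed without intonation
    return [translate(segment) for segment in segments]


# Toy stand-ins so the sketch runs end to end.
def toy_asr(audio: bytes) -> str:
    # Pretend the bytes already contain the spoken words.
    return audio.decode("utf-8").lower()


def toy_punctuate(text: str) -> List[str]:
    # Naive text-only segmentation: close a "sentence" every three words.
    words = text.split()
    groups = [words[i:i + 3] for i in range(0, len(words), 3)]
    return [" ".join(group).capitalize() + "." for group in groups]


def toy_translate(segment: str) -> str:
    # Placeholder for MT: tags the segment instead of translating it.
    return f"[target] {segment}"


result = cascaded_slt(b"Hello how are you today",
                      toy_asr, toy_punctuate, toy_translate)
```

In this toy run the segmenter cuts the utterance after “are,” so the translation stage receives two fragments rather than one sentence. An end-to-end system, the third approach the authors list, would replace all three stages with a single model that maps audio to target text.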
Working alongside Charles University on ELITR were the University of Edinburgh and Karlsruhe Institute of Technology. ASR provider PerVoice and Germany-based video conferencing platform alfaview also participated in the project. Does this mean commercialization plans are on the drawing board?
Bojar told Slator, “For a research institute at a university, commercialization is always something that takes an unbearably long time, but we are definitely very open to many forms of collaboration.”