Scaling Speech Technologies
The USM is mainly for use in YouTube (e.g., for generating closed captions) and can perform automatic speech recognition (ASR) in over 100 languages. Coverage extends beyond widely spoken languages like English and Mandarin to under-resourced languages such as Amharic, Cebuano, Assamese, and Azerbaijani.
For some of these languages it is “very hard to find the necessary training data,” because they “are spoken by fewer than twenty million people,” Google explained. This is “a fundamental challenge in scaling speech technologies to many languages,” they added.
The USM uses a typical encoder-decoder architecture, and the training process has three steps. The first step involves self-supervised learning on speech audio covering hundreds of languages. In the second step, the model’s quality and language coverage can be further improved through pre-training with text data (this is optional depending on the availability of text data, but the USM performs better when this second step is included, Google said). The last step fine-tunes the model on specific tasks — such as ASR or automatic speech translation (AST) — using only a small amount of supervised data.
Google demonstrates that pre-training the model’s encoder on a large unlabeled multilingual dataset and then fine-tuning it on a smaller set of labeled data enables recognition of speech in under-represented languages. The proposed training process is also “effective at adapting to new languages and data,” they said.
Accessibility and Inclusion
The USM seems important to the search giant. In the post, Google says that they “believe USM’s base model architecture and training pipeline comprise a foundation on which we can build to expand speech modeling to the next 1,000 languages.”
In addition, Jeff Dean, SVP of Google Research and machine learning legend, wrote in a tweet that “it will likely improve over other speech systems.”
The USM achieved state-of-the-art performance on multilingual ASR and AST for multiple datasets in multiple domains. More specifically, Google compared the USM against public pipelines, including Whisper, and found that the USM outperforms Whisper, achieving a lower word error rate (WER).
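For readers unfamiliar with the metric, WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Google's evaluation code, the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization (real benchmarks typically normalize casing and punctuation first).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: whitespace tokenization, no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is missing from the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value means fewer errors, and because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.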
(Research paper authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu)