As a bonus, the researchers found they could adjust the foreign accent of the synthesized voice to make it sound more native, alleviating a known issue in cross-lingual TTS. The model was even able to handle code-switching fluently despite a lack of examples in the training data. This mixing of multiple languages in speech, which is characteristic of many multilingual communities, is a tricky area for traditional TTS.
Microsoft has so far disclosed little about its intentions for VALL-E X and has not yet released the code.
Readers can judge the quality of the generated speech for themselves by listening to the demo. It includes examples of the model’s performance on a range of speech generation tasks: cross-lingual text-to-speech that clones the voice from an input prompt, speech-to-speech translation, foreign accent control, emotion maintenance, and synthesis of code-switching utterances.
Unexpected Arrival
For now, the model supports just a single language pair, Chinese↔English, but the next step is to expand to other languages. That should be straightforward for high-resource languages. One of the major benefits of Microsoft’s novel approach is that it does not require hard-to-source paired bilingual data from the same speakers for training. For low-resource languages, leveraging knowledge from high-resource languages through transfer learning and data augmentation, or through language-agnostic meta-learning, could yield promising results.
The research paper itself reveals very little about Microsoft’s intentions for VALL-E X, and for now, the company has not released the code. There are many potential uses, starting with the obvious speech-to-speech translation tasks that big tech has been working on for years: multilingual communication and machine dubbing, but with the bonus of preserving the voice of the original speaker.
Alongside the likes of Meta, Google, and Amazon, a whole raft of startups specializing in this space have sprung up. ElevenLabs is one intriguing example. Despite receiving a lot of negative press earlier this year when it released its voice-cloning tool to beta testers, there is no denying the technology is impressive. In fact, it is precisely for that reason that it was so effectively abused to create deepfakes of celebrities spewing hate speech, to spoof voice ID and break into a bank account, and to spawn a new meme of US presidents trash-talking while gaming.
Although the unexpected arrival of VALL-E X may have somewhat stolen its thunder, ElevenLabs plans to release its “automatic dubbing tools that let you speak the language you don’t” later this year. These days, it seems that the next big AI breakthrough is never far off.