Microsoft Says New Voice Conversion Feature Will Improve AI Dubbing

On June 27, 2025, Microsoft announced a new voice conversion feature in Azure AI Speech, currently in preview, that allows users to transform a speaker’s recorded voice into a different AI-generated voice without altering the original speech’s rhythm, tone, or emotion.

Initially available in the East US, West Europe, and Southeast Asia regions, the feature supports 28 en-US voices, many of which are also used in Microsoft’s text-to-speech services.

But unlike traditional text-to-speech, voice conversion requires no text input. Instead, it repurposes the intonation and expressiveness of an original audio clip, replacing only the voice identity.

While Microsoft has long supported synthetic speech, the new feature goes a step further: it moves the process from text-to-speech to speech-to-speech.

‘Consistent Experience’ in Multilingual Dubbing

Among Microsoft’s key use cases is multilingual dubbing. Localized audio content often varies in voice quality and style across languages. With voice conversion, Microsoft offers a potential solution by enabling the conversion of all dubbed audio into a single, consistent target voice, “ensuring a consistent experience across all languages.”

Microsoft says its system outperformed a leading competitor in internal tests, especially in Mandarin, where it delivered clearer and more natural-sounding speech. Performance in English was on par.

Voice conversion is also being added to Microsoft’s Custom Voice offering, now in private preview. This allows companies to apply voice conversion to their own branded synthetic voices, preserving the tone and emotion of the original audio while using a familiar voice identity. It requires only a small amount of training data, making it a “quick solution for dynamic voice customization,” according to Microsoft.

Microsoft has published implementation details and technical guidance for users interested in exploring the feature.
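
For readers who want a feel for what a speech-to-speech request might look like, the sketch below builds an SSML document that points at a source audio file instead of carrying text. It is an illustration only: the `mstts:voiceconversion` element and its `url` attribute are assumptions based on our reading of Microsoft's guidance and should be checked against the official documentation, and the function name, voice name, and audio URL are hypothetical placeholders.

```python
from xml.sax.saxutils import quoteattr


def build_voice_conversion_ssml(source_audio_url: str, target_voice: str) -> str:
    """Return an SSML string asking for source audio to be re-voiced.

    Assumption: the service accepts an <mstts:voiceconversion url="..."/>
    element inside <voice>. Verify the exact element and attribute names
    against Microsoft's documentation before relying on this.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">'
        # The target voice supplies the new voice identity.
        f"<voice name={quoteattr(target_voice)}>"
        # No text content: the source recording supplies rhythm and emotion.
        f"<mstts:voiceconversion url={quoteattr(source_audio_url)}/>"
        "</voice></speak>"
    )


if __name__ == "__main__":
    # Hypothetical placeholders, not real resources or confirmed voice names.
    print(build_voice_conversion_ssml(
        "https://example.com/dubbed-line.wav",
        "en-US-ExampleNeural",
    ))
```

The point of the sketch is the contrast with ordinary text-to-speech: the `voice` element contains a reference to a recording rather than a sentence to read, which matches the shift the article describes.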