Voice Cloning Meets Emotional Speech Synthesis With Alibaba’s Marco-Voice Model

Alibaba researchers have unveiled Marco-Voice, a new text-to-speech (TTS) system that brings together voice cloning and emotional speech synthesis in a single framework.

With Marco-Voice, Alibaba aims to “address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts.” 

The team highlighted three persistent challenges in expressive speech synthesis. The first is the difficulty of separating emotion from speaking style: many TTS models mix speaker-specific emotion with prosodic style, making it hard to control voice identity and manner of speaking independently. The second is the trade-off between natural prosody and emotional consistency. The third is the limitation of treating emotions as simple discrete categories.

These issues have traditionally restricted the personalization and realism of synthetic voices.

Marco-Voice introduces mechanisms to separate emotion from voice identity, more flexible ways of representing emotions, and methods to integrate emotional cues with linguistic content throughout speech generation. Together, these innovations allow for greater control over both who is speaking and how they are speaking.
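To make the idea of independent control concrete, here is a minimal Python sketch, not Marco-Voice's actual implementation. It assumes that speaker identity and emotion are carried as separate vectors, and that emotion can be expressed as a continuous blend between a neutral embedding and a target emotion rather than a fixed label. All names (EMOTION_EMBEDDINGS, blend_emotion, build_conditioning) and the toy vector values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]

# Hypothetical toy emotion embeddings; a real system would learn these.
EMOTION_EMBEDDINGS: Dict[str, Vector] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.5, 0.0],
    "sad": [-1.0, -0.5, 0.0],
}


def blend_emotion(target: str, intensity: float) -> Vector:
    """Linearly interpolate from the neutral embedding toward a target emotion."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be between 0 and 1")
    neutral = EMOTION_EMBEDDINGS["neutral"]
    goal = EMOTION_EMBEDDINGS[target]
    return [(1.0 - intensity) * n + intensity * g for n, g in zip(neutral, goal)]


@dataclass(frozen=True)
class Conditioning:
    """Keeps 'who is speaking' and 'how they are speaking' in separate fields."""
    speaker: Vector
    emotion: Vector


def build_conditioning(speaker_embedding: Vector, target_emotion: str,
                       intensity: float) -> Conditioning:
    """Pair an unchanged speaker vector with a continuous emotion vector."""
    return Conditioning(
        speaker=list(speaker_embedding),
        emotion=blend_emotion(target_emotion, intensity),
    )
```

In this sketch, changing target_emotion or intensity never touches the speaker vector, which is the property the article describes as controlling who is speaking separately from how they are speaking.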

New Dataset Powers Advances

Marco-Voice is also supported by a new dataset, CSEMOTIONS, comprising about ten hours of Mandarin emotional speech recorded by professional actors across seven emotional categories: neutral, happy, angry, sad, surprised, playful, and fearful.

Combined with existing English and Mandarin corpora, the dataset enabled Marco-Voice to outperform earlier models such as Alibaba’s CosyVoice in both voice cloning and emotional expressiveness.

“By integrating speaker identity, emotional style, and linguistic content within a single framework, our system achieves superior speech quality and emotional richness while expanding the potential applications of TTS technology in multilingual and interactive environments,” the researchers said.

More Expressive and Personalized Speech Synthesis

Extensive experiments showed that Marco-Voice improves both the technical quality and the expressiveness of synthetic speech. Evaluations confirmed advances in clarity, emotional richness, and speaker similarity, while listeners preferred Marco-Voice over CosyVoice.

Performance gains were consistent across Mandarin and English, though some variations emerged: audio prompts of 1–3 seconds produced better results than longer or very short ones, and female speakers scored higher on emotion recognition than male speakers, a gap that warrants further investigation.
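As a rough illustration of how the prompt-length finding might be applied in practice, a reference clip could be trimmed to a few seconds before it is passed to a voice-cloning model. The sketch below is an assumption for illustration only: the function name trim_prompt, the 3-second default, and the choice of a centered window are hypothetical and are not taken from the Marco-Voice code.

```python
from typing import List, Sequence


def trim_prompt(samples: Sequence[float], sample_rate: int,
                max_seconds: float = 3.0) -> List[float]:
    """Return a centered window of at most max_seconds from a mono clip."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    max_len = int(sample_rate * max_seconds)
    if len(samples) <= max_len:
        # Clip is already short enough; return a copy unchanged.
        return list(samples)
    # Assumption: the middle of the clip is a reasonable default region to keep.
    start = (len(samples) - max_len) // 2
    return list(samples[start:start + max_len])
```

A 5-second clip at 16 kHz would be reduced to 48,000 samples, while a 2-second clip would pass through untouched.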

The researchers argue that unifying voice cloning and emotional expression allows models to “learn the subtle interactions between speaker characteristics and emotional expressions,” producing more consistent, natural outputs.

“This work represents an important step toward more expressive and personalized speech synthesis,” they said.

Still, challenges remain: paired emotional data is costly to collect, and real-time deployment requires further optimization.

Marco-Voice’s code, data, and demos are publicly available on GitHub and Hugging Face, and the team is inviting contributions from the research community.

Authors: Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, and Kaifu Zhang