These issues have traditionally restricted the personalization and realism of synthetic voices.
Marco-Voice introduces mechanisms to separate emotion from voice identity, more flexible ways of representing emotions, and methods to integrate emotional cues with linguistic content throughout speech generation. Together, these innovations allow for greater control over both who is speaking and how they are speaking.
New Dataset Powers Advances
Marco-Voice is also supported by a new dataset, CSEMOTIONS, comprising about ten hours of Mandarin emotional speech recorded by professional actors across seven emotional categories: neutral, happy, angry, sad, surprise, playfulness, and fearful.
Combined with existing English and Mandarin corpora, the dataset enabled Marco-Voice to outperform earlier models such as Alibaba’s CosyVoice in both voice cloning and emotional expressiveness.
“By integrating speaker identity, emotional style, and linguistic content within a single framework, our system achieves superior speech quality and emotional richness while expanding the potential applications of TTS technology in multilingual and interactive environments,” the researchers said.
More Expressive and Personalized Speech Synthesis
Extensive experiments showed that Marco-Voice improves both the technical quality and the expressiveness of synthetic speech. Evaluations confirmed advances in clarity, emotional richness, and speaker similarity, while listeners preferred Marco-Voice over CosyVoice.
Performance gains were consistent across Mandarin and English, though some differences emerged. Audio prompts of 1–3 seconds produced better results than either longer or shorter ones, and female speakers scored higher on emotion recognition than male speakers, a gap that warrants further investigation.
The researchers argue that unifying voice cloning and emotional expression allows models to “learn the subtle interactions between speaker characteristics and emotional expressions,” producing more consistent, natural outputs.
“This work represents an important step toward more expressive and personalized speech synthesis,” they said.
Still, challenges remain: paired emotional data is costly to collect, and real-time deployment requires further optimization.
Marco-Voice’s code, data, and demos are publicly available on GitHub and HuggingFace, with the team inviting contributions from the research community.
Authors: Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, and Kaifu Zhang