Academia and Startup Explore How to Preserve Voice, Emotion in AI Speech Translation

“The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style,” researchers at the Hong Kong University of Science and Technology (HKUST) and startup Soul AI noted in a September 25, 2025, paper.

To achieve this, they introduced UniSS (Unified Expressive Speech-to-Speech Translation) — a system designed to preserve the speaker’s voice, tone, and emotion in S2ST.

The researchers claim UniSS marks “a simpler and more effective paradigm” for expressive S2ST, outperforming both cascaded and end-to-end systems on translation accuracy, speech naturalness, voice and emotion preservation, and duration consistency.

They explained that traditional S2ST systems typically follow a cascaded approach, chaining together automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) components. While functional, this multi-step process often passes along and amplifies errors between stages and struggles to keep the speaker’s natural tone and rhythm.
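As a rough illustration of that cascade, the sketch below chains three toy stages. Everything in it (the `Utterance` type, the `toy_*` functions, the two-word lexicon) is a hypothetical stand-in and not part of UniSS or of any real ASR, MT, or TTS library. It shows only that text is the sole interface between stages, so a recognition error flows downstream and the speaker and emotion never reach the synthesizer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Utterance:
    text: str      # what is said
    speaker: str   # who is speaking
    emotion: str   # how it is said


# Hypothetical two-word lexicon standing in for a real MT model.
TOY_LEXICON: Dict[str, str] = {"hello": "nihao", "world": "shijie"}


def toy_asr(utt: Utterance, mishear: Optional[Dict[str, str]] = None) -> str:
    """Return text only; speaker and emotion are discarded at this stage."""
    mishear = mishear or {}
    return " ".join(mishear.get(w, w) for w in utt.text.split())


def toy_mt(text: str) -> str:
    """Word-by-word lookup; unknown words pass through untranslated."""
    return " ".join(TOY_LEXICON.get(w, w) for w in text.split())


def toy_tts(text: str) -> Utterance:
    """Synthesize with a fixed voice, since no style information arrives here."""
    return Utterance(text=text, speaker="default", emotion="neutral")


def cascaded_s2st(utt: Utterance, mishear: Optional[Dict[str, str]] = None) -> Utterance:
    """ASR -> MT -> TTS, with plain text as the only hand-off between stages."""
    return toy_tts(toy_mt(toy_asr(utt, mishear)))
```

Calling `cascaded_s2st(Utterance("hello world", "alice", "excited"), mishear={"world": "word"})` yields the text `nihao word` in a default, neutral voice: the toy ASR error survives translation, and the original speaker and emotion are gone.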

Recent end-to-end systems improve on the cascaded approach but remain complex, and many don’t fully take advantage of the translation abilities of large language models (LLMs), they added.

“Progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models.” — Cheng et al.

A Unified Model That Listens, Translates, and Speaks

Built upon the Qwen2.5-1.5B-Instruct model (a text-based LLM), UniSS performs the entire process in a single, unified system.

It uses speech tokenizers that turn audio into language-like units the model can understand, and speech decoders that convert the translated result back into natural speech.

A cross-modal chain-of-thought prompting method guides the model to reason step-by-step: first listening to the source speech, then translating it as text inside the model, and finally speaking it back in another language. 

The system also breaks down speech into three key elements — who is speaking (voice characteristics), what they’re saying (content), and how they’re saying it (emotion and style) — helping it preserve the speaker’s identity and emotion while accurately translating their words.
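One way to picture that three-way split is as a record with separate token streams, where translation replaces only the content stream. The sketch below is a simplified assumption for illustration: the `SpeechTokens` type, its field names, and the integer token IDs are hypothetical and do not reflect the actual UniSS tokenizers.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SpeechTokens:
    speaker: Tuple[int, ...]  # who is speaking (voice characteristics)
    content: Tuple[int, ...]  # what they are saying
    style: Tuple[int, ...]    # how they are saying it (emotion and style)


def swap_content(source: SpeechTokens, translated: Sequence[int]) -> SpeechTokens:
    """Return a copy with new content tokens; speaker and style are carried over."""
    return replace(source, content=tuple(translated))
```

For example, `swap_content(SpeechTokens((7,), (1, 2, 3), (42,)), [9, 8])` keeps the speaker tokens `(7,)` and style tokens `(42,)` and changes only the content to `(9, 8)`.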

UniSS runs in two modes: quality mode, which prioritizes accuracy by following the complete step-by-step process, and performance mode, which speeds up translation by skipping some intermediate steps while maintaining good quality.
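The difference between the two modes can be sketched as a difference in which intermediate outputs the model generates before the target speech. The stage names below are hypothetical, and the choice of which stage performance mode skips is an assumption made for illustration; the article says only that some intermediate steps are skipped.

```python
from typing import Dict, List

# Hypothetical stage names. Which stage performance mode drops is an assumption.
MODE_STAGES: Dict[str, List[str]] = {
    "quality": ["source_text", "target_text", "target_speech"],
    "performance": ["target_text", "target_speech"],
}


def generation_plan(mode: str) -> List[str]:
    """Return the ordered list of outputs generated in the given mode."""
    if mode not in MODE_STAGES:
        raise ValueError(f"unknown mode: {mode!r}")
    return list(MODE_STAGES[mode])


def skipped_stages(mode: str) -> List[str]:
    """Return the stages of the full (quality) chain that this mode omits."""
    plan = generation_plan(mode)
    return [stage for stage in MODE_STAGES["quality"] if stage not in plan]
```

Under these assumptions, `skipped_stages("performance")` returns `["source_text"]`, while both modes still end with `target_speech`.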

Recognizing that high-quality expressive S2ST training data is scarce, the researchers also built UniST, a large-scale Chinese-English dataset comprising 44.8k hours of parallel speech data that can be used for training models like UniSS.

More Natural and Emotion-Preserving Speech

UniSS was tested against a range of top systems, both cascaded and end-to-end, including Meta’s Seamless-Expressive, ByteDance’s Seed LiveInterpret 2.0, OpenAI’s GPT-4o, and Alibaba’s Qwen2.5-Omni. Across both automatic and human evaluations, it delivered higher translation accuracy and more natural, emotion-preserving speech.

While UniSS currently supports Chinese-English translation pairs, the researchers say their data construction pipeline and training framework can be extended to multilingual scenarios.

“Our work demonstrates a simple and effective approach for building the next generation of expressive S2ST systems,” they concluded.

Audio samples and implementation details are available on the project’s demo site and GitHub repository.

Authors: Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue