Voxtral TTS supports nine languages — English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — and can generate speech from as little as three seconds of reference audio, enabling zero-shot voice cloning, according to the company. Mistral also notes that the model can follow characteristics of a reference voice, including intonation, rhythm, and emotional delivery, without requiring explicit prosody or emotion tags.
Mistral positions Voxtral TTS as a lightweight model at around 4B parameters, designed for low-latency applications such as voice agents and streaming use cases. The company also emphasizes deployment flexibility, including the ability to run models locally or on-premise for scenarios where latency, data control, or regulatory requirements are important.
In a technical session on April 2, Mistral demonstrated how the model family can be used in a real-time translation workflow, combining Voxtral Mini Transcribe Realtime for speech-to-text, Mistral Large for translation, and Voxtral TTS for speech generation. In the demo, speech was transcribed as it was spoken, translated, and then generated back as audio. Mistral highlights real-time translation as a use case in both its documentation and the technical session.
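The three-stage workflow described in the demo can be pictured as a simple chain of functions. The sketch below is illustrative only: `translate_stream` and its `transcribe`, `translate`, and `synthesize` parameters are hypothetical names, not Mistral's API, and each stage is passed in as a plain callable so that no real endpoint or SDK call is assumed.

```python
from typing import Callable, Iterable, Iterator


def translate_stream(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],   # hypothetical speech-to-text stage
    translate: Callable[[str], str],      # hypothetical text translation stage
    synthesize: Callable[[str], bytes],   # hypothetical text-to-speech stage
) -> Iterator[bytes]:
    """Yield translated speech for each incoming chunk of source audio."""
    for chunk in audio_chunks:
        text = transcribe(chunk)
        if not text.strip():
            # Nothing was recognized in this chunk (e.g. silence), so skip it.
            continue
        # Translate the transcript, then generate audio from the translation.
        yield synthesize(translate(text))


# Usage with trivial stand-ins for the three models:
demo_audio = list(
    translate_stream(
        [b"hello", b"", b"world"],
        transcribe=lambda chunk: chunk.decode(),
        translate=lambda text: text.upper(),
        synthesize=lambda text: text.encode(),
    )
)
```

Processing chunk by chunk, rather than waiting for a full recording, is what would let output audio begin while the speaker is still talking. How Mistral's demo actually segments and buffers audio is not described in the source.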
Mistral also published benchmark results comparing Voxtral TTS with competing systems. In automatic evaluations, results vary across languages and metrics. Voxtral TTS achieves higher speaker similarity scores, while ElevenLabs models — ElevenLabs Flash v2.5 and ElevenLabs v3 — show comparable or, in some cases, stronger results on other metrics, including intelligibility and naturalness.
In human evaluations, results also vary by task. Voxtral TTS is broadly on par with ElevenLabs models in controlled emotional speech and performs better when emotion is inferred from text. In zero-shot voice cloning, it achieves higher win rates, with a 68.4% preference over ElevenLabs Flash v2.5.
As with all vendor-reported benchmarks, these results should be read as company-reported performance rather than independent validation.
The release also marks a shift in licensing. Earlier Voxtral models were released as open weights under the Apache 2.0 license, which allows commercial use. Voxtral TTS, by contrast, is released under a CC BY-NC 4.0 license, which permits research and non-commercial use only. Commercial access is provided through Mistral’s API, priced at USD 0.016 per 1,000 characters of text converted to audio. Mistral’s Hugging Face materials indicate that the open-weight version is also limited to fixed voices, with voice customization available only through its platform.
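To put the API price in context, a rough cost estimate follows directly from the quoted rate. The sketch below assumes billing is strictly proportional to the number of text characters submitted, with no minimum charge, rounding rule, or volume discount. None of those details are stated in the source, and `estimate_cost_usd` is a hypothetical helper name.

```python
from decimal import Decimal

# Rate quoted in the article: USD 0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate the API cost of synthesizing `text`.

    Assumes pro-rata billing on character count (an assumption, not a
    documented billing rule).
    """
    return PRICE_PER_1K_CHARS_USD * len(text) / 1000


# A 250,000-character script would cost about USD 4 under this assumption.
script = "a" * 250_000
print(estimate_cost_usd(script))
```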
Early developer reactions on forums such as Reddit focused on the limitations of the open-weight release, particularly the absence of full voice customization. Licensing was also a point of discussion, with the CC BY-NC 4.0 release seen as more restrictive than earlier Voxtral models and potentially limiting commercial use. Some users also questioned performance comparisons with proprietary systems such as ElevenLabs.