Mistral Completes Voxtral Speech Stack With Launch of Text-to-Speech Model

On March 26, 2026, the French AI company Mistral released Voxtral TTS, adding text-to-speech (TTS) to its Voxtral model family and expanding the lineup into speech generation.

The release builds on the company’s July 2025 launch of Voxtral, a family of models designed to process and understand speech, supporting transcription, speech-to-text translation, summarization, and question-answering over audio. This was followed by updates aimed at improving automatic speech recognition (ASR) performance, including the February 2026 introduction of Voxtral Transcribe 2, with both batch and real-time transcription variants.

With the addition of Voxtral TTS, the Voxtral model family now spans speech input, language understanding, and speech output, positioning it as a more complete speech workflow stack. In practical terms, Voxtral can now support end-to-end speech workflows, either entirely within the model family or alongside existing speech-to-text and language model pipelines.

Voxtral TTS supports nine languages — English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — and can generate speech from as little as three seconds of reference audio, enabling zero-shot voice cloning, according to the company. Mistral also notes that the model can follow characteristics of a reference voice, including intonation, rhythm, and emotional delivery, without requiring explicit prosody or emotion tags.

Mistral positions Voxtral TTS as a lightweight model at around 4B parameters, designed for low-latency applications such as voice agents and streaming use cases. The company also emphasizes deployment flexibility, including the ability to run models locally or on-premise for scenarios where latency, data control, or regulatory requirements are important.

In a technical session on April 2, Mistral demonstrated how the model family can be used in a real-time translation workflow, combining Voxtral Mini Transcribe Realtime for speech-to-text, Mistral Large for translation, and Voxtral TTS for speech generation. In the demo, speech was transcribed as it was spoken, translated, and then generated back as audio. Real-time translation is one of the use cases highlighted by the company in its documentation and technical session.
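
The three-stage flow described in the demo (transcribe, translate, synthesize) can be sketched in Python as a simple generator. This is a minimal illustration only, not Mistral's implementation: the function name `translate_stream` and the three callables `transcribe`, `translate`, and `synthesize` are hypothetical placeholders that a developer would wire to whatever speech-to-text, translation, and TTS services they actually use.

```python
from typing import Callable, Iterable, Iterator


def translate_stream(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],   # hypothetical speech-to-text step
    translate: Callable[[str], str],      # hypothetical text translation step
    synthesize: Callable[[str], bytes],   # hypothetical text-to-speech step
) -> Iterator[bytes]:
    """Yield translated audio for each incoming audio chunk, in order."""
    for chunk in audio_chunks:
        text = transcribe(chunk)
        # Skip chunks that produced no text (for example, silence).
        if not text.strip():
            continue
        yield synthesize(translate(text))
```

Because the function is a generator, translated audio for an early chunk can be played back while later chunks are still arriving, which is the property a real-time workflow depends on. A production system would likely also need buffering at sentence boundaries and error handling, which this sketch leaves out.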

Performance, Access, and Early Developer Feedback

Mistral also published benchmark results comparing Voxtral TTS with competing systems. In automatic evaluations, results vary across languages and metrics. Voxtral TTS achieves higher speaker similarity scores, while ElevenLabs models — ElevenLabs Flash v2.5 and ElevenLabs v3 — show comparable or, in some cases, stronger results on other metrics, including intelligibility and naturalness.

In human evaluations, results also vary by task. Voxtral TTS is broadly on par with ElevenLabs models in controlled emotional speech and performs better when emotion is inferred from text. In zero-shot voice cloning, it achieves higher win rates, with a 68.4% preference over ElevenLabs Flash v2.5.

As with all vendor-reported benchmarks, these results should be read as company-reported performance rather than independent validation.

The release also marks a shift in licensing. Earlier Voxtral models were released as open weights under the Apache 2.0 license, which allows commercial use. Voxtral TTS, by contrast, is released under a CC BY-NC 4.0 license, which allows research and non-commercial use only. Commercial access is provided through Mistral’s API, priced at USD 0.016 per 1,000 characters of input text. Mistral’s Hugging Face materials indicate that the open-weight version is also limited to fixed voices, with voice customization available only through its platform.
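
To put the quoted API price in context, the short Python sketch below estimates cost from character count. It assumes billing is strictly linear at USD 0.016 per 1,000 characters, with no minimum charge, rounding, or volume discount; those details are assumptions rather than anything stated by Mistral, and the function name `estimate_tts_cost` is purely illustrative.

```python
from decimal import Decimal

# Price quoted in the article; treat it as a snapshot that may change.
USD_PER_1K_CHARS = Decimal("0.016")


def estimate_tts_cost(num_chars: int) -> Decimal:
    """Return the estimated API cost in USD for synthesizing num_chars characters.

    Assumes purely linear pricing (no minimums, rounding, or discounts).
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return USD_PER_1K_CHARS * Decimal(num_chars) / Decimal(1000)
```

Under these assumptions, a 250,000-character script would cost about USD 4.00 to synthesize.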

Early developer reactions on forums such as Reddit focused on the limitations of the open-weight release, particularly the absence of full voice customization. Licensing was also a point of discussion, with the CC BY-NC 4.0 release seen as more restrictive than earlier Voxtral models and potentially limiting commercial use. Some users also questioned performance comparisons with proprietary systems such as ElevenLabs.
