Speech
Speech intelligence that passes the human test.
Frontier voice synthesis and transcription for natural, expressive, nuanced interactions.
Powered by state-of-the-art, open voice models.
Voxtral TTS.
Realistic, emotionally expressive voice generation and cloning.
Voxtral Mini Transcribe 2.
Batch transcription with speaker diarization and context biasing.
Voxtral Realtime.
Live streaming transcription with sub-200ms latency.
Why Mistral Speech?
Get started with Mistral Speech.
API.
Programmatic access to Mistral's audio models for custom integrations.
Playground.
Test voice generation, cloning, and transcription in Mistral Studio.
Enterprise.
Custom deployments, solutions, model training, and dedicated support.
Closing the loop on audio intelligence.
Construct and customize voices.
Voice agents.
Real-time voice-to-voice conversations that listen, reason, and respond with your brand's voice, tone, and domain knowledge.
Voice cloning.
Replicate any voice from a sample as short as 3 seconds, capturing tone, rhythm, and personality.
Text-to-speech.
Emotionally expressive speech and voice cloning that captures a speaker's personality. Adapt to any voice from a short sample, or use preset voices.
Capture every word.
Real-time transcription.
Streaming architecture that transcribes as audio arrives, not in chunks, with latency configurable to below 200ms.
Batch transcription.
Process hours-long meetings, call recordings, and compliance archives, with structured outputs and speaker attribution.
Cross-lingual adaptation.
Generate speech in one language using a voice from another, preserving accent and identity.
Prototype, test, tune, adapt.
Audio playground.
Test conversations, voice generation, and transcription in Mistral Studio with actors, voice emulation, diarization, and per-input controls.
Speaker diarization.
Identifies who said what and when, with speaker labels and start/end timestamps for meetings, interviews, and multi-party calls.
Context biasing.
Guide the model with up to 100 custom terms: names, technical vocabulary, internal jargon.
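As a rough illustration of the two features above, the sketch below prepares a context-biasing term list that respects the 100-term limit and merges diarized segments into speaker turns. It is a minimal offline sketch, not the Mistral API: the `Segment` fields, `MAX_BIAS_TERMS`, and all function names are hypothetical, and the real request and response schemas may differ.

```python
from dataclasses import dataclass

MAX_BIAS_TERMS = 100  # limit stated on this page; the constant name is hypothetical


@dataclass(frozen=True)
class Segment:
    """One diarized span. Field names are assumptions, not the API schema."""
    speaker: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def prepare_bias_terms(terms):
    """Strip whitespace, drop blanks and case-insensitive duplicates, enforce the cap."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} terms exceeds the limit of {MAX_BIAS_TERMS}"
        )
    return cleaned


def merge_turns(segments):
    """Sort segments by start time and merge consecutive ones from the same speaker."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(seg)
    return turns


def format_turn(turn):
    """Render one turn as a readable transcript line."""
    return f"[{turn.start:.1f}s-{turn.end:.1f}s] {turn.speaker}: {turn.text}"
```

Deduplicating before checking the cap means near-identical entries (for example, the same name in different casing) do not use up the 100-term budget.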
How teams use Mistral Speech today.
Resources.
Documentation.
News.
Community.
Frequently asked questions.
Try voice generation and transcription in the Audio Playground in Mistral Studio, integrate via API, or download open weights to self-host.
Yes, two: Voxtral Mini Transcribe 2, for batch transcription with speaker diarization and context biasing, and Voxtral Realtime, for live streaming transcription with sub-200ms latency.
Yes. Provide a voice sample as short as 3 seconds, and Voxtral TTS, our text-to-speech model, adapts to capture the speaker's tone, rhythm, and personality. You can also use preset voices or build your own voice library.
Transcription supports 13 languages including English, French, German, Spanish, Chinese, Hindi, Arabic, Portuguese, Russian, Japanese, Korean, Italian, and Dutch. Voice generation supports 9 languages with dialect-aware expressions in English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
For example, a speech-to-speech pipeline can combine Voxtral models with a Mistral LLM: Voxtral Realtime transcribes incoming speech, a Mistral LLM reasons over the transcript and decides on a response, and Voxtral TTS generates the spoken output. Each component is independently customizable and deployable. With cross-lingual voice adaptation, the pipeline can also handle live translation while preserving the speaker's accent and identity.
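The three-stage pipeline described above can be sketched as a chain of swappable callables. This is a minimal sketch under stated assumptions: `SpeechToSpeechPipeline` and the `fake_*` stages are hypothetical placeholders that run offline, and none of them calls a real Voxtral model or Mistral endpoint. In practice each stage would wrap the corresponding service call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToSpeechPipeline:
    """Chains three independent stages; each one can be replaced on its own."""
    transcribe: Callable[[bytes], str]   # stands in for a realtime transcription call
    respond: Callable[[str], str]        # stands in for an LLM call
    synthesize: Callable[[str], bytes]   # stands in for a text-to-speech call

    def run(self, audio_in: bytes) -> bytes:
        transcript = self.transcribe(audio_in)
        reply = self.respond(transcript)
        return self.synthesize(reply)


# Placeholder stages so the sketch runs offline (hypothetical, no model calls).
def fake_transcribe(audio: bytes) -> str:
    # Pretend the audio bytes are already UTF-8 text.
    return audio.decode("utf-8")


def fake_respond(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_synthesize(text: str) -> bytes:
    # Pretend the encoded text is the generated audio.
    return text.encode("utf-8")


pipeline = SpeechToSpeechPipeline(fake_transcribe, fake_respond, fake_synthesize)
audio_out = pipeline.run(b"hello")
```

Passing the stages in as plain callables mirrors the point that each component is independently customizable: swapping the response stage for a translation step, for instance, leaves the other two untouched.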
Yes, you can self-host Mistral Speech or deploy on Mistral Compute.




