Speech models

mistral.rs supports two speech-related model families:

  • Voxtral: multimodal model accepting audio input. Used for transcription and audio understanding through /v1/chat/completions. It uses a Whisper-style audio encoder.
  • Dia: dedicated text-to-speech model served via /v1/audio/speech.

Voxtral is classified as a multimodal model (audio is one of its input modalities); Dia is classified as a dedicated speech model.

mistralrs serve -m mistralai/Voxtral-Mini-3B-2507

-m alone is enough: the auto-loader detects Voxtral's native Mistral layout (params.json, consolidated.safetensors, tekken.json).

Voxtral fits the multimodal chat shape: audio is an input content part, the response is text. The text prompt selects the task: transcription, summarization, speaker analysis, etc.

curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{
"role": "user",
"content": [
{"type": "audio_url", "audio_url": {"url": "file:///clip.wav"}},
{"type": "text", "text": "Transcribe this."}
]
}]
}'
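
The same request can be built from a script. The following is a minimal Python sketch, not an official client: the helper names `build_audio_chat_request` and `post_json` are hypothetical, the base URL and the "default" model id are copied from the curl example, and reading `choices[0].message.content` assumes the usual OpenAI-style response layout.

```python
import json
import urllib.request

# Assumed from the curl example: local server on port 1234.
BASE_URL = "http://localhost:1234"


def build_audio_chat_request(audio_url, instruction, model="default"):
    """Build a chat-completions body with one audio part and one text part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": audio_url}},
                {"type": "text", "text": instruction},
            ],
        }],
    }


def post_json(path, body):
    """POST a JSON body to the server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def ask_about_audio(audio_url, instruction):
    """Send one audio clip plus a text instruction and return the text reply.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    reply = post_json("/v1/chat/completions",
                      build_audio_chat_request(audio_url, instruction))
    return reply["choices"][0]["message"]["content"]
```

Because the text part selects the task, calling `ask_about_audio("file:///clip.wav", "Transcribe this.")` and then the same clip with a summarization prompt changes only the instruction string.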

Dia is served through /v1/audio/speech, which follows the request shape of OpenAI's speech endpoint. Start the server with:

mistralrs serve -m nari-labs/Dia-1.6B

Dia understands dialogue speaker tags such as [S1] and [S2], and nonverbal parentheticals such as (laughs) or (coughs). Use them in the input string when you want dialogue or expressive speech.
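
For multi-turn scripts it can help to assemble the input string programmatically. The helper below is a small illustrative sketch (the name `format_dialogue` is hypothetical and not part of mistral.rs); it only uses the [S1] and [S2] tags named above and treats nonverbal cues as ordinary text inside a turn.

```python
def format_dialogue(turns):
    """Join (speaker_number, text) pairs into one Dia input string.

    Each turn is prefixed with its speaker tag, e.g. [S1] or [S2].
    """
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            # Only [S1] and [S2] are named in the text above; other tags are not assumed.
            raise ValueError(f"unsupported speaker: {speaker!r}")
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


script = format_dialogue([
    (1, "Did the build finish?"),
    (2, "It did. (laughs) On the third try."),
])
print(script)
# [S1] Did the build finish? [S2] It did. (laughs) On the third try.
```

The resulting string goes into the "input" field of the speech request.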

curl http://localhost:1234/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"input": "[S1] Hello. This is a test of the text-to-speech system.",
"response_format": "wav"
}' \
--output out.wav

  • Output: raw audio bytes in the response body.
  • response_format: only wav and pcm are accepted; mp3, opus, aac, and flac return a validation error.
  • Extra OpenAI fields such as voice, speed, and instructions are silently ignored (the request reads only model, input, and response_format).
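
These rules can be mirrored on the client side so that an unsupported format fails before any request is sent. The following is a rough Python sketch under the same assumptions as the curl example (local server on port 1234, model id "default"); `build_speech_request` and `synthesize` are hypothetical helper names, not part of mistral.rs.

```python
import json
import urllib.request

# Assumed from the curl example: local server on port 1234.
BASE_URL = "http://localhost:1234"
# Formats the notes above list as accepted.
SUPPORTED_FORMATS = ("wav", "pcm")


def build_speech_request(text, response_format="wav", model="default"):
    """Build a speech request body with only the three fields the server reads."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"response_format must be one of {SUPPORTED_FORMATS}, got {response_format!r}"
        )
    return {"model": model, "input": text, "response_format": response_format}


def synthesize(text, out_path, response_format="wav"):
    """POST the request and write the raw audio bytes of the reply to out_path."""
    body = build_speech_request(text, response_format)
    req = urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

With a Dia server running, `synthesize("[S1] Hello. This is a test of the text-to-speech system.", "out.wav")` corresponds to the curl call above. Fields such as voice or speed are left out on purpose, since they would be ignored anyway.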