Speech models
mistral.rs supports two speech-related model families:
- Voxtral: multimodal model accepting audio input. Used for transcription and audio understanding through /v1/chat/completions. It uses a Whisper-style audio encoder.
- Dia: dedicated text-to-speech model served via /v1/audio/speech.
Voxtral is classified as a multimodal model (audio is one of its input modalities); Dia is classified as a dedicated speech model.
Voxtral: audio in, text out
    mistralrs serve -m mistralai/Voxtral-Mini-3B-2507

The -m flag alone is enough: the auto-loader detects Voxtral's native Mistral layout (params.json, consolidated.safetensors, tekken.json).
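For a locally downloaded copy of the model, a quick check that the three files named above are present might look like the sketch below. It only compares file names; it is not the auto-loader's detection logic, and the helper name `missing_native_layout_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the text above. The check is an illustration only,
# not mistral.rs's actual detection code.
NATIVE_LAYOUT_FILES = ("params.json", "consolidated.safetensors", "tekken.json")


def missing_native_layout_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in NATIVE_LAYOUT_FILES if not (root / name).is_file()]
```

An empty list means all three files were found; anything else names what is missing.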
Voxtral fits the multimodal chat shape: audio is an input content part, the response is text. The text prompt selects the task: transcription, summarization, speaker analysis, etc.
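To make the "text prompt selects the task" point concrete, here is a minimal sketch that builds the request body for two different tasks over the same clip. The payload shape mirrors the curl example in this section; the helper name `build_audio_chat_request` and the summarization prompt are assumptions for illustration.

```python
import json
from pathlib import Path


def build_audio_chat_request(audio_path, instruction, model="default"):
    """Build a /v1/chat/completions body with one audio part and one text part."""
    # as_uri() requires an absolute path and yields a file:// URL.
    audio_url = Path(audio_path).absolute().as_uri()
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": instruction},
                ],
            }
        ],
    }


# Same clip, two different tasks: only the text part changes.
transcribe = build_audio_chat_request("clip.wav", "Transcribe this.")
summarize = build_audio_chat_request("clip.wav", "Summarize this recording in two sentences.")
print(json.dumps(transcribe, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of a POST to /v1/chat/completions.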
    # HTTP: send an audio clip plus a text instruction.
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "default",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "audio_url", "audio_url": {"url": "file:///clip.wav"}},
            {"type": "text", "text": "Transcribe this."}
          ]
        }]
      }'

    # Python SDK
    from mistralrs import ChatCompletionRequest, MultimodalArchitecture, Runner, Which

    runner = Runner(
        which=Which.MultimodalPlain(
            model_id="mistralai/Voxtral-Mini-3B-2507",
            arch=MultimodalArchitecture.Voxtral,
        )
    )

    response = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="default",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "audio_url", "audio_url": {"url": "file:///absolute/path/clip.wav"}},
                        {"type": "text", "text": "Transcribe this audio."},
                    ],
                }
            ],
            max_tokens=256,
            temperature=0,
        )
    )
    print(response.choices[0].message.content)

    // Rust SDK
    use mistralrs::{AudioInput, MultimodalMessages, MultimodalModelBuilder, TextMessageRole};
    let model = MultimodalModelBuilder::new("mistralai/Voxtral-Mini-3B-2507")
        .build()
        .await?;

    // Read the clip from disk and wrap it as an audio input.
    let audio_bytes = std::fs::read("clip.wav")?;
    let audio = AudioInput::from_bytes(&audio_bytes)?;

    // Attach the audio to a user message; the text selects the task.
    let messages = MultimodalMessages::new().add_audio_message(
        TextMessageRole::User,
        "Transcribe this audio.",
        vec![audio],
    );

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

Dia: text-to-speech
/v1/audio/speech follows the OpenAI request format. Start the server:

    mistralrs serve -m nari-labs/Dia-1.6B

Dia understands dialogue speaker tags such as [S1] and [S2], and nonverbal parentheticals such as (laughs) or (coughs). Use them in the input string when you want dialogue or expressive speech.
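As a small illustration of the tag syntax, the sketch below joins speaker turns into a single input string. The helper name `format_dia_script` is hypothetical, and restricting tags to S1 and S2 is a conservative choice made here because those are the only tags the text names.

```python
def format_dia_script(turns):
    """Join (speaker_tag, text) pairs into one Dia input string."""
    parts = []
    for speaker, text in turns:
        # Only the [S1] and [S2] tags are named in the text above.
        if speaker not in ("S1", "S2"):
            raise ValueError(f"unsupported speaker tag: {speaker!r}")
        parts.append(f"[{speaker}] {text.strip()}")
    return " ".join(parts)


script = format_dia_script([
    ("S1", "Did the build finish?"),
    ("S2", "It did. (laughs) On the third try."),
])
print(script)
# prints: [S1] Did the build finish? [S2] It did. (laughs) On the third try.
```

The resulting string goes into the input field of the request as-is.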
    # HTTP: request WAV output and save it to out.wav.
    curl http://localhost:1234/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{
        "model": "default",
        "input": "[S1] Hello. This is a test of the text-to-speech system.",
        "response_format": "wav"
      }' \
      --output out.wav

- Output: raw audio bytes.
- response_format: only wav and pcm are accepted; mp3, opus, aac, and flac return a validation error.
- Extra OpenAI fields such as voice, speed, and instructions are silently ignored (the request reads only model, input, and response_format).
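The rules above can also be checked on the client before sending anything. This is a minimal sketch, not part of mistral.rs: the helper name `build_speech_request` is hypothetical, and defaulting response_format to "wav" is a choice of this sketch, not a documented server default.

```python
SUPPORTED_FORMATS = ("wav", "pcm")


def build_speech_request(text, response_format="wav", model="default", **extra):
    """Build a /v1/audio/speech body; return it with the names of dropped fields."""
    if response_format not in SUPPORTED_FORMATS:
        # The server would reject this with a validation error, so fail early.
        raise ValueError(
            f"response_format must be one of {SUPPORTED_FORMATS}, got {response_format!r}"
        )
    # Only these three fields are read by the endpoint, per the list above.
    body = {"model": model, "input": text, "response_format": response_format}
    # Fields such as voice or speed would be ignored, so report them instead.
    return body, sorted(extra)


body, ignored = build_speech_request("[S1] Hello.", voice="alloy", speed=1.2)
print(body)
print("ignored fields:", ignored)
```

Returning the dropped field names makes the silent-ignore behaviour visible to the caller, which helps when porting code written against OpenAI's endpoint.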
    # Python SDK
    import struct
    import wave
    from pathlib import Path

    from mistralrs import Runner, SpeechLoaderType, Which

    runner = Runner(
        which=Which.Speech(
            model_id="nari-labs/Dia-1.6B",
            arch=SpeechLoaderType.Dia,
        )
    )

    response = runner.generate_audio("[S1] mistral r s can generate speech locally.")

    output_path = Path("out.wav")
    # Scale each float sample to the signed 16-bit range and clamp it.
    pcm_ints = [int(max(-32768, min(32767, sample * 32767))) for sample in response.pcm]
    # wave.open handles str paths and file objects; converting the Path
    # avoids relying on path-like support.
    with wave.open(str(output_path), "wb") as wav:
        wav.setnchannels(response.channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(response.rate)
        wav.writeframes(b"".join(struct.pack("<h", sample) for sample in pcm_ints))

    // Rust SDK
    use mistralrs::{speech_utils, SpeechLoaderType, SpeechModelBuilder};
    let model = SpeechModelBuilder::new("nari-labs/Dia-1.6B", SpeechLoaderType::Dia)
        .build()
        .await?;

    // Returns the PCM samples along with the sample rate and channel count.
    let (pcm, rate, channels) = model
        .generate_speech("[S1] mistral r s can generate speech locally.")
        .await?;

    // Write the samples to a WAV file.
    let mut output = std::fs::File::create("out.wav")?;
    speech_utils::write_pcm_as_wav(&mut output, &pcm, rate as u32, channels as u16)?;