Call a model from Rust

The Rust SDK embeds the engine directly into a Rust program. A Rust toolchain is required; see rustup.rs.

Creating the project

cargo new --bin hello-mistralrs
cd hello-mistralrs

Add the dependencies to Cargo.toml:

[dependencies]
anyhow = "1"
mistralrs = "0.8"
tokio = { version = "1", features = ["full"] }

The default features build for CPU. For hardware acceleration (GPU, or MKL on Intel CPUs), enable the matching features:

# NVIDIA GPU (CUDA)
mistralrs = { version = "0.8", features = ["cuda", "flash-attn", "cudnn"] }

# Apple Silicon (Metal)
mistralrs = { version = "0.8", features = ["metal"] }

# Intel CPU with MKL
mistralrs = { version = "0.8", features = ["mkl"] }

Feature names match the CLI build features. The cargo features reference lists every option.

A minimal request

Replace src/main.rs:

use anyhow::Result;
use mistralrs::{IsqBits, ModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    let model = ModelBuilder::new("google/gemma-4-E4B-it")
        .with_auto_isq(IsqBits::Four)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "In one sentence, what is Rust known for?",
    );

    let response = model.send_chat_request(messages).await?;
    // `content` is an Option, so unwrap() panics if the reply carries no text.
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

    Ok(())
}

Run with cargo run --release.

The first run downloads Gemma 4 into the Hugging Face cache. The Gemma license must be accepted first; see Tutorial 2.

ModelBuilder is a fluent configuration object. Each method returns self. The only required input is the Hugging Face repository id passed to ModelBuilder::new. Everything else has a default.
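The fluent pattern described above can be sketched in plain Rust. This is a minimal illustration, not the mistralrs implementation: EngineConfig, EngineConfigBuilder, and their methods are hypothetical names chosen for the example. It shows one required input, defaults for everything else, and methods that take and return self so calls can be chained.

```rust
// Hypothetical stand-in types for illustration; this is NOT the mistralrs ModelBuilder.
#[derive(Debug, Clone, PartialEq)]
pub struct EngineConfig {
    pub model_id: String,
    pub isq_bits: Option<u8>,
    pub logging: bool,
}

pub struct EngineConfigBuilder {
    config: EngineConfig,
}

impl EngineConfigBuilder {
    /// The model id is the only required input; every other field starts at a default.
    pub fn new(model_id: impl Into<String>) -> Self {
        Self {
            config: EngineConfig {
                model_id: model_id.into(),
                isq_bits: None,
                logging: false,
            },
        }
    }

    /// Takes `self` by value and returns it, which is what makes chaining possible.
    pub fn with_isq_bits(mut self, bits: u8) -> Self {
        self.config.isq_bits = Some(bits);
        self
    }

    /// Turns logging on; also returns `self` for chaining.
    pub fn with_logging(mut self) -> Self {
        self.config.logging = true;
        self
    }

    /// Consumes the builder and yields the finished configuration.
    pub fn build(self) -> EngineConfig {
        self.config
    }
}
```

With this sketch, EngineConfigBuilder::new("org/model").build() yields a configuration with no quantization and logging off, and each with_ call overrides exactly one default.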

with_auto_isq(IsqBits::Four) matches --isq 4 on the CLI. The engine selects a 4-bit format suited to the platform: AFQ4 on Metal, Q4K on CUDA or CPU. To pin a specific format, use with_isq(IsqType::Q4K) instead; see the quantization reference.
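The per-platform choice described above amounts to a small lookup. The sketch below restates it as code, assuming only the mapping given in this tutorial; Backend, FourBitFormat, and pick_four_bit_format are hypothetical names, and the engine's real selection logic may consider more than the backend.

```rust
// Hypothetical types for illustration; they are not part of the mistralrs API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Metal,
    Cuda,
    Cpu,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FourBitFormat {
    Afq4,
    Q4k,
}

/// Mirrors the mapping stated in the text: AFQ4 on Metal, Q4K on CUDA or CPU.
pub fn pick_four_bit_format(backend: Backend) -> FourBitFormat {
    match backend {
        Backend::Metal => FourBitFormat::Afq4,
        Backend::Cuda | Backend::Cpu => FourBitFormat::Q4k,
    }
}
```

Pinning a format with with_isq corresponds to skipping this lookup and naming the format directly.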

TextMessages assembles a basic chat conversation. For per-request sampling parameters, tool schemas, or logprobs, use RequestBuilder.

Streaming

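stream_chat_request returns a futures Stream of response chunks. The example imports futures::StreamExt, so add futures = "0.3" under [dependencies] in Cargo.toml first: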

use anyhow::Result;
use futures::StreamExt;
use mistralrs::{
    ChatCompletionChunkResponse, ChunkChoice, Delta, IsqBits, ModelBuilder, Response,
    TextMessageRole, TextMessages,
};
use std::io::Write;

#[tokio::main]
async fn main() -> Result<()> {
    let model = ModelBuilder::new("google/gemma-4-E4B-it")
        .with_auto_isq(IsqBits::Four)
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Write me a haiku about ownership.",
    );

    let mut stream = model.stream_chat_request(messages).await?;
    let stdout = std::io::stdout();
    let mut out = std::io::BufWriter::new(stdout.lock());

    while let Some(item) = stream.next().await {
        // Only text chunks are printed; error and completion variants are skipped here.
        if let Response::Chunk(ChatCompletionChunkResponse { choices, .. }) = item {
            if let Some(ChunkChoice {
                delta: Delta { content: Some(text), .. },
                ..
            }) = choices.first()
            {
                out.write_all(text.as_bytes())?;
                out.flush()?;
            }
        }
    }

    Ok(())
}

The stream yields Response values. Most are Response::Chunk carrying assistant output in choices[0].delta.content. Other variants cover errors and the final completion event; production code should pattern-match exhaustively. See docs.rs/mistralrs.
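Exhaustive matching can be sketched with a stand-in enum. StreamEvent and its variants below are hypothetical and deliberately simplified; the real Response enum has its own variant names and payloads, which docs.rs lists. The sketch also uses a plain iterator in place of an async stream so that it stays dependency-free. The point is the shape of the match: every variant has an arm, so the compiler rejects the code if a new variant is added and left unhandled.

```rust
// Hypothetical stand-in for the SDK's response enum; variant names are illustrative only.
#[derive(Debug)]
pub enum StreamEvent {
    /// Incremental output; the text delta may be absent on some chunks.
    Chunk(Option<String>),
    /// Final completion event.
    Done,
    /// The engine reported a failure.
    Error(String),
}

/// Concatenates text deltas until the stream finishes, or returns the first error.
pub fn collect_text<I>(events: I) -> Result<String, String>
where
    I: IntoIterator<Item = StreamEvent>,
{
    let mut text = String::new();
    for event in events {
        // No wildcard arm: each variant is handled explicitly.
        match event {
            StreamEvent::Chunk(Some(delta)) => text.push_str(&delta),
            StreamEvent::Chunk(None) => {}
            StreamEvent::Done => break,
            StreamEvent::Error(message) => return Err(message),
        }
    }
    Ok(text)
}
```

Compared with the if let in the streaming example, this form surfaces errors to the caller instead of silently dropping them.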

Notes

ModelBuilder::build() performs all model loading and is expensive. Call it once at startup and share the resulting Model. Model is reference-counted, cheap to clone, and thread-safe.
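The load-once, share-everywhere pattern can be sketched with the standard library. LoadedModel and answer_concurrently are hypothetical stand-ins, and the explicit Arc here only imitates the reference counting that, per the note above, Model already provides on its own. Threads are used instead of async tasks to keep the sketch dependency-free.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for a loaded model; this is NOT the mistralrs Model type.
pub struct LoadedModel {
    name: String,
}

impl LoadedModel {
    /// Stands in for the expensive one-time load performed at startup.
    pub fn load(name: &str) -> Self {
        Self { name: name.to_string() }
    }

    /// Stands in for sending a request; it only echoes the prompt.
    pub fn answer(&self, prompt: &str) -> String {
        format!("[{}] {}", self.name, prompt)
    }
}

/// Answers each prompt on its own thread, sharing one loaded model between them.
pub fn answer_concurrently(model: Arc<LoadedModel>, prompts: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = prompts
        .into_iter()
        .map(|prompt| {
            // Cloning the handle bumps a reference count; nothing is reloaded.
            let model = Arc::clone(&model);
            thread::spawn(move || model.answer(&prompt))
        })
        .collect();

    // Joining in spawn order keeps the answers aligned with the prompts.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

The load happens exactly once, before any worker starts; each worker receives only a cheap handle to the shared instance.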

Requests through the Rust SDK bypass the HTTP layer; there is no /v1/chat/completions endpoint and no OpenAI compatibility shim. To expose a Model over HTTP alongside direct in-process access, see the embed-in-axum guide.

Next steps

Tutorial 5: add tool calling and code execution.
Tutorial 6: choose between ISQ bit widths.
The Rust SDK guides cover async streaming, multimodal input, and embedding mistral.rs in existing web applications.