Quickstart

mistral.rs is a single binary: mistralrs run chats with a model in your terminal, and mistralrs serve exposes the same model over OpenAI-compatible and Anthropic-compatible HTTP APIs. This page covers both, plus in-process inference from Python and Rust.

macOS / Linux

# 1. Install: downloads a prebuilt binary for your platform (or builds from source if none matches)
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

# 2. Chat: downloads the weights on first run
mistralrs run --quant 4 -m Qwen/Qwen3-4B

# 3. Or serve the same model over an OpenAI-compatible API on port 1234
mistralrs serve --quant 4 -m Qwen/Qwen3-4B

Windows

# 1. Install from PowerShell (downloads a prebuilt binary, or builds from source)
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

# 2. Chat: downloads the weights on first run
mistralrs run --quant 4 -m Qwen/Qwen3-4B

# 3. Or serve the same model over an OpenAI-compatible API on port 1234
mistralrs serve --quant 4 -m Qwen/Qwen3-4B

Accelerator specifics, manual builds, and the Python/Rust SDKs are below.

Install

The install script detects your platform and downloads a matching prebuilt binary, falling back to a source build if none matches your hardware or NVIDIA driver.

Install locations, per-platform binaries, and environment variables

Prebuilt binaries do not require Rust or a CUDA toolkit for standard acceleration. Installer-managed binaries install to ~/.mistralrs, with mistralrs symlinked into ~/.local/bin on Unix and launched from %USERPROFILE%\.local\bin on Windows. If needed, the installer adds that directory to your shell environment. Optional cuTile acceleration requires NVIDIA's separately installed tileiras tool. See cuTile setup for the official package commands.

Source builds install to the same managed location.

What you get per platform:

| Platform | Binary | Acceleration |
|---|---|---|
| Apple Silicon (macOS) | prebuilt | Metal |
| Linux x86_64 + NVIDIA GPU | prebuilt, matched to GPU and driver | CUDA, flash-attn, cuTile-capable on supported CUDA/SM pairs |
| Linux aarch64 + NVIDIA GPU (Grace: GH200/GB200/GB10) | prebuilt, matched to GPU and driver | CUDA, flash-attn, cuTile-capable on supported CUDA/SM pairs |
| Linux x86_64 / aarch64, no GPU | prebuilt | CPU |
| Windows x86_64 | prebuilt | CPU |
| anything else (Intel Mac, an unlisted GPU) | source build | detected at build time |

For NVIDIA GPUs, the installer checks nvidia-smi, then chooses the matching CUDA artifact:

| Driver reports | Artifact |
|---|---|
| CUDA 13.3+ on Hopper / sm90 | cuda133 |
| CUDA 13.2+ on Ampere/Ada / sm80, sm86, sm89 | cuda132 |
| CUDA 13.2+ on Blackwell / sm100, sm120, sm121 | cuda132 |
| CUDA 13.1+ on Blackwell / sm100, sm120, sm121 | cuda131 (no cuTile) |
| CUDA 13.0+ | cuda130 |
| CUDA 12.9+ on GB10 / sm121 | cuda129 |
| CUDA 12.8+ | cuda128 |

If no artifact matches the GPU or driver, the installer builds from source. See hardware support for the full matrix.
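The artifact table can be read as an ordered lookup. The sketch below is only an illustration of that reading, not the installer's actual logic. It assumes rows are tried top to bottom with the first match winning, and that rows without an SM list match any GPU. The names ARTIFACT_ROWS and choose_artifact are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# Rows copied from the table above, in the same order:
# (minimum CUDA version, allowed SM numbers or None for "any GPU", artifact).
ARTIFACT_ROWS: List[Tuple[Tuple[int, int], Optional[Set[int]], str]] = [
    ((13, 3), {90}, "cuda133"),
    ((13, 2), {80, 86, 89}, "cuda132"),
    ((13, 2), {100, 120, 121}, "cuda132"),
    ((13, 1), {100, 120, 121}, "cuda131"),  # no cuTile
    ((13, 0), None, "cuda130"),
    ((12, 9), {121}, "cuda129"),
    ((12, 8), None, "cuda128"),
]


def choose_artifact(cuda: Tuple[int, int], sm: int) -> Optional[str]:
    """Return the first matching artifact, or None (meaning: build from source)."""
    for min_cuda, sms, artifact in ARTIFACT_ROWS:
        # ASSUMPTION: first match wins; the real installer may order checks differently.
        if cuda >= min_cuda and (sms is None or sm in sms):
            return artifact
    return None


# A driver reporting CUDA 12.9 with an Ada GPU (sm89) falls through to the last row.
print(choose_artifact((12, 9), 89))
```

Under these assumptions, a driver older than CUDA 12.8 matches no row, which corresponds to the source-build fallback described above.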

Install-time environment variables:

MISTRALRS_INSTALL_TAG=v0.9.0 installs a specific release instead of the latest (downloads that release's prebuilt, or builds that git tag from source).
MISTRALRS_INSTALL_FROM_SOURCE=1 skips the prebuilt download and builds the latest master (bleeding edge) from source. The prebuilt path tracks the latest stable release.
MISTRALRS_INSTALL_NO_NCCL=1 (source builds) skips the nccl feature.
MISTRALRS_INSTALL_ALLOW_CUDA_MISMATCH=1 lets a source build continue when local nvcc is newer than the CUDA version reported by the NVIDIA driver.

At runtime, HF_HOME controls where models are cached (see environment variables).

curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

Need a source build instead (to track bleeding-edge master, pin a commit, or use a feature set the prebuilts omit)? See Build from source.

Verify the install

mistralrs --version
mistralrs doctor

doctor reports detected hardware, compiled accelerator features, and Hugging Face cache and connectivity checks:

[INFO] CUDA: nvcc 12.4, driver 12.4
[INFO] Build features: cuda, cudnn, flash-attn

If your accelerator is missing from build features, rebuild with the right --features. For other failures, see Troubleshooting.
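To check for an accelerator in a script rather than by eye, the doctor output can be scanned for the build-features line. This is a minimal sketch that assumes the line looks exactly like the sample above; the real output may differ between releases. The helper name build_features is hypothetical.

```python
from typing import Set


def build_features(doctor_output: str) -> Set[str]:
    """Extract the comma-separated names that follow 'Build features:'."""
    for line in doctor_output.splitlines():
        _, marker, rest = line.partition("Build features:")
        if marker:
            return {name.strip() for name in rest.split(",") if name.strip()}
    return set()  # line not found


# Sample text copied from the doctor excerpt above.
SAMPLE = """\
[INFO] CUDA: nvcc 12.4, driver 12.4
[INFO] Build features: cuda, cudnn, flash-attn
"""

features = build_features(SAMPLE)
print("cuda" in features)  # prints True
```

To run this against a real install, capture the text of mistralrs doctor (for example with subprocess.run and capture_output=True) and pass it to build_features. Whether the report goes to stdout or stderr is not stated on this page.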

Run your first model

mistralrs run --quant 4 -m Qwen/Qwen3-4B

The first run downloads the weights from Hugging Face into the local cache, then loads the model onto the detected accelerator. mistralrs infers the architecture, chat template, and target device from the repository; every inferred choice can be overridden with a flag.

--quant 4 prefers a prebuilt UQFF (Universal Quantized File Format) quantization from mistralrs-community when one is published, otherwise quantizes the weights as they load (in-situ). Omit it to load native precision (BF16 for Qwen3-4B, about 8 GB). See Quantization for the decision guide.
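The "about 8 GB" figure follows from simple arithmetic: parameters times bits per weight. The sketch below is a rough weights-only estimate. It assumes Qwen3-4B is rounded to 4 billion parameters and ignores quantization metadata, the KV cache, and activations, so real memory use will be higher.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 4e9  # ASSUMPTION: Qwen3-4B rounded to 4 billion parameters

print(weight_size_gb(PARAMS, 16))  # BF16, matches the "about 8 GB" above
print(weight_size_gb(PARAMS, 4))   # 4-bit weights, roughly a quarter of that
```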

The model is ready when an empty prompt appears. Type a message and press Enter; the response streams a token at a time.

> What does Rust's ownership system actually buy you?
Rust's ownership model gives you memory safety without a garbage collector...

Commands available at the prompt:

/help: list commands
/exit: quit (Ctrl+D also works)
/system <message>: add a system message without running the model
/clear: clear the chat history
/temperature <float>, /topk <int>, /topp <float>: adjust sampling

With multimodal models, include image, audio, or video paths or URLs directly in the prompt. For one-shot use, pass -i "your prompt" (optionally with --image, --video, or --audio) to send a single request and exit. The full flag set is in the CLI reference.
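For scripting the one-shot form, the command line can be assembled as an argument list. The sketch below uses only flags named on this page (run, --quant, -m, -i, --image). Combining --quant with -i is an assumption based on the run examples above, and the helper name one_shot_argv is hypothetical.

```python
import shlex
from typing import List, Optional


def one_shot_argv(
    model: str,
    prompt: str,
    image: Optional[str] = None,
    quant: Optional[int] = 4,
) -> List[str]:
    """Build the argv for a single-request `mistralrs run` invocation."""
    argv = ["mistralrs", "run"]
    if quant is not None:
        argv += ["--quant", str(quant)]
    argv += ["-m", model, "-i", prompt]
    if image is not None:
        # Only meaningful for multimodal models.
        argv += ["--image", image]
    return argv


argv = one_shot_argv("Qwen/Qwen3-4B", "Summarize ownership in one line.")
print(shlex.join(argv))  # shell-quoted form, for copy and paste
```

Passing the list to subprocess.run(argv, check=True) avoids shell quoting problems with prompts that contain spaces or quotes.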

Serve an API

mistralrs serve --quant 4 -m Qwen/Qwen3-4B

When loading completes:

2026-06-13T12:00:00.000000Z  INFO mistralrs_cli::commands::serve: Server listening on http://0.0.0.0:1234

The server binds 0.0.0.0 by default, which makes it reachable from any host on the network; pass --host 127.0.0.1 to restrict it to the local machine, and --port to change the port.

Leave it running and send a Chat Completions request from a second terminal. The examples below use curl, the OpenAI Python client, and Rust with reqwest:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "In one sentence, what is mistral.rs?"}
    ]
  }'

Any OpenAI client works against http://localhost:1234/v1:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write me a haiku about Rust."}],
)
print(response.choices[0].message.content)

The api_key field is required by the client but not validated by the server.
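Where the openai package is not installed, the same request can be made with the Python standard library. This is a minimal sketch that mirrors the curl example above. The function names build_chat_request and send are hypothetical, and the response parsing assumes the standard Chat Completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # default port from the serve example


def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build (but do not send) a Chat Completions POST request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    # ASSUMPTION: the response follows the OpenAI Chat Completions shape.
    return payload["choices"][0]["message"]["content"]
```

With the server running, print(send(build_chat_request("In one sentence, what is mistral.rs?"))) performs the request.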

From Rust, send the request to the running server with any HTTP client (here reqwest):

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "default",
        "messages": [
            {"role": "user", "content": "Write me a haiku about Rust."}
        ]
    });
    let resp: serde_json::Value = reqwest::Client::new()
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;
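    // as_str() yields the raw text; printing the JSON value directly would keep
    // the surrounding quotes and escape newlines.
    println!("{}", resp["choices"][0]["message"]["content"].as_str().unwrap_or_default());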
    Ok(())
}

To run inference in-process without a server, use the mistralrs crate directly (see Embed in Rust below).

The model name default is special-cased server-side: it always routes to the server's default model, so clients work without knowing the real model id. GET /v1/models lists the real id. Full example: server chat.
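To find the real id behind default, a client can read the GET /v1/models response. The sketch below assumes the response uses the OpenAI list shape (an object with a data array of entries carrying an id). The sample payload is hypothetical and is not captured from a real server.

```python
from typing import Any, Dict, List


def model_ids(models_response: Dict[str, Any]) -> List[str]:
    """Return the ids listed in an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


# HYPOTHETICAL payload, shaped like the OpenAI models list.
sample = {"object": "list", "data": [{"id": "Qwen/Qwen3-4B", "object": "model"}]}

print(model_ids(sample))
```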

Three more things the same process gives you:

A web UI at http://localhost:1234/ui (pass --no-ui to disable). See Web UI.
Anthropic-compatible Messages endpoints at base http://localhost:1234. See Anthropic Messages API.
Interactive API docs at http://localhost:1234/docs.
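A request against the Anthropic-compatible surface might look like the sketch below. Several details are assumptions not confirmed on this page: the /v1/messages path (taken from Anthropic's own API), that the model name default is accepted on this endpoint too, and that the anthropic-version header is tolerated. See the Anthropic Messages API guide for the authoritative details.

```python
import json
import urllib.request

ANTHROPIC_BASE = "http://localhost:1234"  # base URL stated above


def build_messages_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an Anthropic-style Messages request."""
    body = {
        "model": "default",        # ASSUMPTION: special-cased here as well
        "max_tokens": max_tokens,  # required by the Anthropic Messages format
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{ANTHROPIC_BASE}/v1/messages",  # ASSUMPTION: Anthropic's standard path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "anthropic-version": "2023-06-01",  # ASSUMPTION: may be ignored
        },
        method="POST",
    )
```

With the server running, pass the result to urllib.request.urlopen to send it.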

The OpenAI-compatible API guide covers streaming, multiple models, and configuration.
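As a preview of streaming, the sketch below extracts text from a streamed Chat Completions response. It assumes the server emits OpenAI-style server-sent events: data: lines carrying JSON chunks with choices[].delta.content, ending with data: [DONE]. The sample lines are hand-written for illustration, and iter_stream_text is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


# Hand-written sample stream (ASSUMPTION: matches the OpenAI chunk format).
SAMPLE_STREAM = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]

print("".join(iter_stream_text(SAMPLE_STREAM)))  # prints Hello, world
```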

Gated models: log in to Hugging Face

Some models (Gemma, Llama, and others) require accepting a license before download. One-time setup per account, using google/gemma-4-E4B-it as the example:

1. Open the model page on Hugging Face, sign in, and accept the license at the top.
2. Create a read-only access token at huggingface.co/settings/tokens.
3. Pass the token to mistral.rs:

mistralrs login

login prompts for the token (or accepts --token hf_... non-interactively) and saves it to the Hugging Face token file ($HF_HOME/token, default ~/.cache/huggingface/token). If you have already logged in via huggingface-cli, skip this step: both tools read the same file. Then run or serve the gated model as usual:

mistralrs serve --quant 4 -m google/gemma-4-E4B-it
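To confirm where the token was saved, the lookup described above can be reproduced in a few lines. The sketch covers only the two cases stated on this page ($HF_HOME/token, else ~/.cache/huggingface/token). Any other locations or environment variables that Hugging Face tooling may honor are out of scope, and the helper names are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hf_token_path(env: Mapping[str, str] = os.environ) -> Path:
    """Resolve the token file: $HF_HOME/token, else ~/.cache/huggingface/token."""
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home) / "token"
    return Path.home() / ".cache" / "huggingface" / "token"


def read_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the saved token, or None if no token file exists."""
    path = hf_token_path(env)
    return path.read_text().strip() if path.is_file() else None


print(hf_token_path())
```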

Embed in Rust

Run inference in-process, without a server, using the mistralrs crate. Add mistralrs and tokio to Cargo.toml (build with the same accelerator features as the CLI, e.g. --features metal or --features cuda):

use mistralrs::{IsqBits, ModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let model = ModelBuilder::new("Qwen/Qwen3-4B")
        .with_auto_isq(IsqBits::Four)
        .build()
        .await?;

    let messages =
        TextMessages::new().add_message(TextMessageRole::User, "Write me a haiku about Rust.");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}

with_auto_isq(IsqBits::Four) is the in-process equivalent of --quant 4. See the Rust SDK guide for streaming, multimodal input, and sampling options.

Next steps

Run any model: Hugging Face ids, local files, GGUF, and quantization flags.
OpenAI-compatible API: the canonical serving guide.
Quantization: fit larger models on the same hardware.
Agents & tools: tool calling, web search, code execution, shell, OpenAI-compatible Skills, and MCP (Model Context Protocol).
Python SDK and Rust SDK: in-process inference without a server.