Your first model
Installing
Section titled “Installing”The install script detects your accelerator (NVIDIA GPU, Apple Silicon, Intel CPU with MKL, or none) and builds the binary with the matching feature flags.
Linux or macOS:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | shWindows (PowerShell):
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iexThe binary is installed to ~/.cargo/bin/mistralrs. The installer adds that directory to your PATH, but the change does not apply to the current shell. Start a new shell, or run source "$HOME/.cargo/env". Then verify:
mistralrs --versionIf the command prints a version, installation succeeded. To check detected hardware, compiled accelerator features, and Hugging Face connectivity, run:
mistralrs doctorFor “command not found” or missing-toolkit errors, see the per-platform installation guides.
Running a model
Section titled “Running a model”Qwen3-4B is used here. The native BF16 weights are about 8 GB and fit on a 12 GB GPU. The license does not require acceptance on Hugging Face. On smaller GPUs the download succeeds but the model will not fit at native precision; see Tutorial 6 for quantization.
mistralrs run -m Qwen/Qwen3-4BOn first run the command:
- Downloads the weights and configuration from Hugging Face into the local cache
- Loads the model onto the detected accelerator
The model is ready when an empty prompt appears:
>Type a message and press Enter. The response streams a token at a time:
> What does Rust's ownership system actually buy you?Rust's ownership model gives you memory safety without a garbage collector...A few commands are available at the prompt: /clear resets the conversation, /exit (or Ctrl+D) quits, and /help lists the rest.
The model loads at native precision (BF16 for Qwen3-4B), so the full weights must fit in GPU memory. For larger models that do not fit, use --quant 4: it prefers a prebuilt UQFF from mistralrs-community if one exists, otherwise applies ISQ at 4 bits. --quant auto benchmarks your hardware and picks. See Tutorial 6 for the details.
mistralrs infers the model architecture, chat template, and target device from the Hugging Face repository. Every inferred choice can be overridden with a flag.
Next steps
Section titled “Next steps”- Serving a model as an API: put the same model behind an OpenAI-compatible HTTP endpoint.
- Using the Python SDK: embed a model in a Python program.
- Quantizing a model: run larger models on the same hardware.