Run any model
Point mistral.rs at a model and it figures out the rest: text, multimodal, embedding, speech, and diffusion models are detected automatically from the checkpoint, so one command shape covers all of them.
    mistralrs run -m Qwen/Qwen3-4B
    mistralrs serve -m Qwen/Qwen3-4B

run opens an interactive chat (or one-shot with -i); serve starts the OpenAI-compatible server on port 1234.
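Because serve speaks the OpenAI wire format, any HTTP client can talk to it. The following is a minimal sketch using only the Python standard library. It assumes the server is running locally on port 1234 and exposes the conventional OpenAI route /v1/chat/completions; the helper names build_chat_request and chat are illustrative and are not part of mistral.rs.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:1234", model="default"):
    """Build a POST request for an OpenAI-style chat completion.

    Assumption: the server exposes the conventional /v1/chat/completions route.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With mistralrs serve running in another terminal, print(chat("Hello!")) should print the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.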
    from mistralrs import Runner, Which, ChatCompletionRequest

    runner = Runner(which=Which.Plain(model_id="Qwen/Qwen3-4B"))

    res = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="default",
            messages=[{"role": "user", "content": "Hello!"}],
            max_tokens=256,
        )
    )
    print(res.choices[0].message.content)

arch is optional; it is detected from the model config.
    use anyhow::Result;
    use mistralrs::{ModelBuilder, TextMessageRole, TextMessages};

    #[tokio::main]
    async fn main() -> Result<()> {
        let model = ModelBuilder::new("Qwen/Qwen3-4B")
            .with_logging()
            .build()
            .await?;

        let messages = TextMessages::new().add_message(TextMessageRole::User, "Hello!");
        let response = model.send_chat_request(messages).await?;
        println!("{}", response.choices[0].message.content.as_ref().unwrap());
        Ok(())
    }

Quantize on the way in with --quant
--quant <level> is the quantization front door. With a numeric level (2, 3, 4, 5, 6, 8) or an ISQ (in-situ quantization) name (q4k, afq8, …), it first looks for a prebuilt UQFF (Universal Quantized File Format) at mistralrs-community/<model-name>-UQFF and downloads the matching file; if no UQFF repo or matching shard exists (or the model is a local path), it falls back to ISQ at that level.
    mistralrs run --quant 4 -m Qwen/Qwen3-4B

--quant auto probes your hardware (the same analysis as mistralrs tune) and picks a level, or runs at full precision if the model fits. --quant conflicts with the explicit knobs --isq and --from-uqff; use those when you want to force ISQ or a specific UQFF file. Choosing a level and the full set of quantization options are covered in the quantization guide.
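The lookup order described above (prebuilt UQFF first, ISQ as the fallback) can be pictured as a small decision function. This is only an illustrative sketch of the documented behavior, not mistral.rs's actual implementation: the function names are hypothetical, the two boolean inputs stand in for the real filesystem and Hub checks, and it assumes <model-name> is the part of the Hub ID after the last slash.

```python
def uqff_repo_for(model_id):
    """Return the community UQFF repo name for a Hub model ID.

    Assumption: <model-name> is the final path component of the ID.
    """
    model_name = model_id.rstrip("/").split("/")[-1]
    return f"mistralrs-community/{model_name}-UQFF"


def plan_quant(model_id, level, *, is_local_path, uqff_has_match):
    """Decide how a --quant level would be satisfied.

    is_local_path:  True if -m points at a directory on disk.
    uqff_has_match: True if the UQFF repo exists and has a file for this level.
    Returns ("uqff", repo) or ("isq", level).
    """
    if is_local_path:
        # Local paths skip the prebuilt-UQFF probe entirely.
        return ("isq", level)
    if uqff_has_match:
        return ("uqff", uqff_repo_for(model_id))
    # No UQFF repo or no matching file: quantize in place instead.
    return ("isq", level)
```

For example, plan_quant("Qwen/Qwen3-4B", "4", is_local_path=False, uqff_has_match=True) selects the prebuilt file, while the same call with a local directory always resolves to ISQ.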
Local model directories
-m accepts a local path to a directory containing the model files (safetensors plus configs, or a Mistral-native consolidated.safetensors layout):
    mistralrs run -m /path/to/model-dir

Local paths are read straight from disk and never touch the network. Note that --quant skips the prebuilt-UQFF probe for local paths and goes directly to ISQ.
GGUF files
GGUF is not auto-detected; select it with --format gguf and name the file with -f. The model ID can be a Hugging Face repo or a local directory containing the file:
    # from the Hub
    mistralrs run --format gguf -m bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
      -f Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

    # from a local directory
    mistralrs run --format gguf -m /path/to/dir -f model-q4k.gguf

Multi-file models pass semicolon-separated names to -f. The tokenizer and chat template are read from the GGUF metadata; pass --tok-model-id <hf-id> to source them from the original repo instead.
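When launching from a script, the semicolon-separated -f value is easiest to assemble programmatically. The sketch below builds the argument list using only the flags named above; the helper name gguf_run_args and the shard filenames in the usage note are hypothetical.

```python
def gguf_run_args(model_id, filenames, tok_model_id=None):
    """Build the argv for `mistralrs run` with one or more GGUF files.

    filenames is a list; multiple entries are joined with ';' for -f.
    """
    if not filenames:
        raise ValueError("at least one GGUF filename is required")
    args = [
        "mistralrs", "run",
        "--format", "gguf",
        "-m", model_id,
        "-f", ";".join(filenames),
    ]
    if tok_model_id:
        # Source the tokenizer and chat template from the original repo.
        args += ["--tok-model-id", tok_model_id]
    return args
```

Passing the resulting list to subprocess.run (without shell=True) avoids having to quote the semicolon, which a shell would otherwise treat as a command separator. For instance, gguf_run_args("/path/to/dir", ["part-1.gguf", "part-2.gguf"]) yields a single -f value of part-1.gguf;part-2.gguf.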
    from mistralrs import Runner, Which

    runner = Runner(
        which=Which.GGUF(
            quantized_model_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
            quantized_filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
        )
    )

    use mistralrs::GgufModelBuilder;

    let model = GgufModelBuilder::new(
        "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
        vec!["Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"],
    )
    .build()
    .await?;

Forcing an architecture
Auto-detection covers normal checkpoints. For text models with a non-standard config, --arch (Python: arch=Architecture...) forces the loader; the accepted names are the lowercase forms in the supported models reference. Multimodal, speech, embedding, and diffusion architectures are always auto-detected on the CLI.
Chat template overrides
Some repos ship a missing or broken chat template. Pass -c/--chat-template <file> (a .json or .jinja file) or --jinja-explicit <file> to override it; bundled fixes live in the repo’s chat_templates/ directory. See chat templates for symptoms and how to write your own.
Running offline
Set HF_HUB_OFFLINE=1 to guarantee no network calls are made to the Hugging Face Hub. Files and repo listings are then served from the local cache only, and missing files fail fast instead of hanging on a download.
    # on a machine with network access: populate the cache
    mistralrs run -m Qwen/Qwen3-4B

    # later, or on the air-gapped machine with the cache copied over
    HF_HUB_OFFLINE=1 mistralrs serve -m Qwen/Qwen3-4B

Pre-download with huggingface-cli download <repo> or by running mistral.rs once online; mistralrs cache list shows what is cached. Files resolve from $HF_HUB_CACHE, falling back to $HF_HOME/hub, falling back to ~/.cache/huggingface/hub. A local model path (-m /path/to/dir) always reads from disk, so it works offline without any cache lookup. Related variables (HF_HOME, HF_TOKEN, …) are in the environment variables reference.
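To check which directory needs copying to an air-gapped machine, the fallback chain above can be mirrored in a few lines. This is a sketch of the documented lookup order only, not code taken from mistral.rs; the function name resolve_hub_cache is hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the Hub cache directory using the documented fallback order."""
    env = os.environ if env is None else env
    # 1. An explicit cache location wins.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    # 2. Otherwise use the hub/ subdirectory of HF_HOME.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    # 3. Finally fall back to the default under the home directory.
    return Path.home() / ".cache" / "huggingface" / "hub"
```

Calling resolve_hub_cache() with no arguments reads the current process environment; passing a dict makes it easy to preview what a different machine's settings would resolve to.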
Model-specific behavior
Most models need nothing beyond -m. The exceptions (thinking tags, MoE (Mixture of Experts) quantization, template fixes, MatFormer slices) are collected in model family notes.