Introduction

Quick Links
| I want to… | Go to… |
|---|---|
| Install mistral.rs | Installation Guide |
| Understand cargo features | Cargo Features |
| Run a model | CLI Reference |
| Use the HTTP API | HTTP Server |
| Fix an error | Troubleshooting |
| Configure environment | Configuration |
| Check model support | Supported Models |
Getting Started
- Installation Guide - Install mistral.rs on your system
- Cargo Features - Complete cargo features reference
- CLI Reference - Complete CLI command reference
- CLI TOML Configuration - Configure via TOML files
- Troubleshooting - Common issues and solutions
SDKs & APIs
- Python SDK - Python package documentation
- Python Installation - Python SDK installation guide
- Rust SDK - Rust crate documentation
- HTTP Server - OpenAI-compatible HTTP API
- OpenResponses API - Stateful conversation API
Models
By Category
- Supported Models - Complete model list and compatibility
- Vision Models - Vision model overview
- Image Generation - Diffusion models
- Embeddings - Embedding model overview
Model-Specific Guides
Click to expand model guides
Text Models:
- DeepSeek V2 | DeepSeek V3
- Gemma 2 | Gemma 3 | Gemma 3n
- GLM4 | GLM-4.7-Flash | GLM-4.7
- Qwen 3 | SmolLM3 | GPT-OSS
Vision Models:
- Idefics 2 | Idefics 3
- LLaVA | Llama 3.2 Vision | Llama 4
- MiniCPM-O 2.6 | Mistral 3
- Phi 3.5 MoE | Phi 3.5 Vision | Phi 4 Multimodal
- Qwen 2-VL | Qwen 3 VL
Other Models:
Quantization & Optimization
- Quantization Overview - All supported quantization methods
- ISQ (In-Situ Quantization) - Quantize models at load time
- UQFF Format - Pre-quantized model format | Layout
- Topology - Per-layer quantization and device mapping
- Importance Matrix - Improve ISQ accuracy
Adapters & Model Customization
- Adapter Models - LoRA and X-LoRA support
- LoRA/X-LoRA Examples
- Non-Granular Scalings - X-LoRA optimization
- AnyMoE - Create MoE models from dense models
- MatFormer - Dynamic model sizing
Performance & Hardware
- Device Mapping - Multi-GPU and CPU offloading
- PagedAttention - Efficient KV cache management
- Speculative Decoding - Accelerate generation with draft models
- Flash Attention - Accelerated attention
- MLA - Multi-head Latent Attention
- Distributed Inference
Features
- Tool Calling - Function calling support
- Web Search - Integrated web search
- Chat Templates - Template customization
- Sampling Options - Generation parameters
- TOML Selector - Model selection syntax
- Multi-Model Support - Load multiple models
MCP (Model Context Protocol)
- MCP Client - Connect to external tools
- MCP Server - Serve models over MCP
- MCP Configuration
- MCP Transports
- MCP Advanced Usage
Reference
- Configuration - Environment variables and server defaults
- Engine Internals - Engine behaviors and recovery
- Supported Models - Complete compatibility tables
Contributing
See the main README for contribution guidelines.
Installation Guide
Quick Install (Recommended)
The install script automatically detects your hardware (CUDA, Metal, MKL) and builds with optimal features.
Linux/macOS:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
Prerequisites
1. Install required packages:
   - OpenSSL (Ubuntu): sudo apt install libssl-dev
   - pkg-config (Linux only): sudo apt install pkg-config
2. Install Rust from https://rustup.rs/:
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
   source $HOME/.cargo/env
3. (Optional) Set up HuggingFace authentication:
   mistralrs login
   Or use huggingface-cli login as documented here.
Supported Accelerators
| Accelerator | Feature Flag | Additional Flags |
|---|---|---|
| NVIDIA GPUs (CUDA) | cuda | flash-attn, flash-attn-v3, cudnn |
| Apple Silicon GPU (Metal) | metal | |
| CPU (Intel) | mkl | |
| CPU (Apple Accelerate) | accelerate | |
| Generic CPU (ARM/AVX) | none | ARM NEON / AVX enabled by default |
Note for Linux users: The metal feature is macOS-only. Use --features "cuda flash-attn cudnn" for NVIDIA GPUs or --features mkl for Intel CPUs instead of --all-features.
Feature Detection
Determine which features to enable based on your hardware:
| Hardware | Features |
|---|---|
| NVIDIA GPU (Ampere+, compute capability >= 8.0) | cuda cudnn flash-attn |
| NVIDIA GPU (Hopper, compute capability 9.0) | cuda cudnn flash-attn flash-attn-v3 |
| NVIDIA GPU (older) | cuda cudnn |
| Apple Silicon (macOS) | metal accelerate |
| Intel CPU with MKL | mkl |
| CPU only | (no features needed) |
Install from crates.io
cargo install mistralrs-cli --features "<your-features>"
Example:
cargo install mistralrs-cli --features "cuda flash-attn cudnn"
Build from Source
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo install --path mistralrs-cli --features "<your-features>"
Example:
cargo build --release --features "cuda flash-attn cudnn"
Docker
Docker images are available for quick deployment:
docker pull ghcr.io/ericlbuehler/mistral.rs:latest
docker run --gpus all -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
serve -m Qwen/Qwen3-4B
Docker images on GitHub Container Registry
Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
Python SDK
Install the Python package:
pip install mistralrs-cuda # For NVIDIA GPUs
pip install mistralrs-metal # For Apple Silicon
pip install mistralrs-mkl # For Intel CPUs
pip install mistralrs # CPU-only
Verify Installation
After installation, verify everything works:
# Check CLI is installed
mistralrs --help
# Run system diagnostics
mistralrs doctor
# Test with a small model
mistralrs run -m Qwen/Qwen3-0.6B
Getting Models
From Hugging Face Hub (Default)
Models download automatically from Hugging Face Hub:
mistralrs run -m meta-llama/Llama-3.2-3B-Instruct
For gated models, authenticate first:
mistralrs login
# Or: mistralrs run --token-source env:HF_TOKEN -m <model>
From Local Files
Pass a path to a downloaded model:
mistralrs run -m /path/to/model
Running GGUF Models
mistralrs run --format gguf -m author/model-repo -f model-quant.gguf
Specify tokenizer if needed:
mistralrs run --format gguf -m author/model-repo -f file.gguf -t author/official-tokenizer
Next Steps
- CLI Reference - All commands and options
- HTTP API - Run as an OpenAI-compatible server
- Python SDK - Python package documentation
- Troubleshooting - Common issues and solutions
Cargo Features Reference
This document provides a complete reference for all cargo features available in mistral.rs.
Quick Reference
| Feature | Description | Platform | Requires |
|---|---|---|---|
cuda | NVIDIA GPU acceleration | Linux, Windows | CUDA toolkit |
cudnn | NVIDIA cuDNN backend | Linux, Windows | cuda, cuDNN |
flash-attn | FlashAttention V2 | Linux, Windows | cuda, CC >= 8.0 |
flash-attn-v3 | FlashAttention V3 | Linux, Windows | cuda, CC >= 9.0 |
metal | Apple GPU acceleration | macOS | - |
accelerate | Apple CPU acceleration | macOS | - |
mkl | Intel MKL acceleration | Linux, Windows | Intel MKL |
nccl | Multi-GPU (NVIDIA NCCL) | Linux | cuda, NCCL |
ring | Multi-GPU/node (TCP ring) | All | - |
GPU Acceleration Features
cuda
Enables NVIDIA GPU acceleration via CUDA. This is the primary feature for running on NVIDIA GPUs.
Requirements:
- NVIDIA GPU
- CUDA toolkit installed
- Linux or Windows (WSL supported)
Usage:
cargo build --release --features cuda
cargo install mistralrs-cli --features cuda
What it enables:
- GPU tensor operations via CUDA
- PagedAttention on CUDA devices
- Quantized inference on GPU
cudnn
Enables NVIDIA cuDNN for optimized neural network primitives. Provides faster convolutions and other operations.
Requirements:
- cuda feature
- cuDNN library installed
Usage:
cargo build --release --features "cuda cudnn"
flash-attn
Enables FlashAttention V2 for faster attention computation. Significantly reduces memory usage and improves throughput.
Requirements:
- cuda feature (automatically enabled)
- GPU with compute capability >= 8.0 (Ampere or newer)
Compatible GPUs:
| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Ampere | 8.0, 8.6 | RTX 30 series, A100, A40 |
| Ada Lovelace | 8.9 | RTX 40 series, L40S |
| Blackwell | 10.0, 12.0 | RTX 50 series |
Usage:
cargo build --release --features "cuda flash-attn cudnn"
Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.
flash-attn-v3
Enables FlashAttention V3 for Hopper architecture GPUs. Provides additional performance improvements over V2 on supported hardware.
Requirements:
- cuda feature (automatically enabled)
- GPU with compute capability >= 9.0 (Hopper)
Compatible GPUs:
| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Hopper | 9.0 | H100, H800 |
Usage:
cargo build --release --features "cuda flash-attn-v3 cudnn"
Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.
metal
Enables Apple Metal GPU acceleration for macOS devices.
Requirements:
- macOS with Apple Silicon or AMD GPU
- macOS only (not available on Linux)
Usage:
cargo build --release --features metal
What it enables:
- GPU tensor operations via Metal
- PagedAttention on Metal devices (opt-in via --paged-attn)
- Quantized inference on Apple GPUs
Note: PagedAttention is disabled by default on Metal. Enable it with the --paged-attn flag.
CPU Acceleration Features
accelerate
Enables Apple’s Accelerate framework for optimized CPU operations on macOS.
Requirements:
- macOS
Usage:
cargo build --release --features accelerate
# Or combined with Metal:
cargo build --release --features "metal accelerate"
mkl
Enables Intel Math Kernel Library (MKL) for optimized CPU operations.
Requirements:
- Intel MKL installed
- Intel CPU recommended (works on AMD but Intel-optimized)
Usage:
cargo build --release --features mkl
Distributed Inference Features
nccl
Enables multi-GPU distributed inference using NVIDIA NCCL (NVIDIA Collective Communications Library). Implements tensor parallelism for splitting large models across multiple GPUs.
Requirements:
- cuda feature (automatically enabled)
- Multiple NVIDIA GPUs
- NCCL library
- World size must be a power of 2 (1, 2, 4, 8, etc.)
Usage:
cargo build --release --features "cuda nccl"
# Run with specific GPU count
MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct
Environment Variables:
| Variable | Description |
|---|---|
MISTRALRS_MN_LOCAL_WORLD_SIZE | Number of GPUs to use (defaults to all) |
MISTRALRS_NO_NCCL=1 | Disable NCCL and use device mapping instead |
Multi-node setup requires additional environment variables. See NCCL documentation for details.
Note: When NCCL is enabled, automatic device mapping is disabled.
ring
Enables distributed tensor-parallel inference using a TCP-based ring topology. Works across multiple machines without requiring NCCL.
Requirements:
- World size must be a power of 2 (2, 4, 8, etc.)
- TCP ports must be open between nodes
Usage:
cargo build --release --features ring
# Configure via JSON file
export RING_CONFIG=path/to/ring_config.json
mistralrs serve -m model-id
Configuration:
Create a JSON configuration file for each process:
{
"master_ip": "0.0.0.0",
"master_port": 1234,
"port": 12345,
"right_port": 12346,
"rank": 0,
"world_size": 2
}
| Field | Description |
|---|---|
master_ip | IP address for master node |
master_port | Port for master node |
port | Local port for incoming connections |
right_port | Port of right neighbor in ring |
right_ip | IP of right neighbor (optional, defaults to localhost) |
rank | Process rank (0 to world_size-1) |
world_size | Total number of processes (must be power of 2) |
See Ring documentation for detailed setup instructions.
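The per-rank config files differ only in port, right_port, and rank. As an illustrative sketch (assuming all ranks run on one host and that the right neighbor of rank i is rank (i+1) mod world_size; the port numbers here are hypothetical), the configs could be generated like this:
import json

world_size = 2                       # must be a power of 2
base_port = 12345                    # hypothetical first listen port
master_ip, master_port = "0.0.0.0", 1234

for rank in range(world_size):
    config = {
        "master_ip": master_ip,
        "master_port": master_port,
        "port": base_port + rank,                           # this process listens here
        "right_port": base_port + (rank + 1) % world_size,  # right neighbor's listen port
        "rank": rank,
        "world_size": world_size,
    }
    with open(f"ring_config_rank{rank}.json", "w") as f:
        json.dump(config, f, indent=2)
Each process is then started with RING_CONFIG pointing at its own file, as shown above.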
Feature Combinations
Recommended Combinations by Hardware
| Hardware | Recommended Features |
|---|---|
| NVIDIA Ampere+ (RTX 30/40, A100) | cuda cudnn flash-attn |
| NVIDIA Hopper (H100) | cuda cudnn flash-attn-v3 |
| NVIDIA older GPUs | cuda cudnn |
| Apple Silicon | metal accelerate |
| Intel CPU | mkl |
| Generic CPU | (no features needed) |
| Multi-GPU NVIDIA | cuda cudnn flash-attn nccl |
| Multi-node/cross-platform | ring (plus GPU features) |
Installation Examples
# NVIDIA GPU with all optimizations
cargo install mistralrs-cli --features "cuda cudnn flash-attn"
# Apple Silicon
cargo install mistralrs-cli --features "metal accelerate"
# Intel CPU
cargo install mistralrs-cli --features "mkl"
# Multi-GPU NVIDIA setup
cargo install mistralrs-cli --features "cuda cudnn flash-attn nccl"
# Build from source with CUDA
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo build --release --features "cuda cudnn flash-attn"
Internal Features
These features are primarily for library development and are not typically used directly:
| Feature | Description |
|---|---|
pyo3_macros | Python bindings support (used by mistralrs-pyo3) |
utoipa | OpenAPI documentation generation |
Python Package Features
The Python SDK is distributed as separate packages with features pre-configured:
| Package | Equivalent Features |
|---|---|
mistralrs-cuda | cuda cudnn flash-attn |
mistralrs-metal | metal accelerate |
mistralrs-mkl | mkl |
mistralrs | CPU only |
pip install mistralrs-cuda # NVIDIA GPUs
pip install mistralrs-metal # Apple Silicon
pip install mistralrs-mkl # Intel CPUs
pip install mistralrs # Generic CPU
Troubleshooting
Diagnosing Issues
Use mistralrs doctor to diagnose your system configuration and verify features are working correctly:
mistralrs doctor
This command checks:
- Detected hardware (GPUs, CPU features)
- Installed libraries (CUDA, cuDNN, etc.)
- Feature compatibility
- Common configuration issues
Feature not working
1. Run mistralrs doctor to check system configuration
2. Verify the feature is enabled in your build: cargo build --release --features "your-features" -v
3. Check hardware compatibility (especially for flash-attn)
4. Ensure required libraries are installed (CUDA, cuDNN, MKL, etc.)
Conflicting features
- flash-attn and flash-attn-v3 are mutually exclusive
- metal is macOS-only; don’t use with cuda
- nccl requires cuda
Build errors
- CUDA not found: Ensure the CUDA toolkit is installed and nvcc is in PATH
- MKL not found: Install Intel oneAPI or standalone MKL
- Metal errors on Linux: Remove the metal feature (macOS only)
See Troubleshooting for more solutions.
mistralrs CLI Reference
This is the comprehensive CLI reference for mistralrs. The CLI provides commands for interactive mode, HTTP server, builtin UI, quantization, and system diagnostics.
Table of Contents
- Commands
- run: run model in interactive mode
- serve: start HTTP/MCP server and (optionally) the UI
- from-config: run from a TOML configuration file
- quantize: generate UQFF quantized model file
- tune: recommend quantization + device mapping for a model
- doctor: run system diagnostics and environment checks
- login: authenticate with HuggingFace Hub
- cache: manage the HuggingFace model cache
- bench: run performance benchmarks
- completions: generate shell completions
- Model Types
- Features
- Global Options
- Interactive Commands
Commands
run - Interactive Mode
Start a model in interactive mode for conversational use.
mistralrs run [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified. This allows a shorter syntax.
Examples:
# Run a text model interactively (shorthand - auto type is implied)
mistralrs run -m Qwen/Qwen3-4B
# Explicit auto type (equivalent to above)
mistralrs run auto -m Qwen/Qwen3-4B
# Run with thinking mode enabled
mistralrs run -m Qwen/Qwen3-4B --enable-thinking
# Run a vision model
mistralrs run -m google/gemma-3-4b-it
Options:
| Option | Description |
|---|---|
--enable-thinking | Enable thinking mode for models that support it |
The run command also accepts all runtime options.
serve - HTTP Server
Start an HTTP server with OpenAI-compatible API endpoints.
mistralrs serve [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified.
Examples:
# Start server on default port 1234 (shorthand)
mistralrs serve -m Qwen/Qwen3-4B
# Explicit auto type (equivalent to above)
mistralrs serve auto -m Qwen/Qwen3-4B
# Start server with web UI
mistralrs serve -m Qwen/Qwen3-4B --ui
# Start server on custom port
mistralrs serve -m Qwen/Qwen3-4B -p 3000
# Start server with MCP support
mistralrs serve -m Qwen/Qwen3-4B --mcp-port 8081
Server Options:
| Option | Default | Description |
|---|---|---|
-p, --port <PORT> | 1234 | HTTP server port |
--host <HOST> | 0.0.0.0 | Bind address |
--ui | disabled | Serve built-in web UI at /ui |
--mcp-port <PORT> | none | MCP protocol server port |
--mcp-config <PATH> | none | MCP client configuration file |
The serve command also accepts all runtime options.
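Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using Python's requests package (assuming the default port 1234 and the standard /v1/chat/completions route):
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-4B",  # the model id you passed to serve
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])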
quantize - UQFF Generation
Generate a UQFF (Unified Quantized File Format) file from a model.
mistralrs quantize [MODEL_TYPE] -m <MODEL_ID> --isq <LEVEL> -o <OUTPUT>
Examples:
# Quantize a text model to 4-bit
mistralrs quantize -m Qwen/Qwen3-4B --isq 4 -o qwen3-4b-q4.uqff
# Quantize with Q4_K format
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-q4k.uqff
# Quantize a vision model
mistralrs quantize -m google/gemma-3-4b-it --isq 4 -o gemma3-4b-q4.uqff
# Quantize with imatrix for better quality
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k --imatrix imatrix.dat -o qwen3-4b-q4k.uqff
Quantize Options:
| Option | Required | Description |
|---|---|---|
-m, --model-id <ID> | Yes | Model ID or local path |
--isq <LEVEL> | Yes | Quantization level (see ISQ Quantization) |
-o, --output <PATH> | Yes | Output UQFF file path |
--isq-organization <TYPE> | No | ISQ organization strategy: default or moqe |
--imatrix <PATH> | No | imatrix file for enhanced quantization |
--calibration-file <PATH> | No | Calibration file for imatrix generation |
tune - Recommendations
Get quantization and device mapping recommendations for a model. The tune command analyzes your hardware and shows all quantization options with their estimated memory usage, context room, and quality trade-offs.
mistralrs tune [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified, which supports all model types. See details.
Examples:
# Get balanced recommendations (shorthand)
mistralrs tune -m Qwen/Qwen3-4B
# Get quality-focused recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile quality
# Get fast inference recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile fast
# Output as JSON
mistralrs tune -m Qwen/Qwen3-4B --json
# Generate a TOML config file with recommendations
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml
Example Output (CUDA):
Tuning Analysis
===============
Model: Qwen/Qwen3-4B
Profile: Balanced
Backend: cuda
Total VRAM: 24.0 GB
Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant │ Est. Size │ VRAM % │ Context Room │ Quality │ Status │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB │ 35% │ 48k │ Baseline │ ✅ Fits │
│ Q8_0 │ 4.50 GB │ 19% │ 96k │ Near-lossless │ 🚀 Recommended │
│ Q6K │ 3.70 GB │ 15% │ 128k (max) │ Good │ ✅ Fits │
│ Q5K │ 3.20 GB │ 13% │ 128k (max) │ Good │ ✅ Fits │
│ Q4K │ 2.60 GB │ 11% │ 128k (max) │ Acceptable │ ✅ Fits │
│ Q3K │ 2.00 GB │ 8% │ 128k (max) │ Degraded │ ✅ Fits │
│ Q2K │ 1.50 GB │ 6% │ 128k (max) │ Degraded │ ✅ Fits │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘
Recommended Command
-------------------
mistralrs serve -m Qwen/Qwen3-4B --isq q8_0
[INFO] PagedAttention is available (mode: auto)
Example Output (Metal):
On macOS with Metal, the command recommends Apple Format Quantization (AFQ) types:
Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant │ Est. Size │ VRAM % │ Context Room │ Quality │ Status │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB │ 53% │ 24k │ Baseline │ ✅ Fits │
│ AFQ8 │ 4.50 GB │ 28% │ 56k │ Near-lossless │ 🚀 Recommended │
│ AFQ6 │ 3.70 GB │ 23% │ 64k │ Good │ ✅ Fits │
│ AFQ4 │ 2.60 GB │ 16% │ 128k (max) │ Acceptable │ ✅ Fits │
│ AFQ3 │ 2.00 GB │ 13% │ 128k (max) │ Degraded │ ✅ Fits │
│ AFQ2 │ 1.50 GB │ 9% │ 128k (max) │ Degraded │ ✅ Fits │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘
Status Legend:
- 🚀 Recommended: Best option for your profile and hardware
- ✅ Fits: Model fits entirely in GPU memory
- ⚠️ Hybrid: Model requires CPU offloading (slower due to PCIe bottleneck)
- ❌ Too Large: Model doesn’t fit even with CPU offload
Tune Options:
| Option | Default | Description |
|---|---|---|
--profile <PROFILE> | balanced | Tuning profile: quality, balanced, or fast |
--json | disabled | Output JSON instead of human-readable text |
--emit-config <PATH> | none | Emit a TOML config file with recommended settings |
doctor - System Diagnostics
Run comprehensive system diagnostics and environment checks. The doctor command helps identify configuration issues and validates your system is ready for inference.
mistralrs doctor [OPTIONS]
Examples:
# Run diagnostics
mistralrs doctor
# Output as JSON
mistralrs doctor --json
Checks Performed:
- CPU Extensions: AVX, AVX2, AVX-512, FMA support (x86 only; ARM shows NEON)
- Binary/Hardware Match: Validates CUDA/Metal features match detected hardware
- GPU Compute Capability: Reports compute version and Flash Attention v2/v3 compatibility
- Flash Attention Features: Warns if hardware supports FA but binary doesn’t have it enabled
- Hugging Face Connectivity: Tests connection and token validity using a gated model
- HF Cache: Verifies cache directory is writable
- Disk Space: Checks available storage
Options:
| Option | Description |
|---|---|
--json | Output JSON instead of human-readable text |
login - HuggingFace Authentication
Authenticate with HuggingFace Hub by saving your token to the local cache.
mistralrs login [OPTIONS]
Examples:
# Interactive login (prompts for token)
mistralrs login
# Provide token directly
mistralrs login --token hf_xxxxxxxxxxxxx
The token is saved to the standard HuggingFace cache location:
- Linux/macOS: ~/.cache/huggingface/token
- Windows: C:\Users\<user>\.cache\huggingface\token
If the HF_HOME environment variable is set, the token is saved to $HF_HOME/token.
Options:
| Option | Description |
|---|---|
--token <TOKEN> | Provide token directly (non-interactive) |
cache - Model Management
Manage the HuggingFace model cache. List cached models or delete specific models.
mistralrs cache <SUBCOMMAND>
Subcommands:
cache list
List all cached models with their sizes and last used times.
mistralrs cache list
Example output:
HuggingFace Model Cache
-----------------------
┌──────────────────────────┬──────────┬─────────────┐
│ Model │ Size │ Last Used │
├──────────────────────────┼──────────┼─────────────┤
│ Qwen/Qwen3-4B │ 8.5 GB │ today │
│ google/gemma-3-4b-it │ 6.2 GB │ 2 days ago │
│ meta-llama/Llama-3.2-3B │ 5.8 GB │ 1 week ago │
└──────────────────────────┴──────────┴─────────────┘
Total: 3 models, 20.5 GB
Cache directory: /home/user/.cache/huggingface/hub
cache delete
Delete a specific model from the cache.
mistralrs cache delete -m <MODEL_ID>
Examples:
# Delete a specific model
mistralrs cache delete -m Qwen/Qwen3-4B
# Delete a model with organization
mistralrs cache delete -m meta-llama/Llama-3.2-3B
bench - Performance Benchmarking
Run performance benchmarks to measure prefill and decode speeds.
mistralrs bench [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified.
Examples:
# Run default benchmark (512 prompt tokens, 128 generated tokens, 3 iterations)
mistralrs bench -m Qwen/Qwen3-4B
# Custom prompt and generation lengths
mistralrs bench -m Qwen/Qwen3-4B --prompt-len 1024 --gen-len 256
# More iterations for better statistics
mistralrs bench -m Qwen/Qwen3-4B --iterations 10
# With ISQ quantization
mistralrs bench -m Qwen/Qwen3-4B --isq q4k
Example output:
Benchmark Results
=================
Model: Qwen/Qwen3-4B
Iterations: 3
┌────────────────────────┬─────────────────┬─────────────────┐
│ Test │ T/s │ Latency │
├────────────────────────┼─────────────────┼─────────────────┤
│ Prefill (512 tokens) │ 2847.3 ± 45.2 │ 179.82 ms (TTFT)│
│ Decode (128 tokens) │ 87.4 ± 2.1 │ 11.44 ms/T │
└────────────────────────┴─────────────────┴─────────────────┘
- T/s: Tokens per second (throughput)
- Latency: For prefill, shows TTFT (Time To First Token) in milliseconds. For decode, shows ms per token.
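The two columns are directly related: in the sample above, TTFT ≈ 512 tokens ÷ 2847.3 T/s ≈ 0.180 s ≈ 179.8 ms, and decode latency ≈ 1000 ÷ 87.4 ≈ 11.4 ms per token.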
Options:
| Option | Default | Description |
|---|---|---|
--prompt-len <N> | 512 | Number of tokens in prompt (prefill test) |
--gen-len <N> | 128 | Number of tokens to generate (decode test) |
--iterations <N> | 3 | Number of benchmark iterations |
--warmup <N> | 1 | Number of warmup runs (discarded) |
The bench command also accepts all model loading options (ISQ, device mapping, etc.).
from-config - TOML Configuration
Run the CLI from a TOML configuration file. This is the recommended way to run multiple models simultaneously, including models of different types (e.g., text + vision + embedding).
See CLI_CONFIG.md for full TOML configuration format details.
mistralrs from-config --file <PATH>
Example:
mistralrs from-config --file config.toml
Multi-model example (config.toml):
command = "serve"
[server]
port = 1234
ui = true
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
[[models]]
kind = "vision"
model_id = "google/gemma-3-4b-it"
[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"
completions - Shell Completions
Generate shell completions for your shell.
mistralrs completions <SHELL>
Examples:
# Generate bash completions
mistralrs completions bash > ~/.local/share/bash-completion/completions/mistralrs
# Generate zsh completions
mistralrs completions zsh > ~/.zfunc/_mistralrs
# Generate fish completions
mistralrs completions fish > ~/.config/fish/completions/mistralrs.fish
Supported Shells: bash, zsh, fish, elvish, powershell
Model Types
auto
Auto-detect model type. This is the recommended option for most models and is used by default when you omit the explicit model type.
mistralrs run -m Qwen/Qwen3-4B
mistralrs serve -m Qwen/Qwen3-4B
The auto type supports text, vision, and other model types through automatic detection.
text
Explicit text generation model configuration.
mistralrs run text -m Qwen/Qwen3-4B
mistralrs serve text -m Qwen/Qwen3-4B
vision
Vision-language models that can process images and text.
mistralrs run vision -m google/gemma-3-4b-it
mistralrs serve vision -m google/gemma-3-4b-it
Vision Options:
| Option | Description |
|---|---|
--max-edge <SIZE> | Maximum edge length for image resizing (aspect ratio preserved) |
--max-num-images <N> | Maximum number of images per request |
--max-image-length <SIZE> | Maximum image dimension for device mapping |
diffusion
Image generation models using diffusion.
mistralrs run diffusion -m black-forest-labs/FLUX.1-schnell
mistralrs serve diffusion -m black-forest-labs/FLUX.1-schnell
speech
Speech synthesis models.
mistralrs run speech -m nari-labs/Dia-1.6B
mistralrs serve speech -m nari-labs/Dia-1.6B
embedding
Text embedding models. These do not support interactive mode but can be used with the HTTP server.
mistralrs serve embedding -m google/embeddinggemma-300m
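With the server running, embeddings can be requested over HTTP. A minimal sketch with Python's requests package (assuming the default port 1234 and an OpenAI-style /v1/embeddings route):
import requests

resp = requests.post(
    "http://localhost:1234/v1/embeddings",
    json={
        "model": "google/embeddinggemma-300m",  # the embedding model you served
        "input": ["task: query | text: superconductors"],
    },
)
print(len(resp.json()["data"][0]["embedding"]))  # embedding vector length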
Features
ISQ Quantization
In-situ quantization (ISQ) reduces model memory usage by quantizing weights at load time. See details about ISQ here.
Usage:
# Simple bit-width quantization
mistralrs run -m Qwen/Qwen3-4B --isq 4
mistralrs run -m Qwen/Qwen3-4B --isq 8
# GGML-style quantization
mistralrs run -m Qwen/Qwen3-4B --isq q4_0
mistralrs run -m Qwen/Qwen3-4B --isq q4_1
mistralrs run -m Qwen/Qwen3-4B --isq q4k
mistralrs run -m Qwen/Qwen3-4B --isq q5k
mistralrs run -m Qwen/Qwen3-4B --isq q6k
ISQ Organization:
# Use MOQE organization for potentially better quality
mistralrs run -m Qwen/Qwen3-4B --isq q4k --isq-organization moqe
UQFF Files
UQFF (Unified Quantized File Format) provides pre-quantized model files for faster loading.
Generate a UQFF file:
mistralrs quantize auto -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-q4k.uqff
Load from UQFF:
mistralrs run -m Qwen/Qwen3-4B --from-uqff qwen3-4b-q4k.uqff
Multiple UQFF files (semicolon-separated):
mistralrs run -m Qwen/Qwen3-4B --from-uqff "part1.uqff;part2.uqff"
PagedAttention
PagedAttention enables efficient memory management for the KV cache. It is automatically enabled on CUDA and disabled on Metal/CPU by default.
Control PagedAttention:
# Auto mode (default): enabled on CUDA, disabled on Metal/CPU
mistralrs serve -m Qwen/Qwen3-4B --paged-attn auto
# Force enable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn on
# Force disable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn off
Memory allocation options (mutually exclusive):
# Allocate for specific context length (recommended)
mistralrs serve -m Qwen/Qwen3-4B --pa-context-len 8192
# Allocate specific GPU memory in MB
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-mb 4096
# Allocate fraction of GPU memory (0.0-1.0)
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-fraction 0.8
Additional options:
| Option | Description |
|---|---|
--pa-block-size <SIZE> | Tokens per block (default: 32 on CUDA) |
--pa-cache-type <TYPE> | KV cache quantization type (default: auto) |
Device Mapping
Control how model layers are distributed across devices.
Automatic mapping:
# Use defaults (automatic)
mistralrs run -m Qwen/Qwen3-4B
Manual layer assignment:
# Assign 10 layers to GPU 0, 20 layers to GPU 1
mistralrs run -m Qwen/Qwen3-4B -n "0:10;1:20"
# Equivalent long form
mistralrs run -m Qwen/Qwen3-4B --device-layers "0:10;1:20"
CPU-only execution:
mistralrs run -m Qwen/Qwen3-4B --cpu
Topology file:
mistralrs run -m Qwen/Qwen3-4B --topology topology.yaml
Custom HuggingFace cache:
mistralrs run -m Qwen/Qwen3-4B --hf-cache /path/to/cache
Device mapping options:
| Option | Default | Description |
|---|---|---|
-n, --device-layers <MAPPING> | auto | Device layer mapping (format: ORD:NUM;...) |
--topology <PATH> | none | Topology YAML file for device mapping |
--hf-cache <PATH> | none | Custom HuggingFace cache directory |
--cpu | disabled | Force CPU-only execution |
--max-seq-len <LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <SIZE> | 1 | Max batch size for automatic device mapping |
LoRA and X-LoRA
Apply LoRA or X-LoRA adapters to models.
LoRA:
# Single LoRA adapter
mistralrs run -m Qwen/Qwen3-4B --lora my-lora-adapter
# Multiple LoRA adapters (semicolon-separated)
mistralrs run -m Qwen/Qwen3-4B --lora "adapter1;adapter2"
X-LoRA:
# X-LoRA adapter with ordering file
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json
# With target non-granular index
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json --tgt-non-granular-index 2
Chat Templates
Override the model’s default chat template.
Use a template file:
# JSON template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.json
# Jinja template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.jinja
Explicit Jinja override:
mistralrs run -m Qwen/Qwen3-4B --jinja-explicit custom.jinja
Web Search
Enable web search capabilities (requires an embedding model).
# Enable search with default embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search
# Specify embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search --search-embedding-model embedding-gemma
Thinking Mode
Enable thinking/reasoning mode for models that support it (like DeepSeek, Qwen3).
mistralrs run -m Qwen/Qwen3-4B --enable-thinking
In interactive mode, thinking content is displayed in gray text before the final response.
Global Options
These options apply to all commands.
| Option | Default | Description |
|---|---|---|
--seed <SEED> | none | Random seed for reproducibility |
-l, --log <PATH> | none | Log all requests and responses to file |
--token-source <SOURCE> | cache | HuggingFace authentication token source |
-V, --version | N/A | Print version information and exit |
-h, --help | N/A | Print help message (use with any subcommand) |
Token source formats:
- cache - Use cached HuggingFace token (default)
- literal:<token> - Use literal token value
- env:<var> - Read token from environment variable
- path:<file> - Read token from file
- none - No authentication
Examples:
# Set random seed
mistralrs run -m Qwen/Qwen3-4B --seed 42
# Enable logging
mistralrs run -m Qwen/Qwen3-4B --log requests.log
# Use token from environment variable
mistralrs run -m meta-llama/Llama-3.2-3B-Instruct --token-source env:HF_TOKEN
Runtime Options
These options are available for both run and serve commands.
| Option | Default | Description |
|---|---|---|
--max-seqs <N> | 32 | Maximum concurrent sequences |
--no-kv-cache | disabled | Disable KV cache entirely |
--prefix-cache-n <N> | 16 | Number of prefix caches to hold (0 to disable) |
-c, --chat-template <PATH> | none | Custom chat template file (.json or .jinja) |
-j, --jinja-explicit <PATH> | none | Explicit JINJA template override |
--enable-search | disabled | Enable web search |
--search-embedding-model <MODEL> | none | Embedding model for search |
Model Source Options
These options are common across model types.
| Option | Description |
|---|---|
-m, --model-id <ID> | HuggingFace model ID or local path (required) |
-t, --tokenizer <PATH> | Path to local tokenizer.json file |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) |
--dtype <TYPE> | Model data type (default: auto) |
Format Options
For loading quantized models.
| Option | Description |
|---|---|
--format <FORMAT> | Model format: plain, gguf, or ggml (auto-detected) |
-f, --quantized-file <FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated) |
--tok-model-id <ID> | Model ID for tokenizer when using quantized format |
--gqa <VALUE> | GQA value for GGML models (default: 1) |
Examples:
# Load a GGUF model
mistralrs run -m Qwen/Qwen3-4B --format gguf -f model.gguf
# Multiple GGUF files
mistralrs run -m Qwen/Qwen3-4B --format gguf -f "model-part1.gguf;model-part2.gguf"
Interactive Commands
When running in interactive mode (mistralrs run), the following commands are available:
| Command | Description |
|---|---|
\help | Display help message |
\exit | Quit interactive mode |
\system <message> | Add a system message without running the model |
\clear | Clear the chat history |
\temperature <float> | Set sampling temperature (0.0 to 2.0) |
\topk <int> | Set top-k sampling value (>0) |
\topp <float> | Set top-p sampling value (0.0 to 1.0) |
Examples:
> \system Always respond as a pirate.
> \temperature 0.7
> \topk 50
> Hello!
Ahoy there, matey! What brings ye to these waters?
> \clear
> \exit
Vision Model Interactive Mode:
For vision models, you can include images in your prompts by specifying file paths or URLs:
> Describe this image: /path/to/image.jpg
> Compare these images: image1.png image2.png
> Describe the image and transcribe the audio: photo.jpg recording.mp3
Note: The CLI automatically detects paths to supported image and audio files within your prompt. You do not need special syntax; simply paste the absolute or relative path to the file.
Supported image formats: PNG, JPEG, BMP, GIF, WebP
Supported audio formats: WAV, MP3, FLAC, OGG
mistralrs-cli TOML Config
mistralrs-cli can run entirely from a single TOML configuration file. This config supports multiple models and mirrors the CLI options.
Usage
mistralrs from-config --file path/to/config.toml
Quick Example
command = "serve"
[server]
port = 1234
ui = true
[runtime]
max_seqs = 32
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
[models.quantization]
in_situ_quant = "q4k"
Complete Reference
Top-Level Options
| Option | Commands | Description |
|---|---|---|
command | all | Required. Either "serve" or "run" |
enable_thinking | run | Enable thinking mode (default: false) |
default_model_id | serve | Default model ID for API requests (must match a model_id in [[models]]) |
[global] Section
Global options that apply to the entire run.
| Option | Default | Description |
|---|---|---|
seed | none | Random seed for reproducibility |
log | none | Log all requests/responses to this file path |
token_source | "cache" | HuggingFace auth: "cache", "none", "literal:<token>", "env:<var>", "path:<file>" |
[server] Section (serve only)
HTTP server configuration.
| Option | Default | Description |
|---|---|---|
port | 1234 | HTTP server port |
host | "0.0.0.0" | Bind address |
ui | false | Serve built-in web UI at /ui |
mcp_port | none | MCP protocol server port (enables MCP if set) |
mcp_config | none | MCP client configuration file path |
[runtime] Section
Runtime inference options.
| Option | Default | Description |
|---|---|---|
max_seqs | 32 | Maximum concurrent sequences |
no_kv_cache | false | Disable KV cache entirely |
prefix_cache_n | 16 | Number of prefix caches to hold (0 to disable) |
chat_template | none | Custom chat template file (.json or .jinja) |
jinja_explicit | none | Explicit JINJA template override |
enable_search | false | Enable web search |
search_embedding_model | none | Embedding model for search (e.g., "embedding-gemma") |
[paged_attn] Section
PagedAttention configuration.
| Option | Default | Description |
|---|---|---|
mode | "auto" | "auto" (CUDA on, Metal off), "on", or "off" |
context_len | none | Allocate KV cache for this context length |
memory_mb | none | GPU memory to allocate in MB (conflicts with context_len) |
memory_fraction | none | GPU memory utilization 0.0-1.0 (conflicts with above) |
block_size | 32 | Tokens per block |
cache_type | "auto" | KV cache type |
Note: If none of context_len, memory_mb, or memory_fraction is specified, PagedAttention defaults to 90% of available VRAM. These three options are mutually exclusive.
[[models]] Section
Define one or more models. Each [[models]] entry creates a new model.
Top-Level Model Options
| Option | Required | Description |
|---|---|---|
kind | yes | Model type: "auto", "text", "vision", "diffusion", "speech", "embedding" |
model_id | yes | HuggingFace model ID or local path |
tokenizer | no | Path to local tokenizer.json |
arch | no | Model architecture (auto-detected if not specified) |
dtype | "auto" | Data type: "auto", "f16", "bf16", "f32" |
chat_template | no | Per-model chat template override |
jinja_explicit | no | Per-model JINJA template override |
[models.format] - Model Format
| Option | Default | Description |
|---|---|---|
format | auto | "plain" (safetensors), "gguf", or "ggml" |
quantized_file | none | Quantized filename(s) for GGUF/GGML (semicolon-separated) |
tok_model_id | none | Model ID for tokenizer when using quantized format |
gqa | 1 | GQA value for GGML models |
[models.adapter] - LoRA/X-LoRA
| Option | Description |
|---|---|
lora | LoRA adapter ID(s), semicolon-separated |
xlora | X-LoRA adapter ID (conflicts with lora) |
xlora_order | X-LoRA ordering JSON file (requires xlora) |
tgt_non_granular_index | Target non-granular index for X-LoRA |
[models.quantization] - ISQ/UQFF
| Option | Description |
|---|---|
in_situ_quant | ISQ level: "4", "8", "q4_0", "q4k", "q6k", etc. |
from_uqff | UQFF file(s) to load (semicolon-separated) |
isq_organization | ISQ strategy: "default" or "moqe" |
imatrix | imatrix file for enhanced quantization |
calibration_file | Calibration file for imatrix generation |
[models.device] - Device Mapping
| Option | Default | Description |
|---|---|---|
cpu | false | Force CPU-only (must be consistent across all models) |
device_layers | auto | Layer mapping as a list of "ORD:NUM" entries, e.g., ["0:10", "1:20"] |
topology | none | Topology YAML file |
hf_cache | none | Custom HuggingFace cache directory |
max_seq_len | 4096 | Max sequence length for auto device mapping |
max_batch_size | 1 | Max batch size for auto device mapping |
[models.vision] - Vision Options
| Option | Description |
|---|---|
max_edge | Maximum edge length for image resizing |
max_num_images | Maximum images per request |
max_image_length | Maximum image dimension for device mapping |
Full Examples
Multi-Model Server with UI
command = "serve"
[global]
seed = 42
[server]
host = "0.0.0.0"
port = 1234
ui = true
[runtime]
max_seqs = 32
enable_search = true
search_embedding_model = "embedding-gemma"
[paged_attn]
mode = "auto"
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
dtype = "auto"
[models.quantization]
in_situ_quant = "q4k"
[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"
[models.vision]
max_num_images = 4
[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"
Interactive Mode with Thinking
command = "run"
enable_thinking = true
[runtime]
max_seqs = 16
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
GGUF Model
command = "serve"
[server]
port = 1234
[[models]]
kind = "text"
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
[models.format]
format = "gguf"
quantized_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
Device Layer Mapping
command = "serve"
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.1-70B-Instruct"
[models.device]
device_layers = ["0:40", "1:40"]
[models.quantization]
in_situ_quant = "q4k"
Notes
- cpu must be consistent across all models if specified
- default_model_id (serve only) must match a model_id in [[models]]
- search_embedding_model requires enable_search = true
Troubleshooting
Common issues and solutions for mistral.rs.
Debug Mode
Enable debug mode for more information:
MISTRALRS_DEBUG=1 mistralrs run -m <model>
Debug mode causes:
- If loading a GGUF or GGML model, outputs a file containing the names, shapes, and types of each tensor:
mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt
- Increased logging verbosity
System Diagnostics
Run the built-in diagnostics tool:
mistralrs doctor
This checks your system configuration and reports any issues.
Common Issues
CUDA Issues
Setting the CUDA compiler path:
- Set the NVCC_CCBIN environment variable during build
Error: recompile with -fPIE:
- Some Linux distributions require compiling with -fPIE
- Set during build: CUDA_NVCC_FLAGS=-fPIE cargo build --release --features cuda
Error: CUDA_ERROR_NOT_FOUND or symbol not found:
- For non-quantized models, specify the data type to load and run in
- Use one of f32, f16, bf16, or auto (auto chooses based on device)
- Example: mistralrs run -m <model> --dtype auto
Minimum CUDA compute capability:
- The minimum supported CUDA compute cap is 5.3
- Set a specific compute cap with:
CUDA_COMPUTE_CAP=80 cargo build --release --features cuda
Metal Issues (macOS)
Metal not found (error: unable to find utility “metal”):
1. Install Xcode: xcode-select --install
2. Set the active developer directory: sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
error: cannot execute tool ‘metal’ due to missing Metal toolchain
- Install Metal Toolchain:
xcodebuild -downloadComponent MetalToolchain
Disabling Metal kernel precompilation:
- By default, Metal kernels are precompiled during build time for better performance
- To skip precompilation (useful for CI or when Metal is not needed):
MISTRALRS_METAL_PRECOMPILE=0 cargo build --release --features metal
Memory Issues
Disabling mmap loading:
- Set MISTRALRS_NO_MMAP=1 to disable memory-mapped file loading
- Forces all tensor data into memory
- Useful if you’re seeing mmap-related errors
Out of memory errors:
- Try using quantization: --isq q4k or --isq q8_0
- Use device mapping to offload layers: -n "0:16;cpu:16"
- Reduce context length with PagedAttention: --pa-context-len 4096
Model Loading Issues
Model type not auto-detected:
- If auto-detection fails, please raise an issue
- You can manually specify the architecture if needed
Chat template issues:
- Templates are usually auto-detected
- Override with:
-c /path/to/template.jinja - See Chat Templates for details
Getting Help
If you’re still stuck:
- Discord - Community support
- Matrix - Alternative chat
- GitHub Issues - Bug reports and feature requests
When reporting issues, please include:
- Output of mistralrs doctor
- Command you ran
- Hardware (GPU model, OS)
mistralrs Python SDK
Documentation for the mistralrs Python package.
Installation: See PYTHON_INSTALLATION.md for installation instructions.
Table of contents
- Full API reference: here
- Model configuration (Which enum): here
- MCP Client Configuration: here
- Example: here
- Embeddings example: here
Which
Each *_model_id may be a HF hub repo or a local path. For quantized GGUF models, a list is accepted if multiple files must be specified.
Architecture for plain models
If you do not specify the architecture, an attempt will be made to use the model’s config. If this fails, please raise an issue.
Mistral, Gemma, Mixtral, Llama, Phi2, Phi3, Qwen2, Gemma2, GLM4, Starcoder2, Phi3_5MoE, DeepseekV2, DeepseekV3, Qwen3, Qwen3Moe, SmolLm3, GraniteMoeHybrid, GptOss
ISQ Organization
- Default
- MoQE: if applicable, only quantize MoE experts. https://arxiv.org/abs/2310.02410
Architecture for vision models
Phi3V, Idefics2, LLaVaNext, LLaVa, VLlama, Qwen2VL, Idefics3, MiniCpmO, Phi4MM, Qwen2_5VL, Gemma3, Mistral3, Llama4, Gemma3n, Qwen3VL
Architecture for diffusion models
Flux, FluxOffloaded
Architecture for speech models
Dia
Architecture for embedding models
EmbeddingGemma, Qwen3Embedding
Note: from_uqff specifies a UQFF path to load from. If provided, this takes precedence over applying ISQ. Specify multiple files using a semicolon delimiter (;).
Note: enable_thinking enables thinking for models that support it.
Note: truncate_sequence=True trims prompts that would otherwise exceed the model’s maximum context length. Leave it False to receive a validation error instead.
class Which(Enum):
@dataclass
class Plain:
model_id: str
arch: Architecture | None = None
tokenizer_json: str | None = None
topology: str | None = None
organization: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
calibration_file: str | None = None
imatrix: str | None = None
hf_cache_path: str | None = None
@dataclass
class XLora:
xlora_model_id: str
order: str
arch: Architecture | None = None
model_id: str | None = None
tokenizer_json: str | None = None
tgt_non_granular_index: int | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
hf_cache_path: str | None = None
@dataclass
class Lora:
adapter_model_id: str
arch: Architecture | None = None
model_id: str | None = None
tokenizer_json: str | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
hf_cache_path: str | None = None
@dataclass
class GGUF:
quantized_model_id: str
quantized_filename: str | list[str]
tok_model_id: str | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class XLoraGGUF:
quantized_model_id: str
quantized_filename: str | list[str]
xlora_model_id: str
order: str
tok_model_id: str | None = None
tgt_non_granular_index: int | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class LoraGGUF:
quantized_model_id: str
quantized_filename: str | list[str]
adapters_model_id: str
order: str
tok_model_id: str | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class GGML:
quantized_model_id: str
quantized_filename: str
tok_model_id: str | None = None
tokenizer_json: str | None = None
gqa: int | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class XLoraGGML:
quantized_model_id: str
quantized_filename: str
xlora_model_id: str
order: str
tok_model_id: str | None = None
tgt_non_granular_index: int | None = None
tokenizer_json: str | None = None
gqa: int | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class LoraGGML:
quantized_model_id: str
quantized_filename: str
adapters_model_id: str
order: str
tok_model_id: str | None = None
tokenizer_json: str | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class Embedding:
model_id: str
arch: EmbeddingArchitecture | None = None
tokenizer_json: str | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
hf_cache_path: str | None = None
@dataclass
class VisionPlain:
model_id: str
arch: VisionArchitecture
tokenizer_json: str | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
max_edge: int | None = None
auto_map_params: VisionAutoMapParams | None = (None,)
calibration_file: str | None = None
imatrix: str | None = None
hf_cache_path: str | None = None
@dataclass
class DiffusionPlain:
model_id: str
arch: DiffusionArchitecture
dtype: ModelDType = ModelDType.Auto
@dataclass
class Speech:
model_id: str
arch: DiffusionArchitecture
dac_model_id: str | None = None
dtype: ModelDType = ModelDType.Auto
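For a plain (safetensors) text model, a minimal Runner might look like the following sketch, combining Which.Plain with in-situ quantization (mirroring the GGUF example later in this document):
import mistralrs

runner = mistralrs.Runner(
    which=mistralrs.Which.Plain(
        model_id="Qwen/Qwen3-4B",
    ),
    in_situ_quant="Q4K",  # optional; a UQFF file passed via from_uqff takes precedence over ISQ
)
res = runner.send_chat_completion_request(
    mistralrs.ChatCompletionRequest(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
)
print(res.choices[0].message.content)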
Multi-model Support
The mistralrs Python SDK supports running multiple models using the Runner class with the model_id parameter. All request methods accept an optional model_id to target a specific model. When model_id is None or omitted, the default model is used. If aliases are configured (for example via the server config or Rust MultiModelBuilder), list_models() will return those aliases and you can pass them in requests; canonical pipeline names remain accepted.
Basic Usage with model_id
import mistralrs
# Create a Runner with a vision model (Gemma 3 4B)
runner = mistralrs.Runner(
which=mistralrs.Which.VisionPlain(
model_id="google/gemma-3-4b-it",
arch=mistralrs.VisionArchitecture.Gemma3,
),
in_situ_quant="Q4K",
)
# List available models (model IDs are registered IDs, aliases if configured)
models = runner.list_models()
print(f"Available models: {models}") # ["google/gemma-3-4b-it"]
# Send request to specific model using model_id parameter
response = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=100
),
model_id="google/gemma-3-4b-it" # Target specific model
)
# Send request without model_id (uses default model)
response = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=100
)
)
Multi-model Management
# List available models
models = runner.list_models()
print(f"Available models: {models}")
# Get/set default model
default_model = runner.get_default_model_id()
print(f"Default model: {default_model}")
# Change default model (model must be loaded)
runner.set_default_model_id("google/gemma-3-4b-it")
# List models with their status
models_with_status = runner.list_models_with_status()
for model_id, status in models_with_status:
print(f"{model_id}: {status}") # status is "loaded", "unloaded", or "reloading"
Model Unloading and Reloading
You can unload models to free memory and reload them on demand:
model_id = "google/gemma-3-4b-it"
# Check if model is loaded
is_loaded = runner.is_model_loaded(model_id)
print(f"Model loaded: {is_loaded}")
# List models with their status
models_with_status = runner.list_models_with_status()
for mid, status in models_with_status:
print(f"{mid}: {status}")
# Unload a model to free memory (preserves configuration for reload)
runner.unload_model(model_id)
# Check status after unload
is_loaded = runner.is_model_loaded(model_id)
print(f"Model loaded after unload: {is_loaded}") # False
# Manually reload a model
runner.reload_model(model_id)
# Auto-reload: sending a request to an unloaded model will reload it automatically
response = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}]
),
model_id=model_id # Will auto-reload if unloaded
)
Request Methods with model_id
All request methods accept an optional model_id parameter:
# Chat completion
response = runner.send_chat_completion_request(request, model_id="model-id")
# Completion
response = runner.send_completion_request(request, model_id="model-id")
# Embeddings
embeddings = runner.send_embedding_request(request, model_id="model-id")
# Image generation
image = runner.generate_image(prompt, response_format, model_id="model-id")
# Audio generation
audio = runner.generate_audio(prompt, model_id="model-id")
# Tokenization
tokens = runner.tokenize_text(text, add_special_tokens=True, model_id="model-id")
text = runner.detokenize_text(tokens, skip_special_tokens=True, model_id="model-id")
When model_id is None or omitted, the default model is used.
Server Configuration
For server-based multi-model deployment, see the multi-model documentation.
MCP Client
The mistralrs Python SDK now supports Model Context Protocol (MCP) clients, enabling AI assistants to connect to and interact with external tools and resources through standardized server interfaces.
MCP Server Configuration
Configure MCP servers using McpServerConfigPy:
# HTTP-based MCP server with Bearer token authentication
http_server = mistralrs.McpServerConfigPy(
id="web_search",
name="Web Search MCP",
source=mistralrs.McpServerSourcePy.Http(
url="https://api.example.com/mcp",
timeout_secs=30,
headers={"X-API-Version": "v1"} # Optional additional headers
),
enabled=True,
tool_prefix="web", # Prefixes tool names to avoid conflicts
resources=None,
bearer_token="your-api-token" # Automatically added as Authorization header
)
# Process-based MCP server for local tools
process_server = mistralrs.McpServerConfigPy(
id="filesystem",
name="Filesystem MCP",
source=mistralrs.McpServerSourcePy.Process(
command="mcp-server-filesystem",
args=["--root", "/tmp"],
work_dir=None,
env={"MCP_LOG_LEVEL": "debug"} # Optional environment variables
),
enabled=True,
tool_prefix="fs",
resources=["file://**"], # Resource patterns this client is interested in
bearer_token=None # Process servers typically don't need authentication
)
# WebSocket-based MCP server for real-time communication
websocket_server = mistralrs.McpServerConfigPy(
id="realtime_data",
name="Real-time Data MCP",
source=mistralrs.McpServerSourcePy.WebSocket(
url="wss://realtime.example.com/mcp",
timeout_secs=60,
headers=None
),
enabled=True,
tool_prefix="rt",
resources=None,
bearer_token="websocket-token" # WebSocket Bearer token support
)
MCP Client Configuration
Configure the MCP client using McpClientConfigPy:
mcp_config = mistralrs.McpClientConfigPy(
servers=[http_server, process_server, websocket_server],
auto_register_tools=True, # Automatically discover and register tools
tool_timeout_secs=30, # Timeout for individual tool calls
max_concurrent_calls=5 # Maximum concurrent tool calls across all servers
)
Integration with Runner
Pass the MCP client configuration to the Runner:
runner = mistralrs.Runner(
which=mistralrs.Which.GGUF(
tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
),
mcp_client_config=mcp_config # MCP tools automatically registered
)
When auto_register_tools=True, the MCP client will:
- Connect to all enabled MCP servers
- Discover available tools from each server
- Register them for automatic tool calling with appropriate prefixes
- Make them available during model conversations
MCP Transport Types
-
HTTP Transport: Best for public APIs, RESTful services, servers behind load balancers. Supports SSE (Server-Sent Events) and standard HTTP semantics.
-
Process Transport: Best for local tools, development servers, sandboxed environments. Provides process isolation with no network overhead.
-
WebSocket Transport: Best for interactive applications, real-time data, low-latency requirements. Supports persistent connections and server-initiated notifications.
Authentication
- Bearer Tokens: Automatically added as
Authorization: Bearer <token>header for HTTP and WebSocket connections - Custom Headers: Additional headers can be specified for API keys, versioning, etc.
- Process Servers: Typically don’t require authentication as they run locally
Example
from mistralrs import Runner, Which, ChatCompletionRequest
runner = Runner(
which=Which.GGUF(
tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
)
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[{"role":"user", "content":"Tell me a story about the Rust type system."}],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Embeddings example
from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which
runner = Runner(
which=Which.Embedding(
model_id="google/embeddinggemma-300m",
arch=EmbeddingArchitecture.EmbeddingGemma,
)
)
embeddings = runner.send_embedding_request(
EmbeddingRequest(
input=[
"task: query | text: superconductors",
"task: query | text: graphene",
],
truncate_sequence=True,
)
)
print(len(embeddings), len(embeddings[0]))
# Swap the model_id and arch below to load Qwen/Qwen3-Embedding-0.6B instead:
# Runner(
# which=Which.Embedding(
# model_id="Qwen/Qwen3-Embedding-0.6B",
# arch=EmbeddingArchitecture.Qwen3Embedding,
# )
# )
Python SDK Installation
Quick Install from PyPI (Recommended)
Pre-built wheels are available for common platforms. Choose the package that matches your hardware:
| Hardware | Install Command |
|---|---|
| Recommended (auto-optimized) | pip install mistralrs |
| NVIDIA GPUs (CUDA) | pip install mistralrs-cuda |
| Apple Silicon (Metal) | pip install mistralrs-metal |
| Apple Accelerate | pip install mistralrs-accelerate |
| Intel CPUs (MKL) | pip install mistralrs-mkl |
Platform-Specific Optimizations
The mistralrs base package includes platform-specific optimizations:
- macOS Apple Silicon: Metal GPU support built-in
- Linux/Windows x86_64: Intel MKL optimizations built-in
- Linux aarch64: CPU-only (use mistralrs-cuda for GPU support)
All packages install the mistralrs Python module. The package suffix controls which accelerator features are enabled.
Supported Platforms
| Package | Linux x86_64 | Linux aarch64 | Windows x86_64 | macOS aarch64 |
|---|---|---|---|---|
| mistralrs | MKL | CPU | MKL | Metal |
| mistralrs-cuda | CUDA | CUDA | CUDA | - |
| mistralrs-metal | - | - | - | Metal |
| mistralrs-accelerate | - | - | - | Accelerate |
| mistralrs-mkl | MKL | - | MKL | - |
Python version: 3.10+ (wheels use abi3 for forward compatibility)
Windows Requirements
It is recommended to use WSL2 on Windows machines.
On Windows, additional runtime dependencies may be required:
- CUDA packages: Install the NVIDIA CUDA Toolkit and ensure the bin directory is in your PATH
- MKL packages: Install the Intel oneAPI Math Kernel Library runtime
# Example: Install with CUDA support
pip install mistralrs-cuda -v
Build from Source
Building from source gives you access to the latest features and allows customization of build options.
Prerequisites
- Install system packages:

  Ubuntu/Debian:
  sudo apt install libssl-dev pkg-config

  macOS:
  brew install openssl pkg-config

- Install Rust from https://rustup.rs/:

  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source $HOME/.cargo/env

- (Optional) Set up HuggingFace authentication for gated models:

  mkdir -p ~/.cache/huggingface
  echo "YOUR_HF_TOKEN" > ~/.cache/huggingface/token

  Or use huggingface-cli login.
Build Steps
- Clone the repository:

  git clone https://github.com/EricLBuehler/mistral.rs.git
  cd mistral.rs/mistralrs-pyo3

- Create and activate a virtual environment:

  python -m venv .venv
  source .venv/bin/activate   # Linux/macOS
  # or: .venv\Scripts\activate  # Windows

- Install maturin (Rust + Python build tool):

  pip install maturin[patchelf]

- Build and install:

  maturin develop -r --features <your-features>
Feature Flags
| Feature | Description |
|---|---|
| cuda | NVIDIA GPU support |
| flash-attn | Flash Attention (CUDA, Ampere+) |
| flash-attn-v3 | Flash Attention v3 (CUDA, Hopper) |
| cudnn | cuDNN optimizations |
| metal | Apple Silicon GPU (macOS only) |
| accelerate | Apple Accelerate framework |
| mkl | Intel MKL |
Example with CUDA and Flash Attention:
maturin develop -r --features "cuda flash-attn cudnn"
Verify Installation
import mistralrs
print(mistralrs.__version__)
Quick test:
from mistralrs import Runner, Which, ChatCompletionRequest
runner = Runner(
which=Which.Plain(model_id="Qwen/Qwen3-0.6B"),
)
response = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=50,
)
)
print(response.choices[0].message.content)
Next Steps
- SDK Documentation - Full SDK reference
- Examples - Python examples
- Cookbook - Interactive tutorial
HTTP server
Mistral.rs provides a lightweight OpenAI API compatible HTTP server based on axum. The request and response formats are supersets of the OpenAI API.
The API consists of the following endpoints. They can be viewed in your browser interactively by going to http://localhost:<port>/docs.
ℹ️ Besides the HTTP endpoints described below, mistralrs serve can also expose the same functionality via the MCP protocol. Enable it with --mcp-port <port> and see MCP/server.md for details.
Additional object keys
To support additional features, we have extended the completion and chat completion request objects. Both have the same keys added:
- top_k: int|null. If non-null, it is only relevant if positive.
- grammar: {"type": "regex" | "lark" | "json_schema" | "llguidance", "value": string} or null. Grammar to use. This is mutually exclusive with the OpenAI-compatible response_format.
- min_p: float|null. If non-null, it is only relevant if 1 >= min_p >= 0.
- enable_thinking: bool, default false. Enable thinking for models that support it.
- truncate_sequence: bool|null. When true, requests that exceed the model context length will be truncated instead of rejected; otherwise the server returns a validation error. Embedding requests truncate tokens at the end of the prompt, while chat/completion requests truncate tokens at the start of the prompt.
- repetition_penalty: float|null. Penalty for repeating tokens. This is distinct from frequency_penalty and presence_penalty: it applies a direct multiplicative penalty to repeated token logits.
- web_search_options: object|null. Enable web search integration (see WEB_SEARCH.md). Contains optional fields: search_context_size ("low", "medium", "high"), user_location (object with location info), search_description (override search tool description), extract_description (override extraction tool description).
- reasoning_effort: string|null. For Harmony-format models (like GPT-OSS), controls the depth of reasoning: "low", "medium", or "high".
- dry_multiplier: float|null. DRY (Don't Repeat Yourself) sampling multiplier. Controls the strength of the anti-repetition penalty.
- dry_base: float|null. DRY sampling base value.
- dry_allowed_length: int|null. DRY sampling allowed length before penalty applies.
- dry_sequence_breakers: array of strings|null. Tokens that reset the DRY penalty sequence.
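As an illustration (not an exhaustive reference), these extra keys can be passed from the Python openai client via its extra_body parameter; this sketch assumes a server running locally on port 1234 as in the examples below, and the values are arbitrary:

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

# Standard OpenAI fields go in as usual; mistral.rs-specific keys ride along in extra_body.
completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Summarize Rust lifetimes in one sentence."}],
    max_tokens=128,
    extra_body={
        "top_k": 40,                # only relevant if positive
        "min_p": 0.05,              # must satisfy 0 <= min_p <= 1
        "repetition_penalty": 1.1,  # multiplicative penalty on repeated tokens
        "enable_thinking": False,   # for models that support hybrid reasoning
    },
)
print(completion.choices[0].message.content)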
Response Extensions
The response objects include additional fields beyond the standard OpenAI API:
Harmony Mode Responses
For models using Harmony format (like GPT-OSS), responses may include additional reasoning content:
reasoning_content:string|null. Chain-of-thought reasoning from Harmony-format models. This field contains the model’s internal analysis and commentary that led to the final response. It is separate from the maincontentfield.
When streaming, reasoning_content appears in the delta object alongside content.
Example response:
{
"choices": [{
"message": {
"role": "assistant",
"content": "The answer is 42.",
"reasoning_content": "Let me analyze this step by step..."
}
}]
}
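When streaming from a Harmony-format model such as GPT-OSS, the extension field can be read defensively from each chunk's delta. A minimal sketch, assuming the Python openai client and a local server on port 1234:

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 6 * 7?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is a mistral.rs extension, so read it defensively.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"[reasoning] {reasoning}", end="")
    if delta.content:
        print(delta.content, end="")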
Model Parameter Validation
Mistral.rs validates that the model parameter in API requests matches the model that was actually loaded by the server. This ensures requests are processed by the correct model and prevents confusion.
Behavior:
- If the model parameter matches the loaded model name, the request proceeds normally
- If the model parameter doesn't match, the request fails with an error message indicating the mismatch
- The special model name "default" can be used to bypass this validation entirely
Examples:
- ✅ Request with "model": "meta-llama/Llama-3.2-3B-Instruct" when meta-llama/Llama-3.2-3B-Instruct is loaded → succeeds
- ❌ Request with "model": "gpt-4" when mistral-7b-instruct is loaded → fails
- ✅ Request with "model": "default" regardless of loaded model → always succeeds
Usage: Use "default" in the model field when you need to satisfy API clients that require a model parameter but don’t need to specify a particular model. This is demonstrated in all the examples below.
POST: /v1/chat/completions
Process an OpenAI compatible request, returning an OpenAI compatible response when finished. Please find the official OpenAI API documentation here. To control the interval at which keep-alive messages are sent, set the KEEP_ALIVE_INTERVAL environment variable to the desired time in ms.
To send a request with the Python openai library:
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1", # "http://<Your api-server IP>:port"
api_key = "EMPTY"
)
completion = client.chat.completions.create(
model="default",
messages=[
{"role": "system", "content": "You are Mistral.rs, an AI assistant."},
{"role": "user", "content": "Write a story about Rust error handling."}
]
)
print(completion.choices[0].message)
Or with curl:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"messages": [
{
"role": "system",
"content": "You are Mistral.rs, an AI assistant."
},
{
"role": "user",
"content": "Write a story about Rust error handling."
}
]
}'
A streaming request can also be created by setting "stream": true in the request JSON. Please see this guide.
ℹ️ Requests whose prompt exceeds the model's maximum context length now fail unless you opt in to truncation. Set "truncate_sequence": true to drop the oldest prompt tokens while reserving room (equal to max_tokens when provided, otherwise one token) for generation. Specifically, tokens from the front of the prompt are dropped.
GET: /v1/models
Returns the running models.
Example with curl:
curl http://localhost:<port>/v1/models
GET: / or /health
Returns the server health.
Example with curl:
curl http://localhost:<port>/health
GET: /docs
Returns OpenAPI API docs via SwaggerUI.
Example with curl:
curl http://localhost:<port>/docs
POST: /v1/completions
Process an OpenAI compatible completions request, returning an OpenAI compatible response when finished. Please find the official OpenAI API documentation here.
Completions-specific parameters
In addition to the common parameters listed above, the completions endpoint supports:
- best_of: int|null. Generate best_of completions server-side and return the best one (the one with the highest log probability per token). When used with n, best_of must be greater than n.
- echo: bool, default false. Echo back the prompt in addition to the completion.
To send a request with the Python openai library:
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1", # "http://<Your api-server IP>:port"
api_key = "EMPTY"
)
completion = client.completions.create(
model="default",
prompt="What is Rust?",
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
print(completion.choices[0].text)
Or with curl:
curl http://localhost:1234/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"prompt": "What is Rust?"
}'
ℹ️ The truncate_sequence flag behaves the same way for the completions endpoint: keep it false (default) to receive a validation error, or set it to true to trim the prompt automatically.
POST: /v1/embeddings
Serve an embedding model (for example, EmbeddingGemma) to enable this endpoint:
mistralrs serve -m google/embeddinggemma-300m
In multi-model mode, include an Embedding entry in your selector config to expose it alongside chat models.
Create vector embeddings via the OpenAI-compatible endpoint. Supported request fields:
- input: a single string, an array of strings, an array of token IDs ([123, 456]), or a batch of token arrays ([[...], [...]]).
- encoding_format: "float" (default) returns arrays of f32; "base64" returns Base64 strings.
- dimensions: currently unsupported; providing it yields a validation error.
- truncate_sequence: bool, default false. Set to true to clip over-length prompts instead of receiving a validation error.
ℹ️ Requests whose prompt exceeds the model’s maximum context length now fail unless you opt in to truncation. Embedding requests truncate tokens from the end of the prompt.
Example (Python openai client):
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1",
api_key="EMPTY",
)
result = client.embeddings.create(
model="default",
input=[
"Embeddings capture semantic relationships between texts.",
"What is graphene?",
],
    extra_body={"truncate_sequence": True},
)
for item in result.data:
print(item.index, len(item.embedding))
Example with curl:
curl http://localhost:1234/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"input": ["graphene conductivity", "superconductor basics"],
"encoding_format": "base64",
"truncate_sequence": false
}'
Responses follow the OpenAI schema: object: "list", data[*].embedding containing either float arrays or Base64 strings depending on encoding_format, and a usage block (prompt_tokens, total_tokens). At present those counters report 0 because token accounting for embeddings is not yet implemented.
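If you request encoding_format "base64", each embedding arrives as a Base64 string. A minimal client-side decoding sketch, assuming the bytes are little-endian f32 values (the same values the "float" format would return):

import base64
import struct

def decode_embedding(b64_embedding: str) -> list[float]:
    """Decode a Base64-encoded embedding into a list of f32 values."""
    raw = base64.b64decode(b64_embedding)
    count = len(raw) // 4  # 4 bytes per f32
    return list(struct.unpack(f"<{count}f", raw))

# Example: decode the first item of a base64-format embeddings response.
# vector = decode_embedding(result.data[0].embedding)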
POST: /v1/images/generations
Generate images using diffusion models (like FLUX). First, serve a diffusion model:
mistralrs serve -m black-forest-labs/FLUX.1-schnell
Supported request fields:
- model: Model identifier (use "default" to bypass validation)
- prompt: Text description of the image to generate
- n: Number of images to generate (default: 1)
- response_format: "url" or "b64_json" (default: "url")
- height: Image height in pixels (default: 720)
- width: Image width in pixels (default: 1280)
Example with Python:
import openai
import base64
client = openai.OpenAI(
base_url="http://localhost:1234/v1",
api_key="EMPTY",
)
response = client.images.generate(
model="default",
prompt="A majestic snow-covered mountain at sunset",
n=1,
response_format="b64_json",
size="1280x720", # width x height
)
# Save the generated image
image_data = base64.b64decode(response.data[0].b64_json)
with open("output.png", "wb") as f:
f.write(image_data)
Example with curl:
curl http://localhost:1234/v1/images/generations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"prompt": "A majestic snow-covered mountain at sunset",
"n": 1,
"response_format": "b64_json",
"height": 720,
"width": 1280
}'
POST: /v1/audio/speech
Generate speech from text using speech models (like Dia). First, serve a speech model:
mistralrs serve -m nari-labs/Dia-1.6B
Supported request fields:
- model: Model identifier (use "default" to bypass validation)
- input: Text to convert to speech. For Dia models, use speaker tags like [S1] and [S2] to control multiple voices
- response_format: "wav" or "pcm" (only these formats are supported)
Note: The voice and instructions fields from the OpenAI API are currently ignored.
Example with Python:
import requests
response = requests.post(
"http://localhost:1234/v1/audio/speech",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer EMPTY",
},
json={
"model": "default",
"input": "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!",
"response_format": "wav",
},
)
# Save the audio file
with open("output.wav", "wb") as f:
f.write(response.content)
Example with curl:
curl http://localhost:1234/v1/audio/speech \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"input": "[S1] Dia is an open weights text to dialogue model. [S2] Try it now!",
"response_format": "wav"
}' \
--output output.wav
The response is raw audio data with the appropriate Content-Type header (audio/wav for WAV format, audio/pcm for PCM format).
POST: /v1/responses
Create a response using the OpenAI-compatible Responses API. Please find the official OpenAI API documentation here.
To send a request with the Python openai library:
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1",
api_key = "EMPTY"
)
# First turn
resp1 = client.responses.create(
model="default",
input="Apples are delicious!"
)
print(resp1.output_text)
# Follow-up - no need to resend the first message
resp2 = client.responses.create(
model="default",
previous_response_id=resp1.id,
input="Can you eat them?"
)
print(resp2.output_text)
Or with curl:
curl http://localhost:1234/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"input": "Tell me about Rust programming"
}'
# Follow-up using previous_response_id
curl http://localhost:1234/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"previous_response_id": "resp_12345-uuid-here",
"input": "What makes it memory safe?"
}'
The API also supports multimodal inputs (images, audio) and streaming responses by setting "stream": true in the request JSON.
ℹ️ The Responses API forwards truncate_sequence to underlying chat completions. Enable it if you want over-length conversations to be truncated rather than rejected.
GET: /v1/responses/{response_id}
Retrieve a previously created response by its ID.
Example with curl:
curl http://localhost:1234/v1/responses/resp_12345-uuid-here \
-H "Authorization: Bearer EMPTY"
DELETE: /v1/responses/{response_id}
Delete a stored response and its associated conversation history.
Example with curl:
curl -X DELETE http://localhost:1234/v1/responses/resp_12345-uuid-here \
-H "Authorization: Bearer EMPTY"
POST: /re_isq
Reapply ISQ to the model if possible. Pass a JSON object with the key ggml_type mapped to a string specifying the quantization level.
Example with curl:
curl http://localhost:<port>/re_isq -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{"ggml_type":"4"}'
Model Management Endpoints
These endpoints allow dynamic management of loaded models, enabling you to free memory by unloading models and reload them on demand.
POST: /v1/models/unload
Unload a model from memory while preserving its configuration for later reload. The model can be reloaded manually or will auto-reload when a request is sent to it.
Request body:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "unloaded"
}
Example with curl:
curl -X POST http://localhost:1234/v1/models/unload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
POST: /v1/models/reload
Manually reload a previously unloaded model. This is also triggered automatically when a request is sent to an unloaded model.
Request body:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Example with curl:
curl -X POST http://localhost:1234/v1/models/reload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
POST: /v1/models/status
Get the current status of a specific model.
Request body:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Example with curl:
curl -X POST http://localhost:1234/v1/models/status \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Status Values
The status field in responses can be one of:
| Status | Description |
|---|---|
| loaded | Model is loaded and ready to serve requests |
| unloaded | Model is unloaded but can be reloaded |
| reloading | Model is currently being reloaded |
| not_found | Model ID not recognized |
| no_loader_config | Model cannot be reloaded (missing loader configuration) |
| internal_error | An internal error occurred (check error field for details) |
When an error occurs, the response may include an error field with additional details:
{
"model_id": "unknown-model",
"status": "not_found",
"error": null
}
Auto-Reload Behavior
When a request (e.g., chat completion) is sent to an unloaded model, the model will automatically reload before processing the request. This enables a “lazy loading” pattern where models are only loaded when needed, helping manage GPU memory efficiently.
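For example, the lazy-loading pattern can be exercised end-to-end with plain HTTP calls. A sketch using the requests library, reusing the server and model ID from the examples above:

import requests

BASE = "http://localhost:1234"
MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# Free memory while the model is idle.
requests.post(f"{BASE}/v1/models/unload", json={"model_id": MODEL}).raise_for_status()

# The next request to the unloaded model triggers an automatic reload first.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello again!"}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])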
Models List with Status
The /v1/models endpoint includes a status field for each model:
curl http://localhost:1234/v1/models
Response:
{
"object": "list",
"data": [
{
"id": "default",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "meta-llama/Llama-3.2-3B-Instruct",
"object": "model",
"created": 1234567890,
"owned_by": "local",
"status": "loaded"
}
]
}
OpenResponses API
mistral.rs supports the OpenResponses API specification.
Endpoints
- POST /v1/responses - Create a response
- GET /v1/responses/{id} - Retrieve a response
- DELETE /v1/responses/{id} - Delete a response
- POST /v1/responses/{id}/cancel - Cancel a background response
Unsupported Parameters
The following parameters are accepted for API compatibility but will return errors if set to non-default values:
| Parameter | Behavior |
|---|---|
| parallel_tool_calls | Only true or omitted is supported; false returns an error |
| max_tool_calls | Not supported; setting any value returns an error |
mistral.rs Extensions
These additional parameters are available beyond the spec:
- stop - Stop sequences
- repetition_penalty - Token repetition penalty
- top_k - Top-k sampling
- grammar - Constrained generation grammar
- min_p - Min-p sampling
- dry_multiplier, dry_base, dry_allowed_length, dry_sequence_breakers - DRY sampling
- web_search_options - Web search integration
See HTTP.md for usage examples.
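As a quick sketch, these extension parameters can be supplied through the Python openai client's extra_body (values here are illustrative; the server is assumed on port 1234):

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

resp = client.responses.create(
    model="default",
    input="Give me three facts about graphene.",
    extra_body={
        "top_k": 40,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
        "stop": ["\n\n"],
    },
)
print(resp.output_text)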
Supported Models
Complete reference for model support in mistral.rs.
Model Categories
Text Models
- Granite 4.0
- SmolLM 3
- DeepSeek V3
- GPT-OSS
- DeepSeek V2
- Qwen 3 MoE
- Phi 3.5 MoE
- Qwen 3
- GLM 4
- GLM-4.7-Flash
- GLM-4.7 (MoE)
- Gemma 2
- Qwen 2
- Starcoder 2
- Phi 3
- Mixtral
- Phi 2
- Gemma
- Llama
- Mistral
Vision Models
- Qwen 3-VL
- Gemma 3n
- Llama 4
- Gemma 3
- Mistral 3
- Phi 4 multimodal
- Qwen 2.5-VL
- MiniCPM-O
- Llama 3.2 Vision
- Qwen 2-VL
- Idefics 3
- Idefics 2
- LLaVA Next
- LLaVA
- Phi 3V
Speech Models
- Dia
Image Generation Models
- FLUX
Embedding Models
- Embedding Gemma
- Qwen 3 Embedding
Supported GGUF Architectures
Plain:
- llama
- phi2
- phi3
- starcoder2
- qwen2
- qwen3
With adapters:
- llama
- phi3
Quantization Support
| Model | GGUF | GGML | ISQ |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | ||
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | ✅ | |
| Phi 3 | ✅ | ✅ | |
| Phi 3.5 MoE | ✅ | ||
| Qwen 2.5 | ✅ | ||
| Phi 3 Vision | ✅ | ||
| Idefics 2 | ✅ | ||
| Gemma 2 | ✅ | ||
| GLM4 | ✅ | ||
| GLM-4.7-Flash (MoE) | ✅ | ||
| GLM-4.7 (MoE) | ✅ | ||
| Starcoder 2 | ✅ | ✅ | |
| LLaVa Next | ✅ | ||
| LLaVa | ✅ | ||
| Llama 3.2 Vision | ✅ | ||
| Qwen2-VL | ✅ | ||
| Idefics 3 | ✅ | ||
| Deepseek V2 | ✅ | ||
| Deepseek V3 | ✅ | ||
| MiniCPM-O 2.6 | ✅ | ||
| Qwen2.5-VL | ✅ | ||
| Gemma 3 | ✅ | ||
| Mistral 3 | ✅ | ||
| Llama 4 | ✅ | ||
| Qwen 3 | ✅ | ✅ | |
| SmolLM3 | ✅ | ||
| Dia 1.6b | ✅ | ||
| Gemma 3n | ✅ | ||
| Qwen 3 VL | ✅ | ||
| Granite 4.0 | ✅ | ||
| GPT-OSS | ✅ |
Device Mapping Support
| Model category | Supported |
|---|---|
| Plain | ✅ |
| GGUF | ✅ |
| GGML | |
| Vision Plain | ✅ |
X-LoRA and LoRA Support
| Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | ||
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | ||
| Phi 3 | ✅ | ✅ | |
| Phi 3.5 MoE | |||
| Qwen 2.5 | |||
| Phi 3 Vision | |||
| Idefics 2 | |||
| Gemma 2 | ✅ | ||
| GLM4 | ✅ | ||
| GLM-4.7-Flash (MoE) | |||
| GLM-4.7 (MoE) | |||
| Starcoder 2 | ✅ | ||
| LLaVa Next | |||
| LLaVa | |||
| Qwen2-VL | |||
| Idefics 3 | |||
| Deepseek V2 | |||
| Deepseek V3 | |||
| MiniCPM-O 2.6 | |||
| Qwen2.5-VL | |||
| Gemma 3 | |||
| Mistral 3 | |||
| Llama 4 | |||
| Qwen 3 | |||
| SmolLM3 | ✅ | ||
| Gemma 3n | |||
| Qwen 3 VL | |||
| Granite 4.0 | |||
| GPT-OSS |
AnyMoE Support
| Model | AnyMoE |
|---|---|
| Mistral 7B | ✅ |
| Gemma | ✅ |
| Llama | ✅ |
| Mixtral | |
| Phi 2 | ✅ |
| Phi 3 | ✅ |
| Phi 3.5 MoE | |
| Qwen 2.5 | ✅ |
| Phi 3 Vision | |
| Idefics 2 | |
| Gemma 2 | ✅ |
| GLM-4.7-Flash (MoE) | |
| GLM-4.7 (MoE) | |
| Starcoder 2 | ✅ |
| LLaVa Next | ✅ |
| LLaVa | ✅ |
| Llama 3.2 Vision | |
| Qwen2-VL | |
| Idefics 3 | ✅ |
| Deepseek V2 | |
| Deepseek V3 | |
| MiniCPM-O 2.6 | |
| Qwen2.5-VL | |
| Gemma 3 | ✅ |
| Mistral 3 | ✅ |
| Llama 4 | |
| Qwen 3 | |
| SmolLM3 | ✅ |
| Gemma 3n | |
| Qwen 3 VL | |
| Granite 4.0 | |
| GPT-OSS |
Using Derivative Models
Model type is auto-detected. Use flags for quantized models and adapters:
| Model Type | Required Arguments |
|---|---|
| Plain | -m <model-id> |
| GGUF Quantized | -m <model-id> --format gguf -f <file> |
| ISQ Quantized | -m <model-id> --isq <level> |
| UQFF Quantized | -m <model-id> --from-uqff <file> |
| LoRA | -m <model-id> --lora <adapter> |
| X-LoRA | -m <model-id> --xlora <adapter> --xlora-order <file> |
Example: Zephyr GGUF model
mistralrs serve -p 1234 --log output.txt --format gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
Chat Templates and Tokenizer
Mistral.rs will attempt to automatically load a chat template and tokenizer. This provides flexibility across models and ensures accurate chat templating. However, this behavior can be customized.
Vision model support in mistral.rs
Mistral.rs supports various modalities of models, including vision models. Vision models take images and text as input and have the capability to reason over both.
Please see docs for the following model types:
- Phi 3 Vision: PHI3V.md
- Idefics2: IDEFICS2.md
- LLaVA and LLaVANext: LLAVA.md
- Llama 3.2 Vision: VLLAMA.md
- Qwen2-VL: QWEN2VL.md
- Idefics 3 and Smol VLM: IDEFICS3.md
- Phi 4 Multimodal: PHI4MM.md
Note for the Python and HTTP APIs: We follow the OpenAI specification for structuring the image messages and allow both base64 encoded images as well as a URL/path to the image. There are many examples of this, see this Python example.
Image generation model support in mistral.rs
Mistral.rs supports various modalities of models, including image generation models. Image generation models take text as input and generate images.
Please see docs for the following model types:
- FLUX.1 FLUX.md
Embeddings Overview
Mistral.rs can load embedding models alongside chat, vision, diffusion, and speech workloads. Embedding models produce dense vector representations that you can use for similarity search, clustering, reranking, and other semantic tasks.
Supported models
| Model | Notes | Documentation |
|---|---|---|
| EmbeddingGemma | Google’s multilingual embedding model. | EMBEDDINGGEMMA.md |
| Qwen3 Embedding | Qwen’s general-purpose embedding encoder. | QWEN3_EMBEDDING.md |
Have another embedding model you would like supported? Open an issue with the model ID and configuration.
Usage overview
- Choose a model from the table above.
- Load it through one of our APIs:
- CLI/HTTP
- Python
- Rust
Detailed examples for each model live in their dedicated documentation pages.
DeepSeek V2: deepseek-ai/DeepSeek-V2-Lite
DeepSeek V2 is a mixture of experts (MoE) model featuring “Multi-head Latent Attention”.
- Context length of 32k tokens (Lite model), 128k tokens (full model)
- 64 routed experts (Lite model), 160 routed experts (full model)
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-V2-Lite
Note
This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-V2-Lite --isq-organization moqe
HTTP API
mistralrs serve --isq 4 -p 1234 -m deepseek-ai/DeepSeek-V2-Lite
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="deepseek-ai/DeepSeek-V2-Lite",
arch=Architecture.DeepseekV2,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-V2-Lite")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
DeepSeek V3: deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1
DeepSeek V3 is a mixture of experts (MoE) model.
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1
Note
The non-distill versions of the DeepSeek R1 models share the DeepSeek V3 architecture.
Note
This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1 --isq-organization moqe
Running the distill models
The various distillation models can be run out of the box.
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
HTTP API
mistralrs serve --isq 4 -p 1234 -m deepseek-ai/DeepSeek-R1
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="deepseek-ai/DeepSeek-R1",
arch=Architecture.DeepseekV3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-R1")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Gemma 2 Model
See the Gemma 2 model Collection
The Gemma 2 models are a family of text-to-text decoder-only LLMs. As such, the methods to use them are the same as with all other text-to-text LLMs supported by mistral.rs.
HTTP API
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="google/gemma-2-9b-it",
arch=Architecture.Gemma2,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Gemma 3 Model: google/gemma-3-4b-it
The Gemma 3 model is a family of multimodal (text+vision) models with 128k context length. The collection can be found here, with model sizes ranging from 4B to 27B.
We support the Gemma 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is this?
Output:
The image shows Mount Washington in New Hampshire, USA. It's a prominent peak in the White Mountains, known for its extreme weather conditions and being the highest peak in the Northeastern United States. The image captures it covered in snow with a dramatic sky above. The structures at the summit are communication towers.
The winding path visible on the mountain slopes appears to be part of the Mount Washington Auto Road, a historic road that allows vehicles to drive to the summit.
- Start the server
mistralrs serve vision -p 1234 -m google/gemma-3-12b-it
- Send a request
from openai import OpenAI
import httpx
import textwrap
import json
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Gemma 3 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("google/gemma-3-12b-it")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3-12b-it",
arch=VisionArchitecture.Gemma3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Gemma 3n Model: google/gemma-3n-E4B-it
Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio, and generating text outputs. These models support over 140 spoken languages.
The Gemma 3n Model has support in the Rust, Python, and HTTP APIs. Additionally, the Gemma 3n Model supports ISQ for increased performance.
- Full multimodal support: mistral.rs supports text, audio, and vision inputs to Gemma 3n!
- 🪆 MatFormer support: mistral.rs supports dynamically resizing the Gemma 3n model with the MatFormer architecture! Gemma 3n implements MatFormer, which allows one model to be resized dynamically to tune performance on resource-constrained systems. You can access this feature using the matformer_config_path (example config) and matformer_slice_name arguments throughout the APIs.
- Prequantized UQFF models are also available.
Using MatFormer with Gemma 3n
MatFormer allows you to dynamically adjust the model size based on your resource constraints. The Gemma 3n model comes with several pre-configured slices that offer different performance/resource trade-offs.
You can read more about MatFormer in mistral.rs here.
Available Slices
The default configuration file (matformer_configs/gemma3n.csv) includes:
- Main model (3.98B params, 35 layers) - Full model with best performance
- Config for official E2B Model (1.91B params, 30 layers) - Balanced performance/efficiency
- Various intermediate configurations from E1.96B to E3.79B with different layer and FFN configurations
Command Line Example
# Run with the E2.49B slice for balanced performance/efficiency
mistralrs run vision -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"
Python SDK Example
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
# Use the E2.49B slice for balanced performance/efficiency
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
matformer_config_path="matformer_configs/gemma3n.csv",
matformer_slice_name="Config for E2.49B (block-level)",
),
)
# The model will use 35 layers with mixed FFN dimensions (4096 for early layers, 8192 for middle)
# This results in ~37% parameter reduction while maintaining better performance than E2B
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="ignore",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What do you see in this image?",
},
],
}
],
max_tokens=100,
)
)
print(res.choices[0].message.content)
Rust SDK Example
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
use std::path::PathBuf;
#[tokio::main]
async fn main() -> Result<()> {
// Build model with MatFormer E2.49B configuration
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_isq(IsqType::Q4K)
.with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
.with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"Describe this image briefly.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
println!("Using E2.49B slice: 35 layers, 2.49B effective params");
Ok(())
}
Choosing the Right Slice
- Resource-constrained environments: Use “Config for official E2B Model” (1.91B params)
- Balanced performance: Try E2.49B to E2.98B configurations (block-level configs offer better balance)
- Maximum quality: Use “Main model” (3.98B params) or omit MatFormer configuration entirely
The slice selection allows you to:
- Reduce memory usage proportionally to the parameter count
- Speed up inference roughly linearly with the number of layers
- Maintain acceptable quality for many use cases with smaller slices
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
Please describe this image in detail.
Output:
The image captures a breathtaking, wide-angle view of a majestic mountain covered in a blanket of snow. The mountain dominates the frame, its peak reaching towards a partly cloudy sky. The snow cover is uneven, with patches of exposed dark rock and textured snow formations creating a visually interesting surface.
A winding, snow-covered path or road snakes its way up the mountainside, appearing as a bright white line against the darker slopes. This path draws the eye upwards towards the summit, where a few structures, possibly communication towers or observation points, are visible.
The lower slopes of the mountain are covered in a dense forest of evergreen trees, their dark green hues contrasting beautifully with the white snow. The forest extends down into a valley, hinting at a wider landscape beyond the frame.
The sky above is a mix of pale blue and soft grey clouds, with some darker, more dramatic cloud formations near the top of the mountain. The lighting suggests it might be early morning or late afternoon, casting subtle shadows across the mountain's surface and highlighting its contours.
The overall impression is one of grandeur, tranquility, and the raw beauty of a winter landscape. The scale of the mountain is impressive, and the winding path invites a sense of exploration and adventure.
- Start the server
mistralrs serve vision -p 1234 -m google/gemma-3n-E4B-it
# Or with MatFormer for balanced performance:
mistralrs serve vision -p 1234 -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="ignore",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "Please describe this image in detail.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Gemma 3n model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"Please describe the image in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="ignore",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "Please describe this image in detail.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
OpenAI HTTP API
Audio is delivered with the audio_url content-type that mirrors OpenAI's official specification:
{
"role": "user",
"content": [
{
"type": "audio_url",
"audio_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg" }
},
{
"type": "image_url",
"image_url": { "url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg" }
},
{
"type": "text",
"text": "Describe what is happening in this clip in as much detail as possible."
}
]
}
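For example, with the Python openai client pointed at a local mistral.rs server (a sketch; the server is assumed to be running a Gemma 3n vision model on port 1234):

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg"},
                },
                {
                    "type": "text",
                    "text": "Describe what is happening in this clip in as much detail as possible.",
                },
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)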
Rust SDK
use anyhow::Result;
use mistralrs::{AudioInput, IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let audio_bytes = reqwest::blocking::get(
"https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg",
)?
.bytes()?
.to_vec();
let audio = AudioInput::from_bytes(&audio_bytes)?;
let image_bytes = reqwest::blocking::get(
"https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg",
)?
.bytes()?
.to_vec();
let image = image::load_from_memory(&image_bytes)?;
let messages = VisionMessages::new()
.add_multimodal_message(
TextMessageRole::User,
"Describe in detail what is happening.",
vec![image],
vec![audio],
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
Ok(())
}
With this, you now have a single-call pipeline that fuses sound, vision, and text – all running locally through mistral.rs! 🔥
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
GLM4 Model
GLM4 is a series of open, multilingual, and multimodal large language models. The text-to-text LLM backbones in GLM4 are supported by mistral.rs.
HTTP API
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="THUDM/GLM-4-9B-0414",
arch=Architecture.GLM4,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
GLM-4.7-Flash (MoE): zai-org/GLM-4.7-Flash
GLM-4.7-Flash is a mixture of experts (MoE) model from the GLM family with MLA (Multi-head Latent Attention) architecture.
HTTP API
Start the server:
mistralrs serve --isq 4 -p 1234 -m zai-org/GLM-4.7-Flash
Send requests using an OpenAI-compatible client:
import openai
client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="zai-org/GLM-4.7-Flash",
arch=Architecture.GLM4MoeLite,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("zai-org/GLM-4.7-Flash")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
GLM-4.7 (MoE): zai-org/GLM-4.7
GLM-4.7 is a mixture of experts (MoE) model from the GLM family with standard GQA attention and partial RoPE.
HTTP API
Start the server:
mistralrs serve --isq 4 -p 1234 -m zai-org/GLM-4.7
Send requests using an OpenAI-compatible client:
import openai
client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="zai-org/GLM-4.7",
arch=Architecture.GLM4Moe,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("zai-org/GLM-4.7")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
GPT-OSS
GPT-OSS is a Mixture of Experts (MoE) language model with specialized attention mechanisms and efficient quantization. Key features include:
- MXFP4 quantized MoE experts for efficient inference
- Per-head attention sinks for improved attention patterns
- YARN RoPE scaling for extended context
- Hybrid cache supporting both full and sliding window attention
mistralrs run -m openai/gpt-oss-20b
Note: GPT-OSS MoE experts are pre-quantized in MXFP4 format. ISQ can be applied to attention layers only.
Note: PagedAttention is not supported for GPT-OSS due to custom attention with sinks.
HTTP API
You can find a more detailed example here.
mistralrs serve -p 1234 -m openai/gpt-oss-20b
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
You can find a more detailed example here.
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="openai/gpt-oss-20b",
arch=Architecture.GptOss,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find a more detailed example here.
use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("openai/gpt-oss-20b")
.with_logging()
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Technical Details
MXFP4 Quantization
GPT-OSS MoE experts use MXFP4 (4-bit microscaling floating point) quantization for compact and efficient storage:
gate_up_proj: Packed experts with MXFP4 weightsdown_proj: Packed experts with MXFP4 weights- Scales stored at 1 byte per 32 elements
Attention with Sinks
The model uses per-head attention sinks that are added to attention logits before softmax, helping to regularize attention patterns. This custom attention mechanism is incompatible with PagedAttention.
ISQ Support
In-situ quantization (ISQ) can be applied to attention projection layers:
- q_proj, k_proj, v_proj, o_proj
- lm_head
MoE expert layers are already MXFP4 quantized and excluded from ISQ.
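For example, requesting ISQ through the Python SDK only affects those layers, since the MXFP4 experts are skipped. A minimal sketch, reusing the Runner/Which API from the Python SDK example above (the in_situ_quant value mirrors the other guides):

from mistralrs import Runner, Which, Architecture

# ISQ quantizes only the q/k/v/o projections and lm_head here;
# the MoE experts keep their original MXFP4 weights.
runner = Runner(
    which=Which.Plain(
        model_id="openai/gpt-oss-20b",
        arch=Architecture.GptOss,
    ),
    in_situ_quant="4",
)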
Qwen 3: collection
The Qwen 3 family is a collection of hybrid reasoning MoE and non-MoE models ranging from 0.6B to 235B parameters.
mistralrs run --isq 4 -m Qwen/Qwen3-8B
mistralrs run --isq 4 -m Qwen/Qwen3-30B-A3B
Note: mistral.rs can load all FP8 pre-quantized versions natively! Simply replace the model ID.
Note: tool calling support is fully implemented for the Qwen 3 models, including agentic web search.
Enabling thinking
The Qwen 3 models are hybrid reasoning models which can be controlled at inference time. By default, reasoning is enabled. To control this dynamically, it is recommended to append /no_think or /think to your prompt. Alternatively, you can specify the enable_thinking flag as shown in the API-specific examples.
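As a quick illustration of the prompt-based route (a minimal sketch against the OpenAI-compatible server started as in the HTTP example below; only the /no_think suffix matters here):

import openai

client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Appending /no_think to the user prompt disables reasoning for this turn.
completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Summarize Rust's borrow checker in one sentence. /no_think"}
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)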
HTTP API
You can find a more detailed example demonstrating enabling/disabling thinking here.
mistralrs serve --isq 4 -p 1234 -m Qwen/Qwen3-8B
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
# enable_thinking=False,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="Qwen/Qwen3-8B",
arch=Architecture.Qwen3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
# enable_thinking=False,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("Qwen/Qwen3-8B")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
// .enable_thinking(false)
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
SmolLM3: HuggingFaceTB/SmolLM3-3B
SmolLM3 is a 3B-parameter, long-context, hybrid reasoning language model. It supports six languages and advanced reasoning, and as a fully open model it offers strong performance at the 3B–4B scale.
Default, easiest:
mistralrs run --isq 8 -m HuggingFaceTB/SmolLM3-3B
UQFF prequantized:
mistralrs run -m EricB/SmolLM3-3B-UQFF --from-uqff smollm33b-q4k-0.uqff
Note: tool calling support is fully implemented for the SmolLM3 models, including agentic web search.
Check out prequantized UQFF SmolLM3 here: https://huggingface.co/EricB/SmolLM3-3B-UQFF
Enabling thinking
The SmolLM3 models are hybrid reasoning models which can be controlled at inference time. By default, reasoning is enabled. To control this dynamically, it is recommended to append /no_think or /think to your prompt. Alternatively, you can specify the enable_thinking flag as shown in the API-specific examples.
HTTP API
You can find a more detailed example demonstrating enabling/disabling thinking here.
mistralrs serve --isq 8 -p 1234 -m HuggingFaceTB/SmolLM3-3B
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
# enable_thinking=False,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="HuggingFaceTB/SmolLM3-3B",
arch=Architecture.SmolLm3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
# enable_thinking=False,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("HuggingFaceTB/SmolLM3-3B")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
// .enable_thinking(false)
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Idefics 2 Model: HuggingFaceM4/idefics2-8b-chatty
The Idefics 2 Model has support in the Rust, Python, and HTTP APIs. The Idefics 2 Model also supports ISQ for increased performance.
Note: Some of the examples use our Cephalo model series, but any compatible model ID can be used.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Prompt:
What is shown in this image?
Output:
The image depicts a group of orange ants climbing over a black pole. The ants are moving in the same direction, forming a line as they ascend the pole.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m HuggingFaceM4/idefics2-8b-chatty
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
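Since the linked base64 and local-file examples are not reproduced here, the snippet below sketches both routes for the request above; the file name is a placeholder and the data-URL form follows the usual OpenAI convention:

import base64
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# For a local image, the plain file path can also be passed as the "url" value.
with open("photo.jpg", "rb") as f:  # placeholder path
    b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)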
Rust
You can find this example here.
This is a minimal example of running the Idefics 2 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(
"HuggingFaceM4/idefics2-8b-chatty",
)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_idefics_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="lamm-mit/Cephalo-Idefics-2-vision-8b-beta",
arch=VisionArchitecture.Idefics2,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Idefics 3 Vision: HuggingFaceM4/Idefics3-8B-Llama3
Mistral.rs supports the Idefics 3 vision model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. Check the Hugging Face text model config for more information or raise an issue.
Using the 🤗 Smol VLM models
Simply substitute the Idefics 3 model ID (HuggingFaceM4/Idefics3-8B-Llama3) with the Smol VLM one (HuggingFaceTB/SmolVLM-Instruct)!
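Through the Python SDK, for example, this is just the Idefics 3 loader from further below with the model ID swapped (a minimal sketch; the architecture is assumed to remain Idefics3):

from mistralrs import Runner, Which, VisionArchitecture

# Same loader as for Idefics 3; only the model ID changes for SmolVLM.
runner = Runner(
    which=Which.VisionPlain(
        model_id="HuggingFaceTB/SmolVLM-Instruct",
        arch=VisionArchitecture.Idefics3,
    ),
)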
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
- Start up interactive mode with the Idefics 3 model
mistralrs run vision --isq 4 -m HuggingFaceM4/Idefics3-8B-Llama3
- Ask a question
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image depicts a single, large, red rose in full bloom. The rose is positioned against a blurred background that suggests a natural setting, possibly outdoors. The petals of the rose are vividly red with a slight sheen, indicating that they are wet, likely from recent rainfall or dew. The petals are tightly packed and have a velvety texture, which is characteristic of roses. The edges of the petals are slightly curled and appear to be glistening with water droplets, enhancing the overall freshness and beauty of the flower.
The stem of the rose is visible and appears to be green, with a few small thorns scattered along its length. The stem is slender and supports the weight of the large, showy head of the rose. The leaves that accompany the stem are not fully visible in the image but are implied by the presence of the stem.
The background is out of focus, which helps to emphasize the rose as the main subject of the image. The blurred background suggests a natural environment, possibly a garden or a field, with hints of greenery and possibly other flowers or plants. The lighting in the image is natural, likely from sunlight, which casts soft shadows on the petals and adds depth to the scene.
The overall composition of the image focuses on the rose, making it the central point of interest. The wetness of the petals adds a dynamic element to the stillness of the flower, giving it a sense of life and vitality. This could symbolize themes of beauty, nature, and perhaps even passion or love.
In summary, this image captures a single red rose in full bloom with wet petals against a blurred natural background. The rose is the focal point, with its vibrant red color and glistening petals drawing attention. The natural lighting and out-of-focus background enhance the beauty and freshness of the flower.
- Continue the chat by passing another image.
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image depicts a single, large, red rose in full bloom. The rose is positioned against a blurred background that suggests a natural setting, possibly outdoors. The petals of the rose are vividly red with a slight sheen, indicating that they are wet, likely from recent rainfall or dew. The petals are tightly packed and have a velvety texture, which is characteristic of roses. The edges of the petals are slightly curled and appear to be glistening with water droplets, enhancing the overall freshness and beauty of the flower.
The stem of the rose is visible and appears to be green, with a few small thorns scattered along its length. The stem is slender and supports the weight of the large, showy head of the rose. The leaves that accompany the stem are not fully visible in the image but are implied by the presence of the stem.
The background is out of focus, which helps to emphasize the rose as the main subject of the image. The blurred background suggests a natural environment, possibly a garden or a field, with hints of greenery and possibly other flowers or plants. The lighting in the image is natural, likely from sunlight, which casts soft shadows on the petals and adds depth to the scene.
The overall composition of the image focuses on the rose, making it the central point of interest. The wetness of the petals adds a dynamic element to the stillness of the flower, giving it a sense of life and vitality. This could symbolize themes of beauty, nature, and perhaps even passion or love.
In summary, this image captures a single red rose in full bloom with wet petals against a blurred natural background. The rose is the focal point, with its vibrant red color and glistening petals drawing attention. The natural lighting and out-of-focus background enhance the beauty and freshness of the flower.
> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg What mountain is this?
The mountain is Mount Washington.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image depicts a majestic mountain landscape under a partly cloudy sky, characterized by its rugged and snow-covered peaks. The mountain is prominently featured in the center of the image, showcasing its expansive and undulating terrain. The summit of the mountain is capped with snow, indicating that it might be winter or early springtime.
The slopes of the mountain are steep and uneven, covered with patches of snow that appear to have been recently fallen or freshly groomed for skiing or other winter activities. There are visible ski trails descending from the summit down towards what seems to be a valley below, suggesting that this location could be a popular ski resort area.
In addition to the main peak, there are smaller hills and ridges surrounding it on both sides. These secondary peaks also have varying degrees of snow cover but appear less prominent than the central peak.
The sky above is mostly overcast with clouds covering most parts but allowing some sunlight to peek through in certain areas, casting soft shadows on parts of the mountainside. This lighting suggests that it might not be midday yet as there isn't an intense brightness typical for noon hours.
On closer inspection near one side of this grandeur scene stands tall trees without leaves; their bare branches starkly contrasting against both white snow and blue sky create an interesting... (cut off)
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m HuggingFaceM4/Idefics3-8B-Llama3
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "HuggingFaceM4/Idefics3-8B-Llama3";
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q8_0)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="HuggingFaceM4/Idefics3-8B-Llama3",
arch=VisionArchitecture.Idefics3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
UQFF models
Coming soon!
LLaVA and LLaVANext Model: llava-hf model family
LLaVA and LLaVANext are multimodal models that can handle both text and vision inputs.
This implementation supports both LLaVA and LLaVANext (which adds multi-resolution image processing) and two LLM base models: Llama and Mistral. It is currently tested on:
- llava-hf/llava-v1.6-mistral-7b-hf
- llava-hf/llava-v1.6-vicuna-7b-hf
- llava-hf/llava-1.5-7b-hf
The LLaVA and LLaVANext models have support in the Rust, Python, and HTTP APIs. They also support ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image?
Output:
The image shows a steep, snow-covered hillside with a pine tree on the right side, close to the top. The landscape appears to be a mountainous area with winter conditions. There are no visible humans or permanent structures in the immediate vicinity that suggest this is a summer or recreational location. It's likely a cold, snowy day or season, and the slopes might be part of a mountainous region.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m llava-hf/llava-v1.6-mistral-7b-hf
# or for vicuna backend, specify the chat template:
mistralrs serve vision -p 1234 --isq 4 -c ./chat_templates/vicuna.json -m llava-hf/llava-v1.6-vicuna-7b-hf
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the LLaVA and LLaVANext model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(
"llava-hf/llava-v1.6-mistral-7b-hf",
)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_llava_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="llava-hf/llava-v1.6-mistral-7b-hf",
arch=VisionArchitecture.LLaVANext,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Llama 3.2 Vision Model: meta-llama/Llama-3.2-11B-Vision-Instruct
Mistral.rs supports the Llama 3.2 vision model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: Some examples use the Cephalo Llama 3.2 model, a member of the Cephalo model collection. This model is a finetune of Llama 3.2 with enhanced capabilities for scientific images. To use the base Llama 3.2 Vision model, simply use the associated model ID.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. The text model has 40 layers.
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
https://github.com/user-attachments/assets/4d11c35c-9ea2-42b8-8cab-5f7e8e2ee9ff
- Start up interactive mode with the Llama 3.2 model
mistralrs run vision --isq 4 -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
- Say hello!
> Hello!
How can I assist you today?
- Pass the model an image and ask a question.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
- Continue the chat by passing another image.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg What mountain is this?
The image appears to be of Mount Washington, which is the highest peak in the Northeastern United States. It is located in the White Mountains of New Hampshire and is known for its extreme weather conditions, including high winds and low temperatures. The mountain's summit reaches an elevation of approximately 6,288 feet (1,917 meters) above sea level.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image shows Mount Washington, the highest peak in the Northeastern United States, located in the White Mountains of New Hampshire. The scene captures the mountain's rugged terrain and varied landscape features.
In the foreground, there are dense forests of coniferous trees, primarily spruce and fir, which are typical of the region's boreal forest ecosystem. The trees are densely packed, indicating a high level of vegetation cover and biodiversity.
Moving upwards, the image reveals rocky outcroppings and boulders scattered across the slope, indicating the mountain's geological history of glacial activity. The presence of these rocks suggests that the area was once covered by ice sheets during the last ice age, which carved out the landscape and left behind a mix of boulders and talus slopes.
In the mid-ground, the image shows a series of ridges and valleys, which are characteristic of the mountain's glacially sculpted terrain. These features were formed by the movement of ice sheets that carved out U-shaped valleys and left behind a series of rounded hills and ridges.
At the summit, there is a prominent observation tower or weather station, which is likely used for scientific research and weather monitoring. The structure is situated at an elevation of approximately 6,288 feet (1,917 meters) above sea level, making it one of the highest points in the region.
The image also captures the atmospheric conditions on Mount Washington, with clouds and mist visible in the background. The mountain's unique location in a region where cold Arctic air meets warm moist air from the Gulf Stream creates a unique microclimate known as the "Home Rule," where extreme weather conditions can occur.
Overall, the image showcases the diverse geological and ecological features of Mount Washington, highlighting its role as a significant natural landmark in the Northeastern United States.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k";
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.VLlama,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
UQFF models
UQFF is a quantized file format similar to GGUF based on ISQ. It removes the memory and compute requirements that come with ISQ by providing ready-made quantizations! The key advantage over GGUF is the flexibility to store multiple quantizations in one file.
We provide UQFF files (EricB/Llama-3.2-11B-Vision-Instruct-UQFF) for this Llama 3.2 Vision model.
You can use these UQFF files to easily use quantized versions of Llama 3.2 Vision.
For example:
mistralrs run -m meta-llama/Llama-3.2-11B-Vision-Instruct --from-uqff EricB/Llama-3.2-11B-Vision-Instruct-UQFF/llama-3.2-11b-vision-q4k.uqff
Llama 4 Series: meta-llama/Llama-4-Scout-17B-16E-Instruct
🚧 We are preparing a collection of UQFF quantized models! 🚧
The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences.
Architecture:
- Efficient inference: 17B activated parameters
- Very sparse: 1 activated expert for both Scout (of 16), and Maverick (of 128)
- RoPE enhancement: iRoPE enables high context-length functionality
Integration in mistral.rs:
- Tool calling + Automatic web search
- ISQ
- Rust, Python and HTTP APIs
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:
Credit
Prompt:
Please describe this image in detail.
Output:
The image presents a breathtaking mountain landscape, with a snow-capped peak dominating the scene. The mountain's rugged terrain is characterized by numerous ridges and valleys, while its summit is adorned with several structures that appear to be communication towers or antennas.
**Key Features:**
* **Mountain:** The mountain is the central focus of the image, showcasing a mix of snow-covered and bare areas.
* **Sky:** The sky above the mountain features a dramatic display of clouds, with dark grey clouds at the top gradually giving way to lighter blue skies towards the bottom.
* **Valley:** In the foreground, a valley stretches out, covered in trees that are mostly bare, suggesting a winter setting.
* **Lighting:** The lighting in the image is striking, with the sun casting a warm glow on the mountain's snow-covered slopes while leaving the surrounding areas in shadow.
**Overall Impression:**
The image exudes a sense of serenity and majesty, capturing the beauty of nature in a dramatic and awe-inspiring way. The contrast between the snow-covered mountain and the bare trees in the valley creates a visually appealing scene that invites the viewer to appreciate the natural world.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m meta-llama/Llama-4-Scout-17B-16E-Instruct
- Send a request
from openai import OpenAI
import httpx
import textwrap
import json
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "Please describe this image in detail.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Llama 4 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(
"meta-llama/Llama-4-Scout-17B-16E-Instruct",
)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is this?",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
arch=VisionArchitecture.Llama4,
),
in_situ_quant="4",
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
MiniCPM-O 2.6 Model: openbmb/MiniCPM-o-2_6
Mistral.rs supports the MiniCPM-O 2.6 model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are coming soon.
Note
Only the vision portion of this model has been implemented. No audio features are supported yet.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
- Start up interactive mode with the MiniCPM-O 2.6 model
mistralrs run vision --isq 4 -m openbmb/MiniCPM-o-2_6
- Say hello!
> Hello!
How can I assist you today?
- Pass the model an image and ask a question.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image shows Mount Washington, the highest peak in the Northeastern United States, located in the White Mountains of New Hampshire. The scene captures the mountain's rugged terrain and varied landscape features.
In the foreground, there are dense forests of coniferous trees, primarily spruce and fir, which are typical of the region's boreal forest ecosystem. The trees are densely packed, indicating a high level of vegetation cover and biodiversity.
Moving upwards, the image reveals rocky outcroppings and boulders scattered across the slope, indicating the mountain's geological history of glacial activity. The presence of these rocks suggests that the area was once covered by ice sheets during the last ice age, which carved out the landscape and left behind a mix of boulders and talus slopes.
In the mid-ground, the image shows a series of ridges and valleys, which are characteristic of the mountain's glacially sculpted terrain. These features were formed by the movement of ice sheets that carved out U-shaped valleys and left behind a series of rounded hills and ridges.
At the summit, there is a prominent observation tower or weather station, which is likely used for scientific research and weather monitoring. The structure is situated at an elevation of approximately 6,288 feet (1,917 meters) above sea level, making it one of the highest points in the region.
The image also captures the atmospheric conditions on Mount Washington, with clouds and mist visible in the background. The mountain's unique location in a region where cold Arctic air meets warm moist air from the Gulf Stream creates a unique microclimate known as the "Home Rule," where extreme weather conditions can occur.
Overall, the image showcases the diverse geological and ecological features of Mount Washington, highlighting its role as a significant natural landmark in the Northeastern United States.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m openbmb/MiniCPM-o-2_6
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "openbmb/MiniCPM-o-2_6";
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "openbmb/MiniCPM-o-2_6"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.MiniCpmO,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Mistral Small 3.1 Model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
The Mistral Small 3.1 model is a multimodal (text+vision) model with a 128k context length, function calling, and strong visual understanding.
We support the Mistral 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Tool calling with Mistral Small 3.1
The Mistral Small 3.1 model itself does not ship with the correct Jinja chat template to enable tool calling. We provide a chat template for tool calling with Mistral Small 3.1, which you can use by specifying the jinja_explicit parameter in the various APIs. For example:
mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
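Once the server is running with that template, tool definitions go through the standard OpenAI tools field. A minimal sketch in which the get_weather function and its schema are hypothetical:

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Hypothetical tool; the model decides whether to emit a tool call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the weather in Boston right now?"}],
    tools=tools,
)
print(completion.choices[0].message.tool_calls)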
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:
Credit
Prompt:
What is this?
Output:
The image shows a close-up of a vibrant flower with pink petals and a central cluster of yellowish-brown stamens. This flower appears to be from the genus *Gazania*, commonly known as treasure flowers or gazanias. These flowers are known for their daisy-like appearance and bright colors.
Gazania flowers typically have ray florets (the petal-like structures) that can change color based on light conditions—often appearing more vibrant in direct sunlight. They are popular in gardens for their hardiness and ability to thrive in sunny locations with well-drained soil.
If there's anything specific about this flower or its care that interests you further, feel free to ask!
- Start the server
mistralrs serve vision -p 1234 -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
- Send a request
from openai import OpenAI
import httpx
import textwrap
import json
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/f/fd/Pink_flower.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Mistral 3 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
arch=VisionArchitecture.Mistral3,
),
in_situ_quant="4"
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Phi 3.5 MoE Model: microsoft/Phi-3.5-MoE-instruct
The Phi 3.5 MoE model is a 16x3.8B parameter decoder-only text-to-text mixture-of-experts LLM.
- Context length of 128k tokens
- Trained on 4.9T tokens
- 16 experts (16x3.8B parameters) with 6.6B active parameters
- Expect inference performance of a 7B model
About the MoE mechanism (see the sketch after this list):
- Compute the router gating logits
- From the router gating logits, select the top-2 experts and their associated weights
- The hidden state for each token in the sequence is computed by applying each selected expert to that token and weighting the result
- If multiple experts are selected for a token, this becomes a weighted sum
- The design is flexible: 2 or 1 experts can be selected, enabling dense or sparse gating
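A minimal numpy sketch of that top-2 routing (names and shapes are illustrative only, not the model's actual code):

import numpy as np

def moe_layer(hidden, router_w, experts, top_k=2):
    # hidden: (tokens, d); router_w: (d, n_experts); experts: list of callables (d,) -> (d,)
    logits = hidden @ router_w                         # router gating logits
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # indices of the top-k experts per token
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                   # weights over the selected experts
        for weight, e in zip(w, top[t]):               # weighted sum of expert outputs
            out[t] += weight * experts[e](hidden[t])
    return out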
mistralrs run --isq 4 -m microsoft/Phi-3.5-MoE-instruct
Note
This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:
mistralrs run --isq 4 -m microsoft/Phi-3.5-MoE-instruct --isq-organization moqe
HTTP API
mistralrs serve --isq 4 -p 1234 -m microsoft/Phi-3.5-MoE-instruct
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="microsoft/Phi-3.5-MoE-instruct",
arch=Architecture.Phi3_5MoE,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("microsoft/Phi-3.5-MoE-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Phi 3 Vision Model: microsoft/Phi-3.5-vision-instruct
The Phi 3 Vision Model has support in the Rust, Python, and HTTP APIs. The Phi 3 Vision Model supports ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: The Phi 3 Vision model works best with a single image, although sending multiple images is supported.
Note: When sending multiple images, they will be resized to the minimum dimension in which all of them fit without cropping. Aspect ratio is not preserved in that case.
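Sending several images just means adding more image_url parts to the same user message; a minimal sketch of such a request (both URLs are placeholders):

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Two image parts in one user turn; they are resized together as described above.
completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/first.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/second.jpg"}},
                {"type": "text", "text": "Compare these two images."},
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)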
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image captures a breathtaking view of a mountain peak, bathed in the soft glow of sunlight. The peak, dusted with a layer of snow, stands tall against the backdrop of a clear blue sky. A trail, etched into the mountain's side by countless hikers before it, winds its way up to the summit. The trail's white color contrasts sharply with the surrounding landscape, drawing attention to its path and inviting exploration.
The perspective from which this photo is taken offers an expansive view of the mountain and its surroundings. It seems as if one could look down from this vantage point and see miles upon miles of untouched wilderness stretching out into the distance. The colors in the image are predominantly blue and white, reflecting both sky and snow-covered mountains respectively. However, there are also hints of green from trees dotting lower parts of mountainsides or valleys below them - adding another layer to this picturesque scene. This serene landscape evokes feelings of tranquility and adventure at once - an invitation to explore nature's grandeur while respecting its majesty at all times!
- Start the server
mistralrs serve vision -p 1234 -m microsoft/Phi-3.5-vision-instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Phi 3 Vision model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("microsoft/Phi-3.5-vision-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_phiv_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="microsoft/Phi-3.5-vision-instruct",
arch=VisionArchitecture.Phi3V,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/e/e7/ Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Phi 4 Multimodal Model: microsoft/Phi-4-multimodal-instruct
The Phi 4 Multimodal Model has support in the Rust, Python, and HTTP APIs. The Phi 4 Multimodal Model supports ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: The Phi 4 Multimodal model works best with a single image, although sending multiple images is supported.
Note: when sending multiple images, they will be resized to the minimum dimension at which all of them fit without cropping. Aspect ratio is not preserved in that case.
Phi 4 Multimodal also supports audio inputs!
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
A mountain with snow on it.
- Start the server
mistralrs serve vision -p 1234 -m microsoft/Phi-4-multimodal-instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Phi 4 Multimodal model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("microsoft/Phi-4-multimodal-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="microsoft/Phi-4-multimodal-instruct",
arch=VisionArchitecture.Phi4MM,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/e/e7/Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Audio input
Alongside vision, Phi 4 Multimodal in mistral.rs can accept audio as an additional modality. This unlocks fully-local pipelines such as text + speech + vision → text where the model can reason jointly over what it hears and what it sees.
mistral.rs automatically decodes the supplied audio (WAV/MP3/FLAC/OGG/… – anything Symphonia can handle) into 16-bit PCM.
OpenAI HTTP API
Audio is delivered with the audio_url content-type that mirrors OpenAIʼs official specification:
{
"role": "user",
"content": [
{
"type": "audio_url",
"audio_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg" }
},
{
"type": "image_url",
"image_url": { "url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg" }
},
{
"type": "text",
"text": "Describe what is happening in this clip in as much detail as possible."
}
]
}
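For reference, here is a minimal sketch of sending that payload with the openai Python client, assuming the server was started on port 1234 as in the earlier examples:
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg"},
                },
                {
                    "type": "text",
                    "text": "Describe what is happening in this clip in as much detail as possible.",
                },
            ],
        },
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)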
Rust SDK
use anyhow::Result;
use mistralrs::{AudioInput, IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new("microsoft/Phi-4-multimodal-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let audio_bytes = reqwest::blocking::get(
"https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg",
)?
.bytes()?
.to_vec();
let audio = AudioInput::from_bytes(&audio_bytes)?;
let image_bytes = reqwest::blocking::get(
"https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg",
)?
.bytes()?
.to_vec();
let image = image::load_from_memory(&image_bytes)?;
let messages = VisionMessages::new()
.add_multimodal_message(
TextMessageRole::User,
"Describe in detail what is happening.",
vec![image],
vec![audio],
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
Ok(())
}
With this, you now have a single-call pipeline that fuses sound, vision, and text – all running locally through mistral.rs! 🔥
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Qwen 2 Vision Model: Qwen2-VL Collection
Mistral.rs supports the Qwen2-VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. The text model has 28 layers.
ToC
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
- Start up interactive mode with the Qwen2-VL model
mistralrs run vision -m Qwen/Qwen2-VL-2B-Instruct
- Say hello!
> Hello!
Hello! How can I assist you today?
- Pass the model an image and ask a question.
> Hello!
Hello! How can I assist you today?
> \image https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg What type of flower is this? Give some fun facts.
flowers are a type of flowering plant that produce flowers that are typically used for decoration, pollination, and reproduction. there are many different types of flowers, each with its own unique characteristics and uses. here are some fun facts about camellias:
* camellias are native to china and have been cultivated for over 2,000 years.
* camellias are known for their long blooming season, with some varieties blooming continuously for months.
* camellias come in a wide variety of colors, including red, pink, white, and yellow.
* camellias are also known for their fragrant blooms, which can be enjoyed by both humans and animals.
* camellias are often used in gardens and parks as a decorative element, and are also popular in landscaping and horticulture.
camellias are also known for their resilience and ability to thrive in a variety of conditions, making them a popular choice for gardeners and landscapers. they require well-draining soil and full sun or partial shade, and can be grown in containers or in the ground. overall, camellias are a beautiful and versatile flower that can add beauty and interest to any garden or landscape.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Prompt:
What type of flower is this? Give some fun facts.
Output:
flowers are a beautiful addition to any garden or outdoor space. They come in many different colors and shapes, and can be used for both decorative purposes and as sources of pollination for bees and other insects.
One fun fact about camellias is that they are native to Japan, but were introduced to Europe in the 17th century by Portuguese sailors who brought them back from their voyages around the world. Camellias have been popular as ornamental plants since then, with many varieties available for cultivation.
Camellias also have interesting cultural significance in Japan, where they are often associated with good fortune and prosperity. In Chinese culture, camellias symbolize longevity and immortality.
In conclusion, camellias are beautiful flowers that add color and interest to gardens or outdoor spaces. They come in many different colors and shapes, making them a popular choice for gardeners everywhere!
- Start the server
mistralrs serve vision -p 1234 -m Qwen/Qwen2-VL-2B-Instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "Qwen/Qwen2-VL-2B-Instruct";
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What type of flower is this? Give some fun facts.",
image,
&model
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.Qwen2VL,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Qwen 3 Vision Model: Qwen3 VL Collection
The Qwen 3 VL models are the successors to the Qwen 2.5 VL models, featuring a diverse lineup with improved performance, flexible sizes, and reasoning-capable variants.
Note: Support for the MoE variants is not yet implemented. This is coming very soon!
Mistral.rs supports the Qwen 3 VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters.
ToC
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
Start up interactive mode with the Qwen3 VL model:
mistralrs run vision -m Qwen/Qwen3-VL-4B-Instruct
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
- Start the server
mistralrs serve vision -p 1234 -m Qwen/Qwen3-VL-4B-Instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new("Qwen/Qwen3-VL-4B-Instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is this?",
vec![image],
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "Qwen/Qwen3-VL-4B-Thinking"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.Qwen3VL,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
FLUX.1 Model: black-forest-labs/FLUX.1-schnell
The FLUX model is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions.
We support both the -schnell and -dev versions of the model.
Memory usage
The FLUX model itself is 12 billion parameters (~24GB), and the T5 XXL encoder model it uses requires ~9GB. We support loading the models fully onto the GPU, which allows much faster inference. If you do not have enough memory, try the offloaded (-offloaded or -Offloaded) model types. These will load the model on the CPU but perform computations on the GPU.
| Type | Memory requirement | Generation Time (s), A100 |
|---|---|---|
| Normal | ~33GB | 9.4 |
| Offloaded | ~4GB | 92.7 |
HTTP server
The OpenAI HTTP server provides a compatible way to easily use this implementation. As per the specification, output images can be returned as local paths to images or be encoded to base64.
mistralrs serve diffusion -p 1234 -m black-forest-labs/FLUX.1-schnell -a flux
After this, you can send requests via the HTTP server:
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
result = client.images.generate(
model="default",
prompt="A vibrant sunset in the mountains, 4k, high quality.",
n=1,
)
print(result.data[0].url)
Rust example
use std::time::Instant;
use anyhow::Result;
use mistralrs::{DiffusionLoaderType, DiffusionModelBuilder, ImageGenerationResponseFormat};
#[tokio::main]
async fn main() -> Result<()> {
let model = DiffusionModelBuilder::new(
"black-forest-labs/FLUX.1-schnell",
DiffusionLoaderType::FluxOffloaded,
)
.with_logging()
.build()
.await?;
let start = Instant::now();
let response = model
.generate_image(
"A vibrant sunset in the mountains, 4k, high quality.".to_string(),
ImageGenerationResponseFormat::Url,
)
.await?;
let finished = Instant::now();
println!(
"Done! Took {} s. Image saved at: {}",
finished.duration_since(start).as_secs_f32(),
response.data[0].url.as_ref().unwrap()
);
Ok(())
}
Python example
from mistralrs import (
Runner,
Which,
DiffusionArchitecture,
ImageGenerationResponseFormat,
)
runner = Runner(
which=Which.DiffusionPlain(
model_id="black-forest-labs/FLUX.1-schnell",
arch=DiffusionArchitecture.FluxOffloaded,
),
)
res = runner.generate_image(
"A vibrant sunset in the mountains, 4k, high quality.",
ImageGenerationResponseFormat.Url,
)
print(res.choices[0].url)
Dia 1.6b Model: nari-labs/Dia-1.6B
Dia is a 1.6B parameter text to speech model created by Nari Labs. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.
- Generate dialogue via the [S1] and [S2] tags
- Generate non-verbals like (laughs), (coughs), etc.
- The verbal tags below will be recognized, but might result in unexpected output: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
Note: voice cloning support is coming!
HTTP server
The OpenAI HTTP server provides a drop-in compatible way to easily use Dia locally!
Note: we only support pcm and wav outputs.
mistralrs serve speech -p 1234 -m nari-labs/Dia-1.6B -a dia
After this, you can send requests via the HTTP server:
from pathlib import Path
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
# text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github."
response = client.audio.speech.create(
model="default", voice="N/A", input=text_to_speak, response_format="wav"
)
output_path = Path("output.wav")
output_path.write_bytes(response.read())
print(f"WAV audio written to {output_path.resolve()}")
Rust example
use std::time::Instant;
use anyhow::Result;
use mistralrs::{speech_utils, SpeechLoaderType, SpeechModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = SpeechModelBuilder::new("nari-labs/Dia-1.6B", SpeechLoaderType::Dia)
.with_logging()
.build()
.await?;
let start = Instant::now();
// let text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face.";
let text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github.";
let (pcm, rate, channels) = model.generate_speech(text_to_speak).await?;
let finished = Instant::now();
let mut output = std::fs::File::create("out.wav").unwrap();
speech_utils::write_pcm_as_wav(&mut output, &pcm, rate as u32, channels as u16).unwrap();
println!(
"Done! Took {} s. Audio saved at `out.wav`.",
finished.duration_since(start).as_secs_f32(),
);
Ok(())
}
Python example
from mistralrs import (
Runner,
Which,
SpeechLoaderType,
)
from pathlib import Path
import wave, struct
# text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github."
runner = Runner(
which=Which.Speech(
model_id="nari-labs/Dia-1.6B",
arch=SpeechLoaderType.Dia,
),
)
res = runner.generate_speech(text_to_speak)
print(res.choices[0].url)
pcm_data = res.pcm # list of floats between -1.0 and 1.0
output_path = Path("output.wav")
# convert floats to 16-bit PCM ints
pcm_ints = [int(max(-32768, min(32767, int(sample * 32767)))) for sample in pcm_data]
with wave.open(str(output_path), "wb") as wf:
wf.setnchannels(res.channels) # channel count reported by the model
wf.setsampwidth(2) # 2 bytes per sample (16-bit)
wf.setframerate(res.rate) # sample rate reported by the model
wf.writeframes(b"".join(struct.pack("<h", s) for s in pcm_ints))
print(f"WAV audio written to {output_path.resolve()}")
EmbeddingGemma
EmbeddingGemma was the first embedding model supported by mistral.rs. This guide walks through serving the model via the OpenAI-compatible HTTP server, running it from Python, and embedding text directly in Rust.
For a catalog of available embedding models and general usage tips, see EMBEDDINGS.md.
Prompt instructions
EmbeddingGemma can generate optimized embeddings for various use cases (such as document retrieval, question answering, and fact verification) or for specific input types (a query or a document) using prompts that are prepended to the input strings. A small formatting sketch follows the table below.
- Query prompts follow the form task: {task description} | query: , where the task description varies by use case; the default task description is search result.
- Document-style prompts follow the form title: {title | "none"} | text: , where the title is either none (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.
| Use Case (task type enum) | Descriptions | Recommended Prompt |
|---|---|---|
| Retrieval (Query) | Used to generate embeddings that are optimized for document search or information retrieval. | task: search result | query: {content} |
| Retrieval (Document) | Used to generate embeddings that are optimized for document search or information retrieval (document side). | title: {title | "none"} | text: {content} |
| Question Answering | Used to generate embeddings that are optimized for answering natural language questions. | task: question answering | query: {content} |
| Fact Verification | Used to generate embeddings that are optimized for verifying factual correctness. | task: fact checking | query: {content} |
| Classification | Used to generate embeddings that are optimized to classify texts according to preset labels. | task: classification | query: {content} |
| Clustering | Used to generate embeddings that are optimized to cluster texts based on their similarities. | task: clustering | query: {content} |
| Semantic Similarity | Used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases. | task: sentence similarity | query: {content} |
| Code Retrieval | Used to retrieve a code block based on a natural language query, such as sort an array or reverse a linked list. Embeddings of code blocks are computed using retrieval_document. | task: code retrieval | query: {content} |
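For reference, a small sketch of how these prompts can be assembled before sending them to the embeddings endpoint (the helper functions are illustrative, not part of mistral.rs):
def query_prompt(content, task="search result"):
    # Query-style prompt: "task: {task description} | query: {content}"
    return f"task: {task} | query: {content}"

def document_prompt(content, title="none"):
    # Document-style prompt: 'title: {title | "none"} | text: {content}'
    return f"title: {title} | text: {content}"

print(query_prompt("What is graphene?"))
# -> task: search result | query: What is graphene?
print(document_prompt("Graphene is a single layer of carbon atoms.", title="Graphene"))
# -> title: Graphene | text: Graphene is a single layer of carbon atoms.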
HTTP server
Launch the server in embedding mode to expose an OpenAI-compatible /v1/embeddings endpoint:
mistralrs serve -p 1234 -m google/embeddinggemma-300m
Once running, call the endpoint with an OpenAI client or raw curl:
curl http://localhost:1234/v1/embeddings \
-H "Authorization: Bearer EMPTY" \
-H "Content-Type: application/json" \
-d '{"model": "default", "input": ["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"]}'
An example with the OpenAI client can be found here.
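For reference, a minimal sketch of the same request using the official openai Python client (same placeholder API key and base URL as the other examples in these docs):
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
resp = client.embeddings.create(
    model="default",
    input=[
        "task: search result | query: What is graphene?",
        "task: search result | query: What is an apple?",
    ],
)
# One embedding vector per input string
print(len(resp.data), len(resp.data[0].embedding))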
By default the server registers the model as default. To expose it under a custom name or alongside chat
models, run in multi-model mode and assign an identifier in the selector configuration:
{
"embed-gemma": {
"Embedding": {
"model_id": "google/embeddinggemma-300m",
"arch": "embeddinggemma"
}
}
}
See docs/HTTP.md for the full request schema and response layout.
Python SDK
Instantiate Runner with the Which.Embedding selector and request EmbeddingGemma explicitly. The helper method
send_embedding_request returns batched embeddings as Python lists.
from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which
runner = Runner(
which=Which.Embedding(
model_id="google/embeddinggemma-300m",
arch=EmbeddingArchitecture.EmbeddingGemma,
)
)
request = EmbeddingRequest(
input=["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"],
truncate_sequence=True,
)
embeddings = runner.send_embedding_request(request)
print(len(embeddings), len(embeddings[0]))
Refer to this example for a complete runnable script.
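Because the embeddings come back as plain Python lists, downstream similarity scoring needs no extra dependencies. For example, a simple cosine similarity between the two vectors returned above:
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings[0], embeddings[1]))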
Rust SDK
Use the EmbeddingModelBuilder helper from the mistralrs crate to create the model and submit an
EmbeddingRequest:
use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};
#[tokio::main]
async fn main() -> Result<()> {
let model = EmbeddingModelBuilder::new("google/embeddinggemma-300m")
.with_logging()
.build()
.await?;
let embeddings = model
.generate_embeddings(
EmbeddingRequest::builder()
.add_prompt("task: search result | query: What is graphene?")
)
.await?;
println!("Returned {} vectors", embeddings.len());
Ok(())
}
This example lives here, and can be run with:
cargo run --package mistralrs --example embedding_gemma
Qwen3 Embedding
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks.
For a catalog of all embedding backends, see EMBEDDINGS.md.
HTTP server
Serve the model with the OpenAI-compatible endpoint enabled:
mistralrs serve -p 1234 -m Qwen/Qwen3-Embedding-0.6B
Call the endpoint via curl or the OpenAI SDK:
curl http://localhost:1234/v1/embeddings \
-H "Authorization: Bearer EMPTY" \
-H "Content-Type: application/json" \
-d '{"model": "default", "input": ["Graphene conductivity", "Explain superconductors in simple terms."]}'
An example with the OpenAI client can be found here.
To expose the model alongside chat models, register it in your selector configuration using the
qwen3embedding architecture tag:
{
"embed-qwen3": {
"Embedding": {
"model_id": "Qwen/Qwen3-Embedding-0.6B",
"arch": "qwen3embedding"
}
}
}
See docs/HTTP.md for the full request schema.
Python SDK
Instantiate Runner with the embedding selector and request Qwen3 explicitly. The output mirrors the
OpenAI embeddings array shape:
from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which
runner = Runner(
which=Which.Embedding(
model_id="Qwen/Qwen3-Embedding-0.6B",
arch=EmbeddingArchitecture.Qwen3Embedding,
)
)
request = EmbeddingRequest(
input=["Graphene conductivity", "Explain superconductors in simple terms."],
truncate_sequence=True,
)
embeddings = runner.send_embedding_request(request)
print(len(embeddings), len(embeddings[0]))
A ready-to-run version can be found at examples/python/qwen3_embedding.py.
Rust SDK
Use the EmbeddingModelBuilder helper just like with EmbeddingGemma. The example below mirrors the
repository sample:
use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};
#[tokio::main]
async fn main() -> Result<()> {
let model = EmbeddingModelBuilder::new("Qwen/Qwen3-Embedding-0.6B")
.with_logging()
.build()
.await?;
let embeddings = model
.generate_embeddings(
EmbeddingRequest::builder()
.add_prompt("What is graphene?")
.add_prompt("Explain superconductors in simple terms.")
)
.await?;
println!("Returned {} vectors", embeddings.len());
Ok(())
}
You can find the full example at mistralrs/examples/qwen3_embedding/main.rs.
Quantization in mistral.rs
Mistral.rs supports the following quantization methods:
- ⭐ ISQ (read more detail)
- Supported in all plain/vision and adapter models
- Works on all supported devices
- Automatic selection to use the fastest and most accurate method
- Supports:
- Q, K type GGUF quants
- AFQ
- HQQ
- FP8
- GGUF/GGML
- Q, K type
- Supported in GGUF/GGML and GGUF/GGML adapter models
- Supported in all plain/vision and adapter models
- Imatrix quantization is supported
- I quants coming!
- CPU, CUDA, Metal (all supported devices)
- 2, 3, 4, 5, 6, 8 bit
- GPTQ (convert with this script)
- Supported in all plain/vision and adapter models
- CUDA only
- 2, 3, 4, 8 bit
- Marlin kernel support in 4-bit and 8-bit.
- AWQ (convert with this script)
- Supported in all plain/vision and adapter models
- CUDA only
- 4 and 8 bit
- Marlin kernel support in 4-bit and 8-bit.
- HQQ
- Supported in all plain/vision and adapter models via ISQ
- 4, 8 bit
- CPU, CUDA, Metal (all supported devices)
- FP8
- Supported in all plain/vision and adapter models
- CPU, CUDA, Metal (all supported devices)
- BNB
- Supported in all plain/vision and adapter models
- bitsandbytes int8, fp4, nf4 support
- AFQ
- 2, 3, 4, 6, 8 bit
- 🔥 Designed to be fast on Metal!
- Only supported on Metal.
- MLX prequantized
- Supported in all plain/vision and adapter models
Using a GGUF quantized model
- Use the gguf (CLI) / GGUF (Python) model selector
- Provide the GGUF file
mistralrs run --format gguf -f my-gguf-file.gguf
Using ISQ
See the docs
mistralrs run --isq 4 -m microsoft/Phi-3-mini-4k-instruct
Using a GPTQ quantized model
- Provide the model ID for the GPTQ model
- Mistral.rs will automatically detect and use GPTQ quantization for plain and vision models!
- The Marlin kernel will automatically be used for 4-bit and 8-bit.
mistralrs run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit
You can create your own GPTQ model using [scripts/convert_to_gptq.py](../scripts/convert_to_gptq.py):
pip install gptqmodel transformers datasets
python3 scripts/convert_to_gptq.py --src path/to/model --dst output/model/path --bits 4
Using a MLX prequantized model (on Metal)
- Provide the model ID for the MLX prequantized model
- Mistral.rs will automatically detect and use quantization for plain and vision models!
- Specialized kernels will be used to accelerate inference!
mistralrs run -m mlx-community/Llama-3.8-1B-8bit
In situ quantization
In situ quantization works by quantizing models in place, with the chief benefit being a reduced memory footprint when running the model. This enables larger models to be run on devices that could not fit the full-precision weights, and may also increase inference performance.
Quick start: Just use --isq 4 (or 2, 3, 5, 6, 8) and mistral.rs will pick the best quantization for your hardware:
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct
An API is exposed on the Python and Rust SDKs which provides the ability to dynamically re-ISQ models at runtime.
To set the ISQ type for individual layers, use a model topology.
Note: 🔥 AFQ (affine) quantization is designed to be fast on Metal but is only supported on Metal.
Automatic ISQ (just use a number!)
Instead of specifying a quantization type like Q4K, you can just pass an integer (2, 3, 4, 5, 6, or 8) and mistral.rs will automatically select the best quantization method for your platform.
On Metal, this uses fast AFQ quantization (for 2, 3, 4, 6, or 8 bits). On other platforms, it falls back to Q/K quantization.
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct
ISQ quantization types
- AFQ2 (AFQ is only available on Metal)
- AFQ3
- AFQ4
- AFQ6
- AFQ8
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q8_1 (not available on CUDA)
- Q2K
- Q3K
- Q4K
- Q5K
- Q6K
- Q8K (not available on CUDA)
- HQQ4
- HQQ8
- FP8
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct
When using ISQ, it will automatically load ISQ-able weights into CPU memory before applying ISQ. The ISQ application process moves the weights to device memory. This process is implemented to avoid memory spikes from loading the model in full precision.
For Mixture of Experts models, a method called MoQE can be applied to quantize only the MoE layers. This is configured via the ISQ “organization” parameter in all APIs. The following models support MoQE:
Accuracy
Accuracy of ISQ can be measured by the performance degradation versus the unquantized model. This is commonly measured with perplexity. Please see the perplexity example.
To improve the accuracy of a model with ISQ, use an imatrix file. These can be found online (for example, on Hugging Face), and should be passed with the --imatrix flag for plain models. This will increase the accuracy of the quantization significantly and bring the ISQ quantization up to par with the GGUF counterpart.
Check out the imatrix docs.
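For example, loading a downloaded imatrix alongside ISQ might look like the following (the .imatrix filename is a placeholder):
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct --imatrix path/to/model.imatrix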
Python Example
runner = Runner(
which=Which.Plain(
model_id="Qwen/Qwen3-0.6B",
),
in_situ_quant="4",
)
Rust Example
You can find this example here.
#![allow(unused)]
fn main() {
let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
}
Server example
mistralrs serve --port 1234 --isq 4 -m mistralai/Mistral-7B-Instruct-v0.1
Or with a specific quantization type:
mistralrs serve --port 1234 --isq Q4K -m mistralai/Mistral-7B-Instruct-v0.1
Universal Quantized File Format: UQFF
The uniquely powerful quantized file format.
- Flexible 🌀: Multiple quantization formats in one file format with one framework to run them all.
- Reliable 🔒: Compatibility ensured with embedded and checked semantic versioning information from day 1.
- Easy 🤗: Download UQFF models easily and quickly from Hugging Face, or use a local file.
- Customizable 🛠️: Make and publish your own UQFF files in minutes.
ToC
Motivation
UQFF builds on our ISQ feature by allowing serialization and deserialization for models.
While ISQ is a powerful feature enabling easy quantization of models, the key limitation has been the time required for requantization. While the process is relatively fast with parallelization and other techniques, multiple runs can make the experience slow.
Comparing UQFF to GGUF:
In contrast to GGUF, which only supports the GGUF quantizations, UQFF is designed with flexibility in mind. At its core, it extends the power and flexibility of ISQ. The ability to support multiple quantization types (more to come!) in one simple, easy-to-use file is a critical feature.
Additionally, users will no longer need to wait for GGUF support to begin using post-training quantized models. As we add new models and quantization schemes to mistral.rs, the feature set of UQFF will grow.
Support
The following quantization formats are supported in UQFF. They can, of course, be combined arbitrarily during UQFF generation or ISQ using a model topology. When loading a UQFF model, only the per-layer device mapping feature of the topology applies.
-
GGUF quantized:
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q8_1 (not available on CUDA)
- Q2K
- Q3K
- Q4K
- Q5K
- Q6K
- Q8K (not available on CUDA)
-
HQQ quantized:
- HQQ4
- HQQ8
-
FP8:
- FP8 E4M3 (4-bit exponent, 3-bit mantissa)
-
AFQ quantized (🔥 AFQ is fast on Metal):
- AFQ2
- AFQ3
- AFQ4
- AFQ6
- AFQ8
Loading a UQFF model
To load a UQFF model, specify the UQFF filename. The file is located based on the model ID and can be loaded locally or from Hugging Face. For example:
- phi3.5-mini-instruct-q4k.uqff
- ../UQFF/phi3.5-mini-instruct-q4k.uqff
You can find a collection of UQFF models here, which each include a simple command to get started.
Note: when loading a UQFF model, any ISQ setting will be ignored.
Running with the CLI
mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-f8e4m3.uqff
Using with the Rust SDK
Check out the following examples:
- Normal: uqff/main.rs
- Vision: uqff_vision/main.rs
Using the Python SDK
Modify the Which instantiation as follows:
Which.Plain(
model_id="EricB/Phi-3.5-mini-instruct-UQFF",
+ from_uqff="phi3.5-mini-instruct-q4k.uqff"
),
Using topology for device mapping with UQFF
When loading a UQFF model, the quantization is already baked in, so ISQ settings in the topology are ignored. However, device mapping from a topology file still applies. This is useful for splitting a pre-quantized model across multiple GPUs or offloading layers to CPU.
CLI example:
mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-q4k.uqff --topology device_map.yml
Topology file for device mapping only (device_map.yml):
0-16:
device: cuda[0]
16-32:
device: cuda[1]
Rust SDK example:
#![allow(unused)]
fn main() {
use mistralrs::{UqffTextModelBuilder, Topology, LayerTopology, Device};
let model = UqffTextModelBuilder::new(
"EricB/Phi-3.5-mini-instruct-UQFF",
vec!["phi3.5-mini-instruct-q4k.uqff".into()],
)
.into_inner()
.with_topology(
Topology::empty()
.with_range(0..16, LayerTopology { isq: None, device: Some(Device::Cuda(0)) })
.with_range(16..32, LayerTopology { isq: None, device: Some(Device::Cuda(1)) })
)
.build()
.await?;
}
Python SDK example:
runner = Runner(
which=Which.Plain(
model_id="EricB/Phi-3.5-mini-instruct-UQFF",
from_uqff="phi3.5-mini-instruct-q4k.uqff",
topology="device_map.yml",
),
)
Note: The isq field in topology entries is ignored when loading UQFF models since quantization is pre-applied.
Creating a UQFF model
Creating a UQFF model requires you to generate the UQFF file.
- This means specifying a local path to a file ending in .uqff, where your new UQFF model will be created.
- The quantization of a UQFF model is determined from the ISQ or model topology (see the topology docs for more details on how ISQ and the topology mix).
Along with the UQFF file, the generation process will also output several .json configuration files and residual.safetensors. All of these files are considered part of the UQFF model and should be kept together when uploading.
Note: Only the .uqff files are unique to the quantization level(s). If you are generating multiple UQFF files, it is OK for the other files to be overwritten.
After creating the UQFF file, you can upload the model to Hugging Face. To do this:
- Create a new model.
- Upload the UQFF file:
- With the web interface: guide here.
- With Git: steps here
- Locally, generate the model card file with this Python script.
- In the web interface, press the Create Model Card button and paste the generated model card.
⭐ Check out uqff_maker to make UQFF models with an easy CLI!
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq 4 -o phi3.5-mini-instruct-q4k.uqff
Upload with Git
To upload a UQFF model using Git, you will most likely need to set up Git LFS:
- Install git-lfs
- Run git lfs install
- (If the files are larger than 5GB) Run huggingface-cli lfs-enable-largefiles . (you will need to pip install huggingface_hub)
After this, you can use Git to track, commit, and push files.
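For example, a typical flow for tracking and pushing the UQFF artifacts (run inside your cloned model repository) looks roughly like:
git lfs track "*.uqff" "*.safetensors"
git add .
git commit -m "Add UQFF model files"
git push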
List of models
You can find a list of models in the Hugging Face model collection.
Have you created a UQFF model on Hugging Face? If so, please create an issue.
UQFF internal structure
The following describes the exact memory layout of UQFF tensors of version 0.1.0.
ToC
GGUF quantization
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (0) | u8 | little endian |
| Tensor data length in bytes | u32 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Quantized dtype | u32 | little endian |
| Num shape dims | u32 | little endian |
| Array quantized weight shape dims | u32 | little endian |
| Array quantized weight data | u8 | little endian |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
Unquantized layers
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (1) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Weight tensor data, see docs | See docs | See docs |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
FP8 layers
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (1) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Weight tensor data, see docs | See docs | See docs |
| Dequant W scalar | f32 | little endian |
| Dequant X scalar | f32 | little endian |
| Quant scalar | f32 | little endian |
| Quantization type | u32 | little endian |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
HQQ quantization
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (2) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Q weight, see docs | See docs | See docs |
| Array Q scale, see docs | See docs | See docs |
| Array Q zeroes, see docs | See docs | See docs |
| Dequant weight num shape dims | u32 | little endian |
| Array dequant weight shape dims | u32 | little endian |
| CFG bits | u8 | little endian |
| CFG group size | u32 | little endian |
| CFG axis | u8 | little endian |
| CFG optimization steps (0 means Option::None for now) | u32 | little endian |
| CFG round zeroes (boolean) | u8 | little endian |
| CFG channel wise (boolean) | u8 | little endian |
FP8 layers
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (3) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Weight tensor data, see docs | See docs | See docs |
| Dequant scale W | f32 | little endian |
| Dequant scale X | f32 | little endian |
| Quant scale | f32 | little endian |
| Layer dtype | u32 | little endian |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
Standard tensors
| ID | Element type | Endianness |
|---|---|---|
| Tensor data length in bytes | u32 | little endian |
| Tensor dtype | u32 | little endian |
| Num shape dims | u32 | little endian |
| Array shape dims | u32 | little endian |
| Array flattened (contiguous) tensor data | u8 | little endian |
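As an illustration only, the "Standard tensors" layout above can be read with a few unpack calls; the function below is a sketch based on the table and is not part of mistral.rs:
import struct

def read_standard_tensor_header(f):
    # Tensor data length in bytes, tensor dtype, and number of shape dims (all little-endian u32)
    data_len, dtype, n_dims = struct.unpack("<3I", f.read(12))
    # The shape dimensions follow, one u32 per dimension
    dims = list(struct.unpack(f"<{n_dims}I", f.read(4 * n_dims)))
    # The flattened (contiguous) tensor data, `data_len` bytes of u8, follows here
    return data_len, dtype, dims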
Model topology configuration
Quantization and device mapping in one file.
Note
Manual device mapping flags are deprecated in favor of automatic placement because it is easy to misconfigure them. Topology files remain the preferred way to express per-layer quantization, and you can still provide device overrides here when you truly need to. Those overrides win over the automatic mapper, so apply them sparingly. See the device mapping documentation for guidance.
Use a simple model topology to configure per-layer ISQ and device mapping with a single YAML file (examples here)!
To support per-layer mix of ISQ, Mistral.rs supports loading a model topology YAML file. This YAML file is formatted as follows:
- Top-level keys are either:
  - A range of layers (start-end), where start < end; start is inclusive and end is exclusive
  - A single layer number
- The topology for the range or layer:
  - An optional key isq, which maps to a single value that can be any ISQ type. If not specified, no ISQ is applied to this range of layers.
  - An optional key device, which maps to a single value, one of the following. If not specified, the default loading device will be used.
    - cpu
    - cuda[ORDINAL]
    - metal[ORDINAL]
Note that:
- The topology for the range is expanded to fill the range
- If ranges overlap, the range with the higher end layer takes precedence. When two ranges share the same end layer, the one that appears later in the topology file wins (see the example after this list).
- Any layers which are not covered will have no topology mapping. They will inherit any other ISQ setting (e.g. one set with --isq/in_situ_quant).
- If a layer is covered by the topology, the topology value will override any other ISQ setting (e.g. one set with --isq/in_situ_quant).
- The topology device mapping will override any other device mapping.
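For example, in the following topology the two ranges overlap on layers 4-7; per the overlap rule above, the 4-16 entry wins for those layers because it has the higher end layer, so layers 0-3 use Q4K and layers 4-15 use Q3K:
0-8:
  isq: Q4K
4-16:
  isq: Q3K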
Using topology with UQFF models
When loading a UQFF model, the quantization is already applied during UQFF creation. Therefore:
- ISQ settings in the topology are ignored - the pre-quantized weights are used as-is
- Device mapping still applies - you can split layers across GPUs or offload to CPU
This is useful for deploying pre-quantized models across multiple devices without re-quantizing.
Example topology for UQFF device mapping:
# Only device mapping is used; isq would be ignored
0-16:
device: cuda[0]
16-32:
device: cuda[1]
See the UQFF documentation for complete examples.
Regex selectors
Layer ranges are convenient when you know the numeric index, but you can also target weights by name. Keys wrapped in /.../ are interpreted as regular expressions that are matched against the fully qualified tensor name (for example, model.layers.3.attn.q_proj.weight). Regex selectors may override both isq and device.
'/attn\.q_proj$/':
isq: Q4K
'/ffn_.*\.weight$/':
isq: Q3K
Regex-based ISQ overrides are applied through the immediate ISQ system, so they quantize weights as they are loaded. Numeric layer ranges continue to be handled by the post-load topology pass. Regex selectors are evaluated top-to-bottom as they appear in the YAML file, so a selector that comes later in the file overrides earlier matches.
0-8:
isq: Q3K
device: cuda[0]
8-16:
isq: Q4K
device: cpu
16-24:
isq: Q6K
# Skip 24-28
28-32:
isq: Q8_0
device: cuda[0]
Model topologies may be applied to all model types.
CLI example
mistralrs run -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml
HTTP server example
mistralrs serve -p 1234 -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml
Rust example
Example here.
Python example
Example here.
Enhancing ISQ with an imatrix
Mistral.rs supports enhancing the performance of models quantized with ISQ by collecting an imatrix from calibration data. The following quantizations are supported with an imatrix:
Q2K, Q3K, Q4K, Q5K, Q6K
What is an imatrix? An imatrix (importance matrix) is generated from data collected during the execution of the model on calibration data. This data is used to enhance the performance of the model by enabling a weighted RMSE minimization when quantizing the tensor. For more information, see the original PR.
Using an imatrix causes the quantization process to take longer as the data must be collected, but there is no inference-time performance decrease.
Note: mistral.rs will automatically generate a .cimatrix file which can be used within mistral.rs as a replacement for a .imatrix file. The primary advantage is the in-situ generation within mistral.rs. The format is incompatible with llama.cpp.
To use this, simply specify the calibration data file in the various APIs as detailed below.
With the CLI
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct --calibration-file calibration_data/calibration_datav3_small.txt
With the Rust SDK
You can find this example here.
#![allow(unused)]
fn main() {
let model = TextModelBuilder::new("meta-llama/Llama-3.2-3B-Instruct")
.with_isq(IsqType::Q4K)
.with_calibration_file("calibration_data/calibration_datav3_small.txt".into())
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
}
With the Python SDK
You can find this example here.
runner = Runner(
which=Which.Plain(
model_id="meta-llama/Llama-3.2-3B-Instruct",
calibration_file="calibration_data/calibration_datav3_small.txt"
),
in_situ_quant="4",
)
Adapter model support
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting an XLora* architecture, and LoRA support by selecting the Lora* architecture. For both X-LoRA and LoRA, an ordering file (see this section for preparing the ordering file) must be provided. The ordering file describes the ordering of layers and which adapters to use (and what order to use them in for X-LoRA).
When using an adapter model with a quantized base model, if the ordering file specifies unsupported layers you will receive an error.
Supported X-LoRA or LoRA quantized layers
Llama architecture:
- model.layers.{layer_idx}.self_attn.q_proj
- model.layers.{layer_idx}.self_attn.k_proj
- model.layers.{layer_idx}.self_attn.v_proj
- model.layers.{layer_idx}.self_attn.o_proj
- model.layers.{layer_idx}.mlp.up_proj
- model.layers.{layer_idx}.mlp.down_proj
- model.layers.{layer_idx}.mlp.gate_proj
- lm_head
Phi 3 architecture:
- model.layers.{layer_idx}.self_attn.qkv_proj
- model.layers.{layer_idx}.self_attn.o_proj
- model.layers.{layer_idx}.mlp.gate_up_proj
- model.layers.{layer_idx}.mlp.down_proj
- lm_head
Adapter ordering file
Preparing the X-LoRA/LoRA Ordering File
The X-LoRA/LoRA ordering file must be prepared before running inference with an X-LoRA model. However, it is easy to create with the provided scripts!
X-LoRA case
An ordering JSON file for X-LoRA contains 2 major parts.
- The adapter names
order- The order matters!
- Should be an array of strings which are the adapter names corresponding to the order the adapters were specified during training. For example, if the adapters were specified as a dictionary:
- The layer ordering
layers- Automatically generated and should not be manipulated as it controls the application of scalings.
adapters = {
"math": ...,
"reasoning": ...,
"biology": ...
}
The specified order would be ["math", "reasoning", "biology"].
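For illustration, the corresponding entry in the ordering JSON would then look like the snippet below (the other fields are placeholders; the full shape is shown in the preload_adapters example later in this document):
{
  "order": ["math", "reasoning", "biology"],
  "layers": { "...": "..." },
  "base_model_id": "..."
}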
We provide an ordering file which contains the ordering for the X-LoRA model associated with the paper and the Huggingface repository: https://huggingface.co/lamm-mit/x-lora.
LoRA case
An ordering JSON file for LoRA contains 2 major parts:
- The adapter names
order(optional):- The order does not matter
- Controls which adapters will be initially activated
- If this key is not specified, then no adapters will be activated initially
- Preload adapter section
preload_adapters(optional): see this section- Order does not matter
- Specifies the adapter name and the model ID to find them, which may be a local path.
Preparing the ordering file (LoRA or X-LoRA cases)
There are 2 scripts to prepare the ordering file and which work for both X-LoRA and LoRA. The ordering file is specific to each architecture and set of target modules. Therefore, if either are changed, it is necessary to create a new ordering file using the first option. If only the adapter order or adapters changed, then the second option should be used.
-
From scratch: No ordering file for the architecture and target modules
A script
create_ordering.pyis provided which prompts the user for the model ID, target modules, and adapter names. The user is prompted for an output file location, relative to the working directory. -
Create a new ordering file from an existing ordering file for an architecture and target modules
A script
set_names.pyis provided which prompts the user for the adapter names and the old ordering file. The user is prompted for an output file location, relative to the working directory.
Quantized X-LoRA or LoRA models
Mistral.rs supports running quantized models with X-LoRA or LoRA. The X-LoRA or LoRA adapter layers will not be quantized; only the base model is.
In the X-LoRA case, please note that using aggressive quantization (e.g., 4-bit) can distort the signal and prevent the classifier from acting properly. Therefore, it is better to use a milder quantization level such as 8-bit.
Avoiding the scaling pass with non-granular scalings
The X-LoRA implementation supports non-granular scalings. This caches the scalings after k completion tokens are generated and they will be used for the remaining passes avoiding the scaling pass. The number of tokens to generate before caching is defined by setting tgt_non_granular_index. Setting tgt_non_granular_index will restrict the maximum running sequences to 1.
Please see this page for more details and examples.
Adapter model dynamic adapter activation
We support dynamic adapter activation for LoRA models, allowing you to activate a set of adapters at runtime. There is a Python, Rust and HTTP API:
To use this feature, you should add a preload_adapters key to your ordering file:
{
"order": ["..."],
"layers": {"...": "123"},
"base_model_id": "...",
+ "preload_adapters": [{"name": "...", "adapter_model_id": "..."}] # New field here
}
This allows mistral.rs to preload the adapter and enable runtime activation.
Examples of LoRA and X-LoRA models
- X-LoRA with no quantization
To start an X-LoRA server exactly as presented in the paper:
mistralrs serve -p 1234 --xlora lamm-mit/x-lora --xlora-order orderings/xlora-paper-ordering.json
- LoRA with a model from GGUF
To start a LoRA server with adapters from the X-LoRA paper (you should modify the ordering file to use only one adapter, as the adapter static scalings are all 1 and so the signal will become distorted):
mistralrs serve -p 1234 --format gguf -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q8_0.gguf --lora lamm-mit/x-lora
Normally with a LoRA model you would use a custom ordering file. However, for this example we use the ordering from the X-LoRA paper because we are using the adapters from the X-LoRA paper.
X-LoRA non-granular scalings
A key limitation of the X-LoRA architecture is the need for 2 forward passes of the model per generation step. To trade off model performance for speed, mistral.rs allows the user to reduce the granularity of the scalings by caching them in a technique we call Non Granular Scalings.
How it works
For the first $k$ generation steps, the scalings are calculated normally for each token. However, for the rest of the tokens, it is cached and re-used. In this way, we are able to avoid the second forward pass and the performance is increased significantly. To maintain correctness, enabling non-granular scalings will restrict the engine to processing one sequence at a time.
How to use it
Command line
This can be enabled by passing --tgt-non-granular-index followed by $k$:
mistralrs serve -p 1234 --xlora lamm-mit/x-lora --xlora-order orderings/xlora-paper-ordering.json --tgt-non-granular-index 5
Python
Set the tgt_non_granular_index attribute to a non-None value in the Which selection:
from mistralrs import Runner, Which
runner = Runner(
which=Which.XLoraGGUF(
tok_model_id=None, # Automatically determine from ordering file
quantized_model_id="TheBloke/zephyr-7B-beta-GGUF",
quantized_filename="zephyr-7b-beta.Q4_0.gguf",
xlora_model_id="lamm-mit/x-lora",
order="orderings/xlora-paper-ordering.json",
tgt_non_granular_index=5,
)
)
...
Build a memory-efficient MoE model from anything, in seconds
AnyMoE is a technique for dynamically and efficiently creating MoE models. By providing a set of experts and a small pretraining dataset, you can create an MoE model locally!
It has the following features:
- Apply AnyMoE to any supported model
- Works with both plain and vision-plain model types
- Specify the layers to apply AnyMoE to for efficient training
Paper: https://arxiv.org/abs/2405.19076
https://github.com/EricLBuehler/mistral.rs/assets/65165915/33593903-d907-4c08-a0ac-d349d7bf33de
Note: By default, this has the capability to create a CSV loss log/image. When building from source (for Python or CLI), you may pass --no-default-features on the command line to disable this. This may be necessary if networking is unavailable.
Dataset
Currently, AnyMoE expects a JSON dataset with one top-level key, rows, which is an array of objects with keys prompt (string), expert (integer), and image_urls (optional array of strings). For example:
{
"rows": [
{
"prompt": "Discuss the impact of Renaissance art on modern aesthetics",
"expert": 0
},
{
"prompt": "Explain the significance of the theory of relativity in modern physics",
"expert": 1
}
]
}
For a vision model, image_urls may contain an array of image URLs/local paths or Base64 encoded images.
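As a concrete illustration, the snippet below writes a tiny dataset in this layout (a minimal sketch; the prompts and the output path are arbitrary):
import json
rows = [
    {"prompt": "Discuss the impact of Renaissance art on modern aesthetics", "expert": 0},
    {"prompt": "Explain the significance of the theory of relativity in modern physics", "expert": 1},
    # Vision models may additionally set "image_urls" per row, e.g.:
    # {"prompt": "Describe this image", "expert": 0, "image_urls": ["https://example.com/cat.png"]},
]
with open("examples/amoe.json", "w") as f:
    json.dump({"rows": rows}, f, indent=2)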
Experts
AnyMoE experts can be either fine-tuned models or LoRA adapter models. Only the mlp layers will be loaded from each. The experts must be homogeneous: all fine-tuned or all adapters. Additionally, you can specify which layers AnyMoE is applied to.
Note: When using LoRA adapter experts, it may not be necessary to set the layers where AnyMoE will be applied due to the lower memory usage.
Example of TOML selector with fine-tuned experts
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"
[anymoe]
dataset_json = "examples/amoe.json"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"
Example of TOML selector with LoRA adapter experts
[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"
[anymoe]
dataset_json = "examples/amoe.json"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]
Examples
CLI
CLI usage is via the TOML selector where you can also find docs on the required fields.
For example, to use the demo fine-tuned expert:
mistralrs from-config --file toml-selectors/anymoe.toml
To use the demo LoRA expert:
mistralrs from-config --file toml-selectors/anymoe_lora.toml
Python example
from mistralrs import (
Runner,
Which,
ChatCompletionRequest,
Architecture,
AnyMoeConfig,
AnyMoeExpertType,
)
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
anymoe_config=AnyMoeConfig(
hidden_size=4096,
dataset_json="examples/amoe.json",
prefix="model.layers",
mlp="mlp",
expert_type=AnyMoeExpertType.FineTuned(),
lr=1e-3,
epochs=100,
batch_size=4,
model_ids=["HuggingFaceH4/zephyr-7b-beta"],
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
AnyMoeConfig, AnyMoeExpertType, AnyMoeModelBuilder, IsqType, PagedAttentionMetaBuilder,
TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let text_builder = TextModelBuilder::new("mistralai/Mistral-7B-Instruct-v0.1")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?;
let model = AnyMoeModelBuilder::from_text_builder(
text_builder,
AnyMoeConfig {
hidden_size: 4096,
lr: 1e-3,
epochs: 100,
batch_size: 4,
expert_type: AnyMoeExpertType::LoraAdapter {
rank: 64,
alpha: 16.,
target_modules: vec!["gate_proj".to_string()],
},
gate_model_id: None, // Set this to Some("path/to/model/id") for the pretrained gating model id
training: true,
loss_csv_path: None,
},
"model.layers",
"mlp",
"examples/amoe.json",
vec!["HuggingFaceH4/zephyr-7b-beta"],
vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
)
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Matformer (Matryoshka Transformer) Support
Matformer allows you to dynamically resize transformer models at runtime, trading compute/memory for quality. This enables deploying the same model across devices with different resource constraints - from edge devices to powerful GPUs.
Quick Start
Command Line
# Run Gemma 3n with the E2.49B configuration (2.49B params instead of 3.98B)
mistralrs run -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"
Python
from mistralrs import Runner, Which, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
matformer_config_path="matformer_configs/gemma3n.csv",
matformer_slice_name="Config for E2.49B (block-level)",
),
)
Rust
#![allow(unused)]
fn main() {
use mistralrs::VisionModelBuilder;
use std::path::PathBuf;
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
.with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
.build()
.await?;
}
How It Works
Matformer models are pre-trained with a special architecture that allows certain layers to be skipped at inference time while maintaining reasonable quality. When you select a “slice”:
- Layer Skipping: Specified layers are completely removed from computation
- FFN Resizing: Feed-forward network dimensions can be adjusted per layer
- Automatic Remapping: Remaining layers are renumbered sequentially
For example, the Gemma 3n E2.49B (block-level) slice:
- Keeps all 35 layers (no layer skipping)
- Uses mixed FFN dimensions: 8192 for layers 0-19, 16384 for layers 20-24, 8192 for layers 25-34
- Cuts parameters from 3.98B to 2.49B (~37% reduction)
- Maintains ~87% of the full model’s quality
Configuration Files
Matformer configurations are CSV files with these columns:
name,# Layers,# Effective Params (B),MMLU PT accuracy,FFN Hidden Dims,Layers Skipped
Main model,35,3.98,62.30%,"[16384, 16384, ...]",
Config for E2.49B (block-level),35,2.49,54.50%,"[8192, 8192, ..., 16384, 16384, ..., 8192, 8192, ...]",
- name: Slice identifier used in matformer_slice_name
- # Layers: Number of active layers after skipping
- # Effective Params (B): Approximate parameter count in billions
- MMLU PT accuracy: Benchmark score (informational)
- FFN Hidden Dims: List of FFN dimensions for each layer
- Layers Skipped: Which layers to remove (0-indexed)
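To see how such a file can be consumed, here is a minimal sketch that reads a configuration CSV and prints one slice. The path and slice name mirror the examples above; a real file must contain the full FFN dimension lists rather than "...":
import ast
import csv

slice_name = "Config for E2.49B (block-level)"
with open("matformer_configs/gemma3n.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["name"] != slice_name:
            continue
        ffn_dims = ast.literal_eval(row["FFN Hidden Dims"])
        skipped = ast.literal_eval(row["Layers Skipped"]) if row["Layers Skipped"].strip() else []
        print(f"{row['# Layers']} layers, {row['# Effective Params (B)']}B params, "
              f"{len(ffn_dims)} FFN dims, skipped layers: {skipped}")
        break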
Supported Models
Currently supported:
- Gemma 3n (google/gemma-3n-E4B-it) - Multimodal model with vision and audio
See matformer_configs/ for available configurations.
Performance Guide
Memory Usage
Memory scales approximately with parameter count:
- Full model (3.98B): ~8GB VRAM
- E2.49B slice: ~5GB VRAM
- E2B slice (1.91B): ~4GB VRAM
- Smaller slices: Proportionally less
Inference Speed
Speed improvement is roughly linear with layer count:
- 30 layers vs 35 layers = ~14% faster
- 20 layers vs 35 layers = ~43% faster
Quality Trade-offs
Example accuracy on MMLU benchmark:
- Full model: 62.3%
- E2.98B: 59.5% (-4.5%)
- E2.49B: 54.5% (-12.5%)
- E2B: 50.9% (-18.3%)
Choose based on your requirements:
- Maximum quality: Use full model (omit matformer args)
- Balanced: E2.49B to E2.98B configurations (block-level configs recommended)
- Resource-constrained: E2B configuration (1.91B params)
- Extreme efficiency: E1.96B configuration
Advanced Usage
With Quantization
Combine Matformer with ISQ for maximum efficiency:
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
matformer_config_path="matformer_configs/gemma3n.csv",
matformer_slice_name="Config for E2.49B (block-level)",
),
in_situ_quant="Q4K" # 4-bit quantization
)
With Device Mapping
Matformer works seamlessly with automatic device mapping:
#![allow(unused)]
fn main() {
use mistralrs::{VisionModelBuilder, DeviceMapSetting, AutoDeviceMapParams};
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
.with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
.with_device_mapping(DeviceMapSetting::Auto(
AutoDeviceMapParams::default_vision()
))
.build()
.await?;
}
Only active layers are loaded to GPU, saving memory.
Creating Custom Configurations
To create your own Matformer configuration:
- Start with the full model as baseline
- Identify skippable layers:
- Middle layers (10-30) are often good candidates
- Avoid early layers (feature extraction) and late layers (final representations)
- Never skip special layers (KV-sharing, attention patterns)
- Test quality degradation at each configuration
- Create CSV file with your configurations
Example minimal configuration:
name,# Layers,# Effective Params (B),FFN Hidden Dims,Layers Skipped
Tiny,15,0.8,"[4096, 4096, ...]","[5,6,7,10,11,12,15,16,17,20,21,22,25,26,27,30,31,32,33,34]"
API Reference
Command Line Arguments
- --matformer-config-path PATH: Path to the CSV configuration file
- --matformer-slice-name NAME: Exact name of the slice from the CSV
Python Parameters
Which.VisionPlain(
model_id: str,
arch: VisionArchitecture,
matformer_config_path: str = None, # Path to CSV
matformer_slice_name: str = None, # Slice name
# ... other parameters
)
Rust Methods
#![allow(unused)]
fn main() {
// For VisionModelBuilder
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)
// For TextModelBuilder (when supported)
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)
}
Troubleshooting
Common Issues
“Matformer slice ‘X’ not found”
- Check slice name matches exactly (case-sensitive)
- Verify CSV file path is correct
“Layers X and Y are reserved and cannot be skipped”
- Some models have special layers that must not be skipped
- Try different layer combinations
Memory not reduced as expected
- Ensure you’re using the slice (check logs)
- Skipped layers still need to be loaded initially
- Consider combining with quantization
Debugging
Enable logging to see Matformer details:
RUST_LOG=mistralrs_core=info mistralrs ...
This shows:
- Configuration file loaded
- Selected slice details
- Layers being skipped
- Final layer count
Future Plans
- Support for more model architectures
- Dynamic slice switching during runtime
- Automatic slice selection based on available resources
- Fine-tuning tools for creating new Matformer models
Device mapping
In mistral.rs, device mapping is automatically managed to be as performant and easy as possible. Automatic device mapping is enabled by default in the CLI/server and Python SDK and does not make any changes when the model fits entirely on the GPU.
Note
If your system has more than one CUDA device, mistral.rs will automatically use tensor parallelism. If the model does not completely fit on the available GPUs, or you wish to use automatic device mapping, you can disable tensor parallelism by setting
MISTRALRS_NO_NCCL=1.
Automatic device mapping works by prioritizing loading the model into GPU memory; any remaining parts are loaded into CPU memory. Components that benefit greatly from GPU acceleration, such as the vision parts of vision models, are automatically prioritized to stay on the GPU.
To control the mapping across devices, you can set the following maximums describing the largest prompt the model should expect.
- maximum sequence length (default: 4096)
- maximum batch size (default: 1)
- (vision models) maximum image length (length refers to the edge length) (default: 1024)
- (vision models) maximum number of images (default: 1)
These parameters are not hard limits at runtime; they only control the mapping.
Note
The maximum sequence length is also used to ensure that a KV cache will fit, both with and without PagedAttention.
Examples
- Python
- Text models text_auto_device_map.py
- Vision models vision_auto_device_map.py
- Rust
- Text models text_auto_device_map/main.rs
- Vision models vision_auto_device_map/main.rs
- Server
- Text models:
mistralrs run --isq 4 -m meta-llama/Llama-3.3-70B-Instruct --max-seq-len 4096 --max-batch-size 2
- Vision models:
mistralrs run --isq 4 -m meta-llama/Llama-3.2-11B-Vision-Instruct --max-seq-len 4096 --max-batch-size 2 --max-num-images 2 --max-image-length 1024
If you want to manually device map the model (not recommended), please continue reading.
Note
Manual device mapping is deprecated in favor of automatic device mapping because manual configuration is error-prone.
Manual device mapping
There are 2 ways to do device mapping:
- Specify the number of layers to put on the GPU - this uses the GPU with ordinal 0.
- Specify the ordinals and number of layers - this allows for cross-GPU device mapping.
The format for the ordinals and number of layers is ORD:NUM;... where ORD is the unique ordinal and NUM is the number of layers for that GPU. This may be repeated as many times as necessary.
Note: We refer to GPU layers as “device layers” throughout mistral.rs.
Example of specifying ordinals
mistralrs run -n "0:16;1:16" -m gradientai/Llama-3-8B-Instruct-262k
Note: In the Python SDK, the “0:16;1:16” string is passed as the list
["0:16", "1:16"].
Example of specifying the number of GPU layers
mistralrs run -n 16 -m gradientai/Llama-3-8B-Instruct-262k
PagedAttention in mistral.rs
Mistral.rs supports PagedAttention (paper here) to accelerate both normal inference and batched inference on:
- CUDA (Unix-like platforms such as WSL, Linux)
- Metal
Our PagedAttention implementation has 2 inputs: GPU KV cache memory size, and block size. This enables fine-grained control over the available context length by configuring the available memory for the KV cache. When using a CUDA device, PagedAttention is activated by default but can be disabled with no_paged_attn for Python or no-paged-attn for the CLI tools.
KV Cache Quantization
PagedAttention now supports KV cache quantization to reduce memory usage and potentially improve performance. The KV cache can be quantized to FP8 (F8E4M3 format) instead of using the model’s native dtype, significantly reducing memory requirements while maintaining model quality.
Available cache types:
- auto (default): Uses the model's native dtype for the KV cache
- f8e4m3: Quantizes the KV cache to 8-bit floating point (E4M3 format)
When using FP8 quantization, the memory usage for KV cache is approximately halved compared to FP16, allowing for longer context lengths with the same GPU memory allocation.
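As a rough back-of-the-envelope sketch (the model shape below, 32 layers with 8 KV heads of dimension 128, is illustrative only):
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x accounts for the separate key and value caches.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 32_768
fp16 = kv_cache_bytes(seq_len, bytes_per_elem=2)  # model-native f16 cache
fp8 = kv_cache_bytes(seq_len, bytes_per_elem=1)   # f8e4m3 cache
print(f"FP16: {fp16 / 1e9:.1f} GB, FP8: {fp8 / 1e9:.1f} GB")  # ~4.3 GB vs ~2.1 GB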
Note: The default block size if not specified is 32.
Note: if OOM occurs (this can be caused by a variety of factors including adapter activation, re-ISQ, and others), it is likely because the PagedAttention KV cache has already been allocated. To counter this, either set the KV cache memory to a lower amount or usage percentage (recommended) or disable paged attention entirely for a dynamically allocated cache.
Note: Paged Attention is not enabled on Windows platforms, only Unix-based platforms.
Note: In the CLI and Python SDK, Paged Attention is disabled by default for Metal. It can be enabled with the
--paged-attn/paged_attn flags.
There are more features being added to this:
- GGML model support
- Adapter model support
- Speculative decoding
Prefix caching is now supported with PagedAttention. PagedAttention can leverage the prefix cacher to cache KV prefix states across iterations for faster multi-turn inference.
Block-Level Prefix Caching
Prefix caching is a technique to reuse computed KV cache blocks across requests that share common prefixes (like system prompts). This can significantly speed up inference when multiple requests use the same prefix.
How It Works
-
Block Hashing: Each block of tokens is assigned a unique hash based on its contents and the hash of its parent block:
hash(block) = hash(parent_hash, block_tokens)
This creates a hash chain that uniquely identifies any prefix sequence (see the sketch after this list).
-
Cache Lookup: When allocating blocks for a new request, the scheduler checks if any full blocks match existing cached blocks by comparing hashes.
-
Block Reuse: Matched blocks are reused directly - their pre-computed KV cache values are used without recomputation. Only the non-matching suffix tokens need to be processed.
-
LRU Eviction: When memory is needed, least recently used cached blocks are evicted first.
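The sketch below illustrates the hash-chain idea in Python. It is illustrative only, not the actual mistral.rs hashing scheme, and the block size is an example value:
from hashlib import sha256

BLOCK_SIZE = 32  # tokens per block (example value)

def block_hashes(token_ids):
    """Hash each full block, chaining in the parent block's hash."""
    hashes, parent = [], ""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = sha256((parent + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(parent)
    return hashes  # partial trailing blocks are not hashed or cached

# Two requests sharing the same system prompt produce identical leading hashes,
# so the scheduler can match and reuse those cached blocks.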
Benefits
- Multi-turn conversations: System prompts and conversation history are cached and reused
- Batched requests: Multiple requests with shared prefixes (e.g., same system prompt) benefit from caching
- Reduced TTFT: Time-to-first-token is reduced by skipping prefix computation
How It’s Enabled
Prefix caching is enabled by default when using PagedAttention and controlled by the same prefix_cache_n setting that controls the sequence-level prefix cacher:
- CLI: --prefix-cache-n <N> (default 16). Set to 0 to disable prefix caching.
- Python SDK: prefix_cache_n=<N> (default 16). Set to None or 0 to disable.
- Rust SDK: .with_prefix_cache_n(Some(N)) (default 16). Pass None to disable.
Important: The two prefix caching systems are mutually exclusive:
- PagedAttention uses block-level prefix caching (handled by PrefixCacher in BlockEngine)
- Non-PagedAttention uses sequence-level prefix caching (handled by PrefixCacheManagerV2)
The prefix_cache_n setting controls both systems, but only one is active depending on whether PagedAttention is enabled. You’ll see one of these log messages at startup indicating which system is active:
- Prefix caching enabled (block-level, PagedAttention).
- Prefix caching enabled (sequence-level, non-paged attention).
Implementation Details
The prefix cache operates at the block level (not token level) for efficiency:
-
Full blocks only: Only complete blocks (block_size tokens) are cached. Partial blocks at the end of a sequence are not cached.
-
Hash chain: The hash for each block depends on all preceding blocks, ensuring the entire prefix matches.
-
Copy-on-Write: Cached blocks use reference counting. When a cached block needs modification, it’s copied first (CoW).
-
Memory management: The cache uses LRU eviction when allocating new blocks. Evicted blocks are returned to the free pool.
Performance Considerations
- Block size affects cache granularity: larger blocks = fewer cache entries but coarser matching
- Cache hit rate improves with more repeated prefixes
- Memory overhead is minimal (just hash-to-block mappings)
Supported models:
- Normal models
- GGUF models
- Vision models
Note: Prefix caching is supported when using PagedAttention. Configure the number of sequences to cache on the device with:
- CLI: --prefix-cache-n <N> (default 16)
- Python SDK: prefix_cache_n=<N> (default 16)
- Rust SDK: .with_prefix_cache_n(Some(N)) (default 16)
FlashAttention V2/V3 + PagedAttention in mistral.rs
If mistral.rs is compiled with FlashAttention and PagedAttention is enabled, then FlashAttention will be used in tandem to accelerate the prefill phase.
Using the CLI
Add the --pa-memory-mb/--pa-memory-fraction and --pa-block-size parameters before the model kind selector. The GPU memory is given in MB (or as a fraction of total memory), and the block size is the number of tokens per block. These parameters may be passed on any supported model type.
To enable KV cache quantization, use the --pa-cache-type parameter with either auto (default) or f8e4m3.
mistralrs run --pa-memory-mb 8192 --pa-block-size 32 --isq 4 -m microsoft/Phi-3-mini-128k-instruct
mistralrs run --pa-memory-fraction 0.95 --pa-block-size 32 --format gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Example with FP8 KV cache quantization:
mistralrs run --paged-attn on --pa-memory-mb 4096 --pa-block-size 32 --pa-cache-type f8e4m3 -m microsoft/Phi-3-mini-128k-instruct
Using the Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| {
PagedAttentionMetaBuilder::default()
.with_block_size(32)
.with_gpu_memory(MemoryGpuConfig::ContextSize(1024))
.build()
})?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Example with FP8 KV cache quantization:
use anyhow::Result;
use mistralrs::{
IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, PagedCacheType,
TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| {
PagedAttentionMetaBuilder::default()
.with_block_size(32)
.with_gpu_memory(MemoryGpuConfig::ContextSize(1024))
.with_cache_type(PagedCacheType::F8E4M3)
.build()
})?
.build()
.await?;
// ... rest of the code remains the same
}
Using the Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
pa_gpu_mem = 4096,
pa_blk_size = 32,
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Example with FP8 KV cache quantization:
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture, PagedCacheType
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
pa_gpu_mem = 4096,
pa_blk_size = 32,
pa_cache_type = PagedCacheType.F8E4M3,
)
# ... rest of the code remains the same
Speculative Decoding
Speculative decoding is an inference acceleration technique that uses a smaller “draft” model to propose tokens, which are then validated in parallel by the larger “target” model. This can significantly speed up generation when the draft model frequently predicts tokens the target model would also choose.
Mistral.rs implements speculative decoding based on the paper: Fast Inference from Transformers via Speculative Decoding.
How It Works
- The draft model generates gamma candidate tokens autoregressively
- The target model evaluates all candidate tokens in a single forward pass
- Using rejection sampling, tokens are accepted or rejected:
  - Accept if the target model's probability >= the draft model's probability
  - Otherwise, accept with probability p_target(x) / p_draft(x)
  - If rejected, sample from the normalized difference distribution
This approach guarantees the same output distribution as running the target model alone, while often achieving significant speedups.
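A minimal sketch of the accept/reject rule for a single drafted token (illustrative Python, not the engine's implementation):
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """p_target and p_draft are the two models' probabilities for the drafted token."""
    if p_target >= p_draft:
        return True
    # Otherwise accept with probability p_target / p_draft.
    return random.random() < p_target / p_draft

# On rejection, a replacement token is sampled from the normalized
# max(0, p_target(x) - p_draft(x)) distribution and speculation resumes from there.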
Configuration
The key parameter is gamma - the number of draft tokens to generate per speculation step. Higher values can increase throughput when the draft model is accurate, but waste computation when predictions are frequently rejected.
Recommended values: Start with gamma = 12-32 and tune based on your models and workload.
Requirements
- Same tokenizer: Both target and draft models must share the same tokenizer vocabulary
- Same model category: Both must be the same type (e.g., both text models or both vision models)
- KV cache enabled: Both models must have KV caching enabled (default behavior)
Limitations
Note: PagedAttention is not currently supported with speculative decoding.
Note: Prefix caching is not supported with speculative decoding.
Note: Hybrid KV caches are not supported with speculative decoding.
Using TOML Configuration
The recommended way to configure speculative decoding is via TOML. Create a config file (e.g., speculative.toml):
[model]
model_id = "meta-llama/Llama-3.1-8B-Instruct"
[speculative]
gamma = 12
[speculative.draft_model]
model_id = "meta-llama/Llama-3.2-1B-Instruct"
Then run with:
mistralrs run --from-toml speculative.toml
The draft model can use any supported format (Plain, GGUF, etc.) and can have different quantization than the target model.
TOML with GGUF Draft Model
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
[speculative]
gamma = 16
[speculative.draft_model]
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
model_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
TOML with ISQ Quantization
[model]
model_id = "meta-llama/Llama-3.1-8B-Instruct"
[speculative]
gamma = 16
[speculative.draft_model]
model_id = "meta-llama/Llama-3.2-1B-Instruct"
isq = "Q8_0"
Using the Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
which_draft=Which.GGUF(
tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
),
speculative_gamma=32,
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Python SDK Parameters
| Parameter | Type | Description |
|---|---|---|
| which_draft | Which | Draft model specification (Plain, GGUF, etc.) |
| speculative_gamma | int | Number of draft tokens per step (default: 32) |
Using the Rust SDK
You can find this example at mistralrs/examples/speculative/main.rs.
use anyhow::Result;
use mistralrs::{
IsqType, RequestBuilder, SpeculativeConfig, TextMessageRole, TextMessages,
TextModelBuilder, TextSpeculativeBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let target = TextModelBuilder::new("meta-llama/Llama-3.1-8B-Instruct")
.with_logging();
let draft = TextModelBuilder::new("meta-llama/Llama-3.2-1B-Instruct")
.with_logging()
.with_isq(IsqType::Q8_0);
let spec_cfg = SpeculativeConfig { gamma: 16 };
let model = TextSpeculativeBuilder::new(target, draft, spec_cfg)?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Choosing Draft and Target Models
For best performance:
- Use the same model family - Draft models from the same family as the target (e.g., Llama 3.2-1B with Llama 3.1-8B) typically have higher acceptance rates
- Smaller is better for draft - The draft model should be significantly smaller than the target for meaningful speedup
- Quantize the draft model - Using ISQ or GGUF quantization on the draft model reduces memory and improves draft generation speed
- Tune gamma - Monitor acceptance rates and adjust gamma accordingly
Example Model Pairings
| Target Model | Draft Model | Notes |
|---|---|---|
| Llama 3.1-8B | Llama 3.2-1B | Same family, good acceptance |
| Llama 3.1-70B | Llama 3.1-8B | Large speedup potential |
| Mistral-7B | Mistral-7B (Q4_K_M GGUF) | Same model, quantized draft |
Performance Considerations
- Acceptance rate: Higher acceptance rates lead to better speedups. Monitor your logs for rejection statistics.
- Draft model overhead: If the draft model is too large relative to the target, the overhead may negate speedup benefits.
- Batch size: Speculative decoding is most beneficial for single-request scenarios. For high-throughput batch inference, standard decoding may be more efficient.
- Memory usage: Both models must fit in memory simultaneously. Consider quantizing one or both models.
Combining with Other Features
Speculative decoding can be combined with:
- ISQ quantization - Quantize target, draft, or both models
- X-LoRA adapters - Use adapters on the target model
- Device mapping - Distribute models across multiple GPUs
See examples/python/speculative_xlora.py for an example combining speculative decoding with X-LoRA.
FlashAttention in mistral.rs
Mistral.rs supports FlashAttention V2 and V3 on CUDA devices (V3 is only supported when CC >= 9.0).
Note: If compiled with FlashAttention and PagedAttention is enabled, then FlashAttention will be used in tandem to accelerate the prefill phase.
GPU Architecture Compatibility
| Architecture | Compute Capability | Example GPUs | Feature Flag |
|---|---|---|---|
| Ampere | 8.0, 8.6 | RTX 30*, A100, A40 | --features flash-attn |
| Ada Lovelace | 8.9 | RTX 40*, L40S | --features flash-attn |
| Hopper | 9.0 | H100, H800 | --features flash-attn-v3 |
| Blackwell | 10.0, 12.0 | RTX 50* | --features flash-attn |
Note: FlashAttention V2 and V3 are mutually exclusive.
Note: To use FlashAttention in the Python SDK, compile from source.
Multi-head Latent Attention (MLA) in mistral.rs
Multi-head Latent Attention (MLA) is an efficient attention mechanism that reduces KV cache memory usage by compressing key-value states into a low-rank latent space. This technique was introduced in DeepSeek V2 and is also used in DeepSeek V3 and GLM-4.7-Flash models.
How It Works
MLA compresses the key-value cache by:
- Projecting KV states into a compact latent representation (
kv_lora_rankdimensions) - Storing only the compressed latent vectors and rotary position embeddings in the KV cache
- Reconstructing full KV states on-the-fly during attention computation
This results in significant memory savings compared to standard multi-head attention, enabling longer context lengths with the same GPU memory.
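A rough sketch of the per-token, per-layer cache footprint (the shapes are illustrative, loosely following the DeepSeek V3 dimensions listed below):
# Standard multi-head attention caches full K and V states per head;
# MLA caches only the compressed latent plus the rotary (RoPE) component.
n_heads, head_dim = 128, 128          # illustrative full-attention shape
kv_lora_rank, kpe_head_dim = 512, 64  # illustrative MLA latent shape

standard_elems = 2 * n_heads * head_dim   # K and V
mla_elems = kv_lora_rank + kpe_head_dim   # compressed latent + RoPE part
print(f"standard: {standard_elems}, MLA: {mla_elems} "
      f"(~{standard_elems / mla_elems:.0f}x fewer cached elements)")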
Supported Models
MLA is automatically enabled for the following model architectures when using PagedAttention on CUDA:
| Model | Architecture | MLA Dimensions |
|---|---|---|
| DeepSeek V2 | deepseekv2 | kv_lora_rank varies |
| DeepSeek V3 | deepseekv3 | kv_lora_rank=512, kpe_head_dim=64 |
| GLM-4.7-Flash | glm4moelite | kv_lora_rank=512, kpe_head_dim=64 |
Requirements
MLA decode optimization requires:
- CUDA on Unix-like platforms (Linux, WSL)
- PagedAttention enabled
- Compatible model architecture (see table above)
When these conditions are met, MLA is automatically used during the decode phase for optimal performance.
Performance Benefits
MLA provides two key optimizations:
-
Reduced KV Cache Memory: The compressed latent representation uses significantly less memory than full key-value states, allowing for:
- Longer context lengths
- Larger batch sizes
- More efficient memory utilization
-
Optimized Decode Kernels: Custom FlashInfer-based MLA kernels accelerate single-token generation by:
- Operating directly on compressed latent states
- Avoiding repeated KV decompression
- Leveraging efficient memory access patterns
Disabling MLA
If you encounter issues or want to compare performance, you can disable MLA by setting the environment variable:
MISTRALRS_NO_MLA=1 mistralrs ...
When disabled, the model falls back to standard PagedAttention with full KV cache storage.
Technical Details
KV Cache Layout
When MLA is enabled, PagedAttention uses a specialized cache layout:
- Key cache: Stores compressed latent vectors (
kv_lora_rankdimensions) + rotary position embeddings (kpe_head_dimdimensions) - Value cache: Shares the same block structure for efficient memory management
Decode Path
During single-token generation (decode phase):
- Query is projected to latent space
- Attention is computed directly on compressed KV states using FlashInfer MLA kernels
- Output is projected back from latent space
Prefill Path
During prompt processing (prefill phase):
- Full KV states are computed for the current chunk
- Compressed latents are stored in the PagedAttention cache
- For prefix-cached sequences, latents are retrieved and decompressed as needed
See Also
- PagedAttention - Required for MLA optimization
- FlashAttention - Accelerates prefill phase
- DeepSeek V2 - Model documentation
- DeepSeek V3 - Model documentation
- GLM-4.7-Flash - Model documentation
Distributed inference in mistral.rs
Mistral.rs supports distributed inference with a few strategies:
- NCCL (recommended for CUDA)
- Ring backend (supported on all devices)
What backend is best?
- For a CUDA-only system: NCCL
- Anything else: Ring backend
The Ring backend is also heterogeneous! This means that you can use the Ring backend across any set of devices connected over TCP. For example, you can connect 2 Metal systems, or 2 Metal systems and 1 CPU system, with the Ring backend!
NCCL in mistral.rs
Mistral.rs supports distributed inference on CUDA with Tensor Parallelism via NCCL.
Note: Multi-node support is coming! Distributed inference on Apple hardware is also being investigated.
Tensor Parallelism (TP) is automatically used to accelerate distributed inference when more than one CUDA GPU is detected. The tensor-parallel size is always automatically set to the total number of GPUs.
TP splits the model into shards and benefits from fast single-node interconnects like NVLink. Both normal and vision models support tensor parallelism.
Important: The world size (total number of GPUs) must be a power of 2 (e.g., 1, 2, 4, 8, 16, 32, etc.). This is a requirement for optimal performance and correct operation of the distributed algorithms.
Note: In mistral.rs, if NCCL is enabled, then automatic device mapping will not be used.
Important: To build for NCCL, be sure to add the nccl feature flag (for example: --features nccl,cuda).
See the following environment variables:
| Name | Function | Usage |
|---|---|---|
| MISTRALRS_NO_NCCL=1 | Disable TP and NCCL | If the model does not fit on the available CUDA devices, disabling NCCL will re-enable automatic device mapping |
Single-Node Support
Set the number of ranks using MISTRALRS_MN_LOCAL_WORLD_SIZE, e.g.,
MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -p 8000 -m Qwen/Qwen3-30B-A3B-Instruct-2507
If no MISTRALRS_MN_LOCAL_WORLD_SIZE environment variable is given, mistral.rs will split the model across all available devices.
Multi-node support
# Head node:
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_HEAD_NUM_WORKERS=1 MISTRALRS_MN_HEAD_PORT=<PORT> mistralrs run -m ...
# For the worker nodes:
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=0 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=1 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=2 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
Multi-node support in mistral.rs divides the nodes into two groups: a “head” node, and multiple “worker” nodes. Head node choice is arbitrary. For example, if a system has 8 nodes, there will be 1 “head” node, and 7 “worker” nodes.
To enable multi-node, set the MISTRALRS_MN_GLOBAL_WORLD_SIZE=<number> environment variable to the total number of GPUs across all nodes, including both the “head” node and all “worker” nodes. Note: this number must be a power of 2.
It is recommended to use server mode with mistral.rs when in multi-node. Currently, you must send requests to every node!
The following environment variables must be set for each node:
Head node:
| Name | Function | Usage |
|---|---|---|
| MISTRALRS_MN_HEAD_NUM_WORKERS=<number> | The number of worker nodes which will be connected. | This should be the number of nodes in the system, minus 1 for the head node. |
| MISTRALRS_MN_HEAD_PORT=<PORT> | The port on which to communicate with the worker nodes. | Worker nodes will connect to this port via TCP sockets. |
Worker node:
| Name | Function | Usage |
|---|---|---|
| MISTRALRS_MN_WORKER_ID=<number> | The 0-indexed worker ID for this worker node. | If there are 4 nodes (1 head, 3 workers), then the worker IDs will be 0, 1, and 2. |
| MISTRALRS_MN_WORKER_SERVER_ADDR=<ADDR>:<PORT> | The IP address and port to connect to the server. | This is used to establish communication with the head node. |
Ring backend in mistral.rs
Mistral.rs provides a TCP-based ring backend for distributed tensor-parallel inference. This backend is enabled by compiling with the ring feature and implements collective operations over a ring topology using TCP sockets.
Prerequisites
- Build with the ring feature enabled, in addition to any others: cargo build --release --features ring
- Ensure the specified TCP ports are open and reachable between processes.
- The world_size must be a power of 2 (2, 4, 8, 16, etc.) for correct operation.
Configuration
Create one JSON configuration file per process with the following fields:
| Field | Type | Description |
|---|---|---|
| master_ip | string | Optional. IP address for the master node. |
| master_port | integer | Optional. Port for the master node. |
| port | integer | Local port to bind for incoming connections from the left neighbor. |
| right_port | integer | Port on which the right neighbor is listening (used to connect outgoing to the right). |
| right_ip | string | Optional. IP address of the right neighbor (defaults to 0.0.0.0). |
| rank | integer | Rank of this process in [0..world_size). |
| world_size | integer | Total number of processes in the ring. Must be a power of 2 (e.g., 2, 4, 8, 16, etc.). |
This address and port should form a ring topology for each of the nodes. For example, the last node should point to the first node as its right neighbor.
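For convenience, the per-rank configuration files can be generated with a small script. The sketch below assumes a single host and example ports; add right_ip and the optional master fields as needed:
import json

world_size = 4      # must be a power of 2
base_port = 12345   # example port range

for rank in range(world_size):
    cfg = {
        "port": base_port + rank,                           # where this rank listens
        "right_port": base_port + (rank + 1) % world_size,  # right neighbor's port
        "rank": rank,
        "world_size": world_size,
    }
    with open(f"ring_{rank}.json", "w") as f:
        json.dump(cfg, f, indent=2)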
Although all processes participate in collective communication, Rank 0 acts as the master node. For example, interactive mode or the server runs on Rank 0, while other ranks act as background workers.
Example ring topology:
+---------+ +---------+
| Rank 0 | -----> | Rank 1 |
| IP: A | | IP: B |
| Port: X | | Port: Y |
+----+----+ +----+----+
^ |
| v
+----+----+ +----+----+
| Rank 3 | <----- | Rank 2 |
| IP: D | | IP: C |
| Port: W | | Port: Z |
+---------+ +---------+
Each node connects to its right neighbor by IP and port, and the last node wraps around to the first.
Example for two processes:
-
{ "master_ip": "0.0.0.0", "master_port": 1234, "port": 12345, "right_port": 12346, "rank": 0, "world_size": 2 } -
{ "master_ip": "0.0.0.0", "master_port": 1234, "port": 12346, "right_port": 12345, "rank": 1, "world_size": 2 }
Multi-Machine Example
To run on different machines, update the right_ip field in each config to the actual IP address of the neighbor process. For example, if you have two machines with IPs 192.168.1.10 and 192.168.1.11:
-
ring_0.jsonon Machine A (192.168.1.10):{ "port": 12345, "right_port": 12346, "right_ip": "192.168.1.11", "rank": 0, "world_size": 2 } -
ring_1.jsonon Machine B (192.168.1.11):{ "port": 12346, "right_port": 12345, "right_ip": "192.168.1.10", "rank": 1, "world_size": 2 }
Make sure that the specified ports are open and that each machine can reach the other via TCP on those ports.
Usage
Set the RING_CONFIG environment variable to point to the JSON file for each process, then run your application built with the ring feature:
# Process 0 or computer 0
export RING_CONFIG=path/to/ring_0.json
cargo run --release --features ring -- ...
# Process 1 or computer 1
export RING_CONFIG=path/to/ring_1.json
cargo run --release --features ring -- ...
The ring backend will automatically handle collective communication for tensor-parallel inference.
Tool calling
Tool calling makes LLMs smarter.
LLMs use tool calling to interact with the outside world. Mistral.rs has OpenAI-compatible support for tool calling in all APIs: HTTP, Python, and Rust.
Note that some models, such as Mistral Small/Nemo models, require a chat template to be specified. For example:
mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
OpenAI docs: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models
We support tool calling for the following models in the OpenAI-compatible format, and we also parse their native tool-calling output:
- Llama 4
- Llama 3.1/3.2/3.3
- Mistral Small (including 3.1 + multimodal)
- Mistral Nemo
- Hermes 2 Pro
- Hermes 3
- DeepSeek V2/V3/R1
- Qwen 3
All models that support tool calling will respond according to the OpenAI tool calling API.
OpenAI compatible HTTP example
Please see our example here.
OpenAI docs: https://platform.openai.com/docs/api-reference/chat/create?lang=curl
Rust example
Please see our example here.
Python example
Please see our notebook here.
Tool callbacks
You can override tool execution using a tool callback. The callback receives the tool name and a dictionary of arguments and must return the tool output as a string.
Python
import json

from mistralrs import Architecture, Runner, Which

def tool_cb(name: str, args: dict) -> str:
    # `local_search` here stands in for your own search implementation.
    if name == "local_search":
        return json.dumps(local_search(args.get("query", "")))
    return ""

runner = Runner(
    which=Which.Plain(model_id="YourModel/ID", arch=Architecture.Llama),
    tool_callback=tool_cb,
)
See custom_search.py for a full
example. In Rust pass .with_tool_callback(...) to the builder as demonstrated
in custom_search/main.rs.
Search callbacks
Web search uses a DuckDuckGo-based callback by default. Provide your own search
function with search_callback in Python or .with_search_callback(...) in
Rust. Each callback should return a list of results with title, description,
url and content fields. See WEB_SEARCH.md for more details
and examples.
Web search tool in mistral.rs
mistral.rs is compatible with OpenAI’s web_search_options parameter! Once enabled, this allows web searching for models.
This works with all models that support tool calling. However, your mileage may vary depending on the specific model. The following models worked well during testing and are recommended:
- Hermes 3 3b/8b
- Mistral 3 24b
- Llama 4 Scout/Maverick
- Qwen 3 (⭐ Recommended!)
Web search is supported both in streaming and completion responses! This makes it easy to integrate and test out in interactive mode!
Besides tool calling and parsing of web content, we also use an embedding model to select the most relevant search results.
You can use the web search tool in all the APIs: Python, Rust, and server.
Selecting a search embedding model
Internally, we now use google/embeddinggemma-300m to embed documents for ranking. You can pick from the built-in reranker variants (currently just embedding_gemma) in every API:
- Rust: with_search(SearchEmbeddingModel::EmbeddingGemma300M) in the builder
- Python: search_embedding_model="embedding_gemma" in the Runner
- Server: the --search-embedding-model embedding_gemma flag
Specifying a custom search callback
By default, mistral.rs uses a DuckDuckGo-based search callback. To override this, you can provide your own search function:
- Rust: use .with_search_callback(...) on the model builder with an Arc<dyn Fn(&SearchFunctionParameters) -> anyhow::Result<Vec<SearchResult>> + Send + Sync>.
- Python: pass the search_callback keyword argument to Runner, which should be a function def search_callback(query: str) -> List[Dict[str, str]] returning a list of results with keys "title", "description", "url", and "content".
Example in Python:
def search_callback(query: str) -> list[dict[str, str]]:
# Implement your custom search logic here, returning a list of result dicts
return [
{
"title": "Example Result",
"description": "An example description",
"url": "https://example.com",
"content": "Full text content of the page",
},
# more results...
]
from mistralrs import Runner, Which, Architecture
runner = Runner(
which=Which.Plain(model_id="YourModel/ID", arch=Architecture.Mistral),
enable_search=True,
search_callback=search_callback,
)
HTTP server
Be sure to add --enable-search!
Here are some examples using various models. Note that this works for both streaming and completion requests, so interactive mode is featured here!
mistralrs run --enable-search --isq 4 -m Qwen/Qwen3-4B
mistralrs serve --enable-search -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
mistralrs run --enable-search --isq 4 -m NousResearch/Hermes-3-Llama-3.1-8B
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = [
{
"role": "user",
"content": "Can you show me some code using mistral.rs for running Llama 3.2 Vision?",
}
]
completion = client.chat.completions.create(
model="default",
messages=messages,
tool_choice="auto",
max_tokens=1024,
web_search_options={},
)
# print(completion.usage)
print(completion.choices[0].message.content)
if completion.choices[0].message.tool_calls is not None:
# Should never happen.
tool_called = completion.choices[0].message.tool_calls[0].function
print(tool_called)
Python SDK
from mistralrs import (
Runner,
Which,
ChatCompletionRequest,
Architecture,
WebSearchOptions,
)
# Define a custom search callback if desired
def my_search_callback(query: str) -> list[dict[str, str]]:
# Fetch or compute search results here
return [
{
"title": "Mistral.rs GitHub",
"description": "Official mistral.rs repository",
"url": "https://github.com/EricLBuehler/mistral.rs",
"content": "mistral.rs is a Rust binding for Mistral models...",
},
]
runner = Runner(
which=Which.Plain(
model_id="NousResearch/Hermes-3-Llama-3.1-8B",
arch=Architecture.Llama,
),
enable_search=True,
search_callback=my_search_callback,
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": "Can you show me some code using mistral.rs for running Llama 3.2 Vision?",
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
web_search_options=WebSearchOptions(
search_context_size=None, user_location=None
),
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
use anyhow::Result;
use mistralrs::{
SearchEmbeddingModel, IsqType, RequestBuilder, TextMessageRole, TextMessages, TextModelBuilder,
WebSearchOptions,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("NousResearch/Hermes-3-Llama-3.1-8B")
.with_isq(IsqType::Q4K)
.with_logging()
.with_search(SearchEmbeddingModel::default())
.build()
.await?;
let messages = TextMessages::new().add_message(
TextMessageRole::User,
"What is the weather forecast for Boston?",
);
let messages =
RequestBuilder::from(messages).with_web_search_options(WebSearchOptions::default());
let response = model.send_chat_request(messages).await?;
println!("What is the weather forecast for Boston?\n\n");
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Chat templates and tokenizer customization
JINJA chat templates (recommended method)
Some models do not come with support for tool calling or other features, and as such it might be necessary to specify your own chat template.
We provide some chat templates here, and it is easy to modify or create others to customize chat template behavior.
To use this, add the jinja-explicit parameter to the various APIs
mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
Chat template overrides
Mistral.rs attempts to automatically load a chat template from the tokenizer_config.json file. This enables high flexibility across instruction-tuned models and ensures accurate chat templating. However, if the chat_template field is missing, then a JINJA chat template should be provided. The JINJA chat template may use messages, add_generation_prompt, bos_token, eos_token, and unk_token as inputs.
We provide some chat templates here, and it is easy to modify or create others to customize chat template behavior.
For example, to use the chatml template, specify --chat-template before the model architecture:
mistralrs serve -p 1234 --log output.log --chat-template ./chat_templates/chatml.json -m meta-llama/Llama-3.2-3B-Instruct
Note: For GGUF models, the chat template may be loaded directly from the GGUF file by omitting any other chat template sources.
Tokenizer
Some models do not provide a tokenizer.json file although mistral.rs expects one. To solve this, please run this script. It will output the tokenizer.json file for your specific model. This may be used by passing the --tokenizer-json flag after the model architecture. For example:
$ python3 scripts/get_tokenizers_json.py
Enter model ID: microsoft/Orca-2-13b
$ mistralrs serve -p 1234 --log output.log -m microsoft/Orca-2-13b --tokenizer-json tokenizer.json
Putting it all together, to run, for example, an Orca model (which does not come with a tokenizer.json or chat template):
- Generate the
tokenizer.jsonby running the script atscripts/get_tokenizers_json.py. This will output some files includingtokenizer.jsonin the working directory. - Find and copy the correct chat template from
chat-templatesto the working directory (eg.,cp chat_templates/chatml.json .) - Run
mistralrs serve, specifying the tokenizer and chat template:mistralrs serve -p 1234 --log output.txt --chat-template chatml.json -m microsoft/Orca-2-13b -t tokenizer.json
Note: For GGUF models, the tokenizer may be loaded directly from the GGUF file by omitting the tokenizer model ID.
Sampling and penalty techniques in mistral.rs
mistral.rs supports a comprehensive set of sampling and penalty techniques to control text generation. These can be configured via the HTTP API, Python SDK, or Rust SDK.
Temperature
Controls the randomness of token selection. Lower values make output more deterministic, higher values increase creativity and randomness.
- Range: 0.0 to 2.0 (typically 0.0 to 1.0)
- Default: Model-dependent, usually around 0.7
- Effect: At 0.0, always selects the most likely token (greedy). At higher values, sampling becomes more diverse.
Top K
Limits token selection to the K most likely tokens.
- Range: 1 to vocabulary size
- Effect: Lower values restrict choices to only the most probable tokens, reducing randomness.
Top P (Nucleus Sampling)
Limits token selection to the smallest set of tokens whose cumulative probability exceeds P.
- Range: 0.0 to 1.0
- Effect: At 0.1, only tokens comprising the top 10% probability mass are considered. More adaptive than Top K as it adjusts based on the probability distribution.
Min P
Filters out tokens with probability less than min_p * max_probability.
- Range: 0.0 to 1.0
- Effect: Removes low-probability tokens relative to the most likely token. Useful for preventing unlikely tokens from being selected.
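A tiny worked example of the filter (illustrative numbers only):
min_p = 0.05
probs = {"the": 0.50, "a": 0.30, "banana": 0.15, "qwerty": 0.03, "xylophone": 0.02}

threshold = min_p * max(probs.values())  # 0.05 * 0.50 = 0.025
kept = {tok: p for tok, p in probs.items() if p >= threshold}
print(kept)  # "xylophone" (0.02) falls below 0.025 and is removed before sampling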
Stop Sequences
Strings that, when generated, cause generation to stop immediately.
- Type: Array of strings
- Effect: Generation terminates as soon as any stop sequence is produced. Useful for controlling output boundaries.
Repetition Penalty
Applies a multiplicative penalty to tokens that have already appeared in the context.
- Range: Typically 1.0 to 2.0
- Effect: Values > 1.0 make repeated tokens less likely. This is distinct from frequency and presence penalties.
Frequency Penalty
Penalizes tokens based on how many times they’ve appeared in the generated text so far.
- Range: -2.0 to 2.0
- Effect: Positive values reduce repetition proportionally to token frequency. Negative values encourage repetition.
Presence Penalty
Penalizes tokens that have appeared at least once in the generated text.
- Range: -2.0 to 2.0
- Effect: Positive values discourage any repetition (binary penalty). Negative values encourage reusing tokens.
DRY (Don’t Repeat Yourself) Penalty
An advanced anti-repetition technique that detects and penalizes repeated sequences of tokens, not just individual tokens. See the original implementation for details.
DRY Parameters
dry_multiplier: Controls the strength of the penalty. Higher values more strongly discourage repetition.dry_base: Base value for the exponential penalty calculation.dry_allowed_length: Minimum sequence length before the penalty applies. Sequences shorter than this are not penalized.dry_sequence_breakers: Array of tokens (like newlines, punctuation) that reset the sequence tracking. When these tokens appear, the DRY penalty starts fresh.
Example DRY Configuration (HTTP API)
{
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_sequence_breakers": ["\n", ".", "!", "?", ";"]
}
API Usage
All sampling parameters can be set in API requests:
HTTP API
{
"model": "default",
"messages": [...],
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"min_p": 0.05,
"repetition_penalty": 1.1,
"frequency_penalty": 0.5,
"presence_penalty": 0.5,
"stop": ["END", "\n\n"],
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_sequence_breakers": ["\n"]
}
Python SDK
response = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[...],
temperature=0.7,
top_p=0.9,
top_k=40,
min_p=0.05,
repetition_penalty=1.1,
frequency_penalty=0.5,
presence_penalty=0.5,
stop_seqs=["END", "\n\n"],
dry_multiplier=0.8,
dry_base=1.75,
dry_allowed_length=2,
dry_sequence_breakers=["\n"],
)
)
Please suggest more sampling techniques by raising an issue!
Structured model loading with .toml files
Mistral.rs supports loading models from a .toml file, and the fields are the same as for the CLI. Please find some example toml selectors here.
There are a few cases which add functionality that cannot be found in the CLI.
Speculative decoding
What to specify
Under [speculative]
- Specify the gamma parameter
Under [speculative.draft_model]
- Choose a draft model, just like under [model] (the only requirement is that it shares the target model's tokenizer)
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"
[speculative]
gamma = 32
[speculative.draft_model]
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
quantized_model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
quantized_filename = "mistral-7b-instruct-v0.1.Q2_K.gguf"
mistralrs from-config -f toml-selectors/speculative-gguf.toml
AnyMoE
What to specify
Under [anymoe], required unless specified
- Specify the dataset
- Find and specify the prefix/mlp values
- Go to
https://huggingface.co/<MODEL ID>/tree/main?show_file_info=model.safetensors.index.json - Look for the mlp layers: For example
model.layers.27.mlp.down_proj.weightmeans that the prefix ismodel.layersand the mlp ismlp.
- Go to
- Specify the expert or LoRA adapter model IDs
- (Optional) Specify layers to apply AnyMoE to.
Under [anymoe.config]
- Hidden size, typically found at
https://huggingface.co/<BASE MODEL ID>/blob/main/config.json
(For LoRA experts) Under [anymoe.config.expert_type.lora_adapter]
- Rank
- Alpha
- Target modules
mistralrs from-config -f toml-selectors/anymoe.toml
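The prefix/mlp lookup can also be done programmatically. The sketch below assumes huggingface_hub is installed and uses an example model ID:
import json
from huggingface_hub import hf_hub_download

index_path = hf_hub_download(
    "mistralai/Mistral-7B-Instruct-v0.1", "model.safetensors.index.json"
)
with open(index_path) as f:
    weight_names = json.load(f)["weight_map"].keys()

# e.g. "model.layers.27.mlp.down_proj.weight" -> prefix "model.layers", mlp "mlp"
print(sorted(name for name in weight_names if ".mlp." in name)[:3])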
With fine-tuned experts
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"
[anymoe]
dataset_json = "test.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"
With LoRA adapter experts
[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"
[anymoe]
dataset_json = "test.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]
Multi-Model Support
The mistralrs CLI supports loading and serving multiple models simultaneously, allowing you to switch between different models in the same server instance.
- Each model runs in its own engine thread
- Models can have different configurations (quantization, device layers, etc.)
- Memory usage scales with the number of loaded models
- All models share the same server configuration (port, logging, etc.)
- Interactive mode uses the default model or the first model if no default is set
- You can unload all models (including the last one) - they will auto-reload when accessed
Usage
Single-Model Mode (Default)
# Traditional usage - loads one model
mistralrs serve -p 1234 -m meta-llama/Llama-3.2-3B-Instruct
Multi-Model Mode
# Load multiple models from configuration file
mistralrs from-config --file config.toml
Configuration File Format
Create a JSON file with model configurations as object keys:
{
"llama3-3b": {
"alias": "llama3-3b",
"Plain": {
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
},
"qwen3-4b": {
"alias": "qwen3-4b",
"Plain": {
"model_id": "Qwen/Qwen3-4B"
},
"in_situ_quant": "Q4K"
}
}
Configuration Structure
- Object keys (e.g., "llama3-3b", "qwen3-4b"): Organizational labels (for human readability)
- API identifiers: By default the pipeline name (usually the model_id inside the model spec). You can override this with alias.
- Model specification: The model type and configuration (same format as CLI subcommands)
- Optional fields:
  - alias: Custom model ID (nickname) used in API requests
  - chat_template: Custom chat template
  - jinja_explicit: JINJA template file
  - num_device_layers: Device layer configuration
  - in_situ_quant: In-situ quantization setting
How API identifiers work:
- ✅ Object keys are organizational only (for config readability)
- ✅ If alias is set, it becomes the API model ID
- ✅ Otherwise, the pipeline name (usually the model_id field) is used
- ✅ The canonical pipeline name remains accepted as an alias for compatibility
API Usage
Selecting Models in Requests
Use the model field in your requests to specify which model to use:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3-3b",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Default Model Behavior
- Explicit model: Use the alias if configured (e.g., "llama3-3b"), otherwise the full pipeline name (e.g., "meta-llama/Llama-3.2-3B-Instruct")
- Default model: Use "default" to explicitly request the default model
- Auto-fallback: If the model field is omitted entirely, the default model will be used
# Use default model explicitly
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The default model is either:
- The model specified with --default-model-id when starting the server
- The first model loaded (if no default is explicitly set)
List Available Models
curl http://localhost:1234/v1/models
Returns:
{
"object": "list",
"data": [
{
"id": "default",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "llama3-3b",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "qwen3-4b",
"object": "model",
"created": 1234567890,
"owned_by": "local"
}
]
}
Note: The "default" model is always listed first and represents the server’s default model. If aliases are configured, they will appear in the list while the canonical pipeline names remain accepted.
CLI Arguments
Use the multi-model subcommand with these options:
- --config <PATH> (required): Path to the JSON configuration file
- --default-model-id <ID> (optional): Default model ID for requests that don’t specify a model (alias or pipeline name)
New syntax:
mistralrs from-config --file <CONFIG>
Examples
Example 1: Text Models
{
"llama3-3b": {
"Plain": {
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
},
"qwen3-4b": {
"Plain": {
"model_id": "Qwen/Qwen3-4B"
},
"in_situ_quant": "Q4K"
}
}
Example 2: Mixed Model Types
{
"text-model": {
"Plain": {
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
},
"vision-model": {
"VisionPlain": {
"model_id": "google/gemma-3-4b-it"
}
}
}
Example 3: GGUF Models
{
"llama-gguf": {
"GGUF": {
"tok_model_id": "meta-llama/Llama-3.2-3B-Instruct",
"quantized_model_id": "bartowski/Llama-3.2-3B-Instruct-GGUF",
"quantized_filename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
}
}
}
Model Unloading and Reloading
You can dynamically unload models to free memory and reload them on demand. This is useful for managing GPU memory when working with multiple large models.
Unload a Model
Unload a model from memory while preserving its configuration for later reload:
curl -X POST http://localhost:1234/v1/models/unload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "unloaded"
}
Reload a Model
Manually reload a previously unloaded model:
curl -X POST http://localhost:1234/v1/models/reload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Check Model Status
Get the current status of a specific model:
curl -X POST http://localhost:1234/v1/models/status \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Possible status values:
- loaded: Model is loaded and ready
- unloaded: Model is unloaded but can be reloaded
- reloading: Model is currently being reloaded
- not_found: Model ID not recognized
- no_loader_config: Model cannot be reloaded (missing loader configuration)
- internal_error: An internal error occurred
Auto-Reload
When a request is sent to an unloaded model, it will automatically reload before processing the request. This enables a “lazy loading” pattern where models are only loaded when needed.
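For programmatic use, the same lazy-loading flow can be driven over the HTTP API. The sketch below is illustrative only; it assumes a server running on localhost:1234 and uses the reqwest (with the json feature), serde_json, tokio, and anyhow crates, which are not part of mistral.rs.
use serde_json::json;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = reqwest::Client::new();
    // Free memory while the model is idle.
    client
        .post("http://localhost:1234/v1/models/unload")
        .json(&json!({ "model_id": "meta-llama/Llama-3.2-3B-Instruct" }))
        .send()
        .await?
        .error_for_status()?;
    // Sending a request to the unloaded model triggers an automatic reload
    // before the completion is processed.
    let resp: serde_json::Value = client
        .post("http://localhost:1234/v1/chat/completions")
        .json(&json!({
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "messages": [{ "role": "user", "content": "Hello!" }]
        }))
        .send()
        .await?
        .json()
        .await?;
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}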
List Models with Status
The /v1/models endpoint now includes status information:
curl http://localhost:1234/v1/models
Response:
{
"object": "list",
"data": [
{
"id": "default",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "meta-llama/Llama-3.2-3B-Instruct",
"object": "model",
"created": 1234567890,
"owned_by": "local",
"status": "loaded"
},
{
"id": "Qwen/Qwen3-4B",
"object": "model",
"created": 1234567890,
"owned_by": "local",
"status": "unloaded"
}
]
}
Rust SDK Usage
The mistralrs crate provides MultiModelBuilder for loading multiple models and Model methods for multi-model management.
Loading Multiple Models
By default, model IDs are the pipeline names (usually the HuggingFace model path, e.g., "google/gemma-3-4b-it"). You can provide custom aliases with add_model_with_alias for shorter IDs.
use mistralrs::{IsqType, MultiModelBuilder, TextModelBuilder, VisionModelBuilder, TextMessages, TextMessageRole};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Build a multi-model instance with a vision model and a text model
// Use aliases for shorter model IDs in requests
let model = MultiModelBuilder::new()
.add_model_with_alias(
"gemma-vision",
VisionModelBuilder::new("google/gemma-3-4b-it") // Vision model
.with_isq(IsqType::Q4K)
.with_logging(),
)
.add_model_with_alias(
"qwen-text",
TextModelBuilder::new("Qwen/Qwen3-4B") // Text model
.with_isq(IsqType::Q4K),
)
.with_default_model("gemma-vision")
.build()
.await?;
// Send request to default model
let messages = TextMessages::new()
.add_message(TextMessageRole::User, "Hello!");
let response = model.send_chat_request(messages).await?;
// Send request to specific model using its alias
let messages = TextMessages::new()
.add_message(TextMessageRole::User, "Hello from Qwen!");
let response = model.send_chat_request_with_model(messages, Some("qwen-text")).await?;
Ok(())
}
Model Management Methods
#![allow(unused)]
fn main() {
// List all models (returns aliases if configured, otherwise pipeline names)
let models = model.list_models()?;
// Get/set default model
let default = model.get_default_model_id()?;
model.set_default_model_id("qwen-text")?;
// List models with status
let status = model.list_models_with_status()?;
// Returns Vec<(String, ModelStatus)> where ModelStatus is Loaded, Unloaded, or Reloading
// Check if a model is loaded
let is_loaded = model.is_model_loaded("gemma-vision")?;
// Unload a model to free memory
model.unload_model("gemma-vision")?;
// Reload when needed
model.reload_model("gemma-vision").await?;
}
Available _with_model Methods
All request methods have _with_model variants that accept an optional model ID:
- send_chat_request_with_model(request, model_id: Option<&str>)
- stream_chat_request_with_model(request, model_id: Option<&str>)
- generate_image_with_model(..., model_id: Option<&str>)
- generate_speech_with_model(prompt, model_id: Option<&str>)
- generate_embeddings_with_model(request, model_id: Option<&str>)
- tokenize_with_model(..., model_id: Option<&str>)
- detokenize_with_model(..., model_id: Option<&str>)
- config_with_model(model_id: Option<&str>)
- max_sequence_length_with_model(model_id: Option<&str>)
- re_isq_model_with_model(isq_type, model_id: Option<&str>)
When model_id is None, the default model is used. If aliases are configured, you can pass either the alias or the canonical pipeline name.
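As a minimal sketch of the routing behavior (assuming a model built with MultiModelBuilder and the "qwen-text" alias from the example above):
use mistralrs::{Model, TextMessages, TextMessageRole};
// Illustrative only: `model` is assumed to come from the MultiModelBuilder example above.
async fn route_requests(model: &Model) -> anyhow::Result<()> {
    // None -> the request goes to the default model
    let to_default = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello, default model!");
    let _default_reply = model.send_chat_request_with_model(to_default, None).await?;
    // Some(alias or canonical pipeline name) -> the request goes to that model
    let to_qwen = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello, Qwen!");
    let _qwen_reply = model
        .send_chat_request_with_model(to_qwen, Some("qwen-text"))
        .await?;
    Ok(())
}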
Python SDK Usage
The Python Runner class supports multi-model operations directly.
Basic Usage
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture, Architecture
# Create a runner with a vision model (Gemma 3 4B)
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3-4b-it",
arch=VisionArchitecture.Gemma3,
),
in_situ_quant="Q4K",
)
# Or create a runner with a text model (Qwen3 4B)
# runner = Runner(
# which=Which.Plain(
# model_id="Qwen/Qwen3-4B",
# arch=Architecture.Qwen3,
# ),
# in_situ_quant="Q4K",
# )
# List models
models = runner.list_models()
print(f"Available models: {models}")
# Get/set default model
default = runner.get_default_model_id()
runner.set_default_model_id("google/gemma-3-4b-it")
# Send request with specific model_id
request = ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}]
)
response = runner.send_chat_completion_request(request, model_id=models[0])
If aliases are configured (for example via the server config or Rust MultiModelBuilder), list_models() will return those aliases and you can pass them in model_id. The canonical pipeline names remain accepted.
Model Management
# List models with their status
status = runner.list_models_with_status()
# Returns list of (model_id, status) tuples
# Check if a model is loaded
is_loaded = runner.is_model_loaded("google/gemma-3-4b-it")
# Unload a model to free memory
runner.unload_model("google/gemma-3-4b-it")
# Reload when needed
runner.reload_model("google/gemma-3-4b-it")
Request Methods with model_id
All request methods accept an optional model_id parameter:
# Chat completion
response = runner.send_chat_completion_request(request, model_id="model-id")
# Completion
response = runner.send_completion_request(request, model_id="model-id")
# Embeddings
embeddings = runner.send_embedding_request(request, model_id="model-id")
# Image generation
image = runner.generate_image(prompt, response_format, model_id="model-id")
# Speech generation
audio = runner.generate_audio(prompt, model_id="model-id")
# Tokenization
tokens = runner.tokenize_text(text, add_special_tokens=True, model_id="model-id")
text = runner.detokenize_text(tokens, skip_special_tokens=True, model_id="model-id")
When model_id is None or omitted, the default model is used.
Migration Guide
From MultiModel (Rust)
The MultiModel struct has been removed. Use Model directly with MultiModelBuilder:
#![allow(unused)]
fn main() {
// Old (deprecated)
let multi = MultiModel::new(...);
multi.send_chat_request_to_model(request, "model-id").await?;
// New - model IDs are pipeline names by default (aliases optional)
let model = MultiModelBuilder::new()
.add_model(VisionModelBuilder::new("google/gemma-3-4b-it"))
.add_model(TextModelBuilder::new("Qwen/Qwen3-4B"))
.build()
.await?;
model.send_chat_request_with_model(request, Some("Qwen/Qwen3-4B")).await?;
}
From MultiModelRunner (Python)
The MultiModelRunner class has been removed. Use Runner directly:
# Old (deprecated)
multi_runner = MultiModelRunner(runner)
multi_runner.send_chat_completion_request_to_model(request, "model-id")
# New - model IDs are the registered IDs (aliases if configured)
runner = Runner(which=Which.Plain(model_id="google/gemma-3-4b-it", ...))
runner.send_chat_completion_request(request, model_id="google/gemma-3-4b-it")
MCP (Model Context Protocol) Client
mistral.rs includes a built-in MCP client that allows models to connect to external tools and services through the Model Context Protocol. This enables automatic tool discovery and usage from any MCP-compatible server.
Quick Start
Examples below show HTTP (Hugging Face), Process (filesystem), and WebSocket transports. Replace hf_xxx with your actual Hugging Face token for HTTP examples.
Rust SDK
use mistralrs::{
TextModelBuilder, McpClientConfig, McpServerConfig, McpServerSource,
TextMessages, TextMessageRole,
};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Process example (filesystem server - recommended for getting started)
let mcp_config = McpClientConfig {
servers: vec![McpServerConfig {
name: "Filesystem Tools".to_string(),
source: McpServerSource::Process {
command: "npx".to_string(),
args: vec!["@modelcontextprotocol/server-filesystem".to_string(), ".".to_string()],
work_dir: None,
env: None,
},
..Default::default()
}],
auto_register_tools: true,
..Default::default()
};
// Alternative HTTP example (Hugging Face MCP server)
let _mcp_config_http = McpClientConfig {
servers: vec![McpServerConfig {
id: "hf_server".to_string(),
name: "Hugging Face MCP".to_string(),
source: McpServerSource::Http {
url: "https://hf.co/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
},
enabled: false, // Disabled by default
tool_prefix: Some("hf".to_string()),
resources: None,
bearer_token: Some("hf_xxx".to_string()), // Your HF token
}],
auto_register_tools: true,
tool_timeout_secs: Some(30),
max_concurrent_calls: Some(5),
};
// Alternative WebSocket example
let _mcp_config_websocket = McpClientConfig {
servers: vec![McpServerConfig {
name: "WebSocket Example".to_string(),
source: McpServerSource::WebSocket {
url: "wss://api.example.com/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
},
enabled: false, // Disabled by default
..Default::default()
}],
auto_register_tools: true,
..Default::default()
};
// Build model with MCP support
let model = TextModelBuilder::new("Qwen/Qwen3-4B")
.with_mcp_client(mcp_config)
.build()
.await?;
// Use the model - tools are automatically available
let messages = TextMessages::new()
.add_message(
TextMessageRole::User,
"List the files in the current directory and create a test.txt file"
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
Ok(())
}
Python SDK
import mistralrs
# Process example (filesystem server - recommended for getting started)
filesystem_server = mistralrs.McpServerConfigPy(
name="Filesystem Tools",
source=mistralrs.McpServerSourcePy.Process(
command="npx",
args=["@modelcontextprotocol/server-filesystem", "."],
work_dir=None,
env=None
)
)
# Alternative HTTP example (Hugging Face MCP server)
hf_server = mistralrs.McpServerConfigPy(
id="hf_server",
name="Hugging Face MCP",
source=mistralrs.McpServerSourcePy.Http(
url="https://hf.co/mcp",
timeout_secs=30,
headers=None
),
enabled=False, # Disabled by default
tool_prefix="hf",
resources=None,
bearer_token="hf_xxx" # Your HF token
)
# Alternative WebSocket example
websocket_server = mistralrs.McpServerConfigPy(
name="WebSocket Example",
source=mistralrs.McpServerSourcePy.WebSocket(
url="wss://api.example.com/mcp",
timeout_secs=30,
headers=None
),
enabled=False # Disabled by default
)
# Create MCP client config using filesystem server (others are disabled)
mcp_config = mistralrs.McpClientConfigPy(
servers=[filesystem_server], # hf_server, websocket_server can be added when enabled
auto_register_tools=True,
tool_timeout_secs=30,
max_concurrent_calls=5
)
# Build model with MCP support
runner = mistralrs.Runner(
which=mistralrs.Which.Plain(
model_id="Qwen/Qwen3-4B",
arch=mistralrs.Architecture.Qwen3,
),
mcp_client_config=mcp_config
)
# Use the model - tools are automatically available
res = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "List the files in the current directory and create a test.txt file"}
],
max_tokens=500,
temperature=0.1,
)
)
print(res.choices[0].message.content)
HTTP API
- Create mcp-config.json:
Process Example (Recommended for getting started):
{
"servers": [{
"name": "Filesystem Tools",
"source": {
"type": "Process",
"command": "npx",
"args": ["@modelcontextprotocol/server-filesystem", "."]
}
}],
"auto_register_tools": true
}
Note: To install the filesystem server, run:
npx @modelcontextprotocol/server-filesystem . -y
HTTP Example (Hugging Face MCP Server):
{
"servers": [
{
"name": "Hugging Face MCP",
"source": {
"type": "Http",
"url": "https://hf.co/mcp",
"timeout_secs": 30
},
"bearer_token": "hf_xxx",
"tool_prefix": "hf",
"enabled": false
},
{
"name": "Filesystem Tools",
"source": {
"type": "Process",
"command": "npx",
"args": ["@modelcontextprotocol/server-filesystem", "."]
}
}
],
"auto_register_tools": true,
"tool_timeout_secs": 30,
"max_concurrent_calls": 5
}
WebSocket Example:
{
"servers": [
{
"name": "WebSocket Example",
"source": {
"type": "WebSocket",
"url": "wss://api.example.com/mcp",
"timeout_secs": 30
},
"enabled": false
},
{
"name": "Filesystem Tools",
"source": {
"type": "Process",
"command": "npx",
"args": ["@modelcontextprotocol/server-filesystem", "."]
}
}
],
"auto_register_tools": true
}
- Start server with MCP:
mistralrs serve \
-p 1234 \
--mcp-config mcp-config.json \
-m Qwen/Qwen3-4B
- Use the API:
curl -X POST http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"messages": [
{"role": "user", "content": "List the files in the current directory and create a test.txt file"}
],
"max_tokens": 500,
"temperature": 0.1
}'
Key Features
- Automatic Tool Discovery: Tools are discovered from MCP servers at startup
- Multi-Server Support: Connect to multiple MCP servers simultaneously
- Transport Flexibility: HTTP, WebSocket, and Process transports supported
- Authentication: Bearer token support for secure connections
- Tool Prefixing: Avoid naming conflicts between servers
- Concurrency Control: Limit parallel tool executions
- Timeout Management: Control individual tool execution timeouts
Next Steps
- Configuration Reference - Detailed configuration options
- Transport Types - HTTP, WebSocket, and Process transports
- Advanced Usage - Multi-server setups, custom headers, and more
- MCP Server Development - Building your own MCP server
Common MCP Servers
- Filesystem: @modelcontextprotocol/server-filesystem - Local file operations (Process)
- Hugging Face: https://hf.co/mcp - Access HF models, datasets, and spaces (HTTP)
- Postgres: @modelcontextprotocol/server-postgres - Database operations (Process)
Additional servers (install separately):
- Brave Search - Web search capabilities
- GitHub - GitHub API access
Replace placeholder tokens and URLs with actual values for your use case.
Troubleshooting
Common Issues
“MCP server failed to start” or “npx command not found”
- Install Node.js and npm: curl -fsSL https://deb.nodesource.com/setup_lts.x | sudo -E bash - && sudo apt-get install -y nodejs
- Install the filesystem server: npx @modelcontextprotocol/server-filesystem . -y
“No tools available” or “tools_available: false”
- Check server logs for MCP connection errors
- Verify the MCP config file path is correct
- Ensure the MCP server process is running:
ps aux | grep mcp
“Tool call failed” or timeout errors
- Increase tool_timeout_secs in your config (default: 30)
- Check the max_concurrent_calls setting (start with 1-5)
- Verify file permissions for filesystem operations
Authentication errors with HTTP servers
- Double-check bearer_token values (e.g., HF tokens start with hf_)
- Verify API endpoints are accessible:
curl -H "Authorization: Bearer YOUR_TOKEN" https://hf.co/mcp
Need help?
- MCP Server Registry - Find more servers
- Discord Community - Get support
MCP protocol support
mistralrs serve can speak MCP (the Model Context Protocol) in addition to the regular OpenAI-compatible REST API.
At a high level, MCP is an opinionated, tool-based JSON-RPC 2.0 protocol that lets clients interact with models through structured tool calls instead of specialised HTTP routes.
The implementation in Mistral.rs is powered by rust-mcp-sdk and automatically registers tools based on the modalities supported by the loaded model (text, vision, …).
Exposed tools:
| Tool | Minimum input -> output modalities | Description |
|---|---|---|
chat | Text → Text | Wraps the OpenAI /v1/chat/completions endpoint |
Running
Start the normal HTTP server and add the --mcp-port flag to expose an MCP endpoint in parallel on a separate port:
mistralrs serve \
-p 1234 \
--mcp-port 4321 \
-m mistralai/Mistral-7B-Instruct-v0.3
Check if it’s working
The following curl command lists the tools advertised by the server and therefore serves as a quick smoke-test:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
Example clients
Python
The reference Python SDK can be installed via:
pip install --upgrade mcp
Here is a minimal end-to-end example that initialises a session, lists the available tools and finally sends a chat request:
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client
SERVER_URL = "http://localhost:4321/mcp"
async def main() -> None:
# The helper creates an SSE (Server-Sent-Events) transport under the hood
async with streamablehttp_client(SERVER_URL) as (read, write, _):
async with ClientSession(read, write) as session:
# --- INITIALIZE ---
init_result = await session.initialize()
print("Server info:", init_result.serverInfo)
# --- LIST TOOLS ---
tools = await session.list_tools()
print("Available tools:", [t.name for t in tools.tools])
# --- CALL TOOL ---
resp = await session.call_tool(
"chat",
arguments={
"messages": [
{"role": "user", "content": "Hello MCP 👋"},
{"role": "assistant", "content": "Hi there!"}
],
"maxTokens": 50,
"temperature": 0.7,
},
)
# resp.content is a list[CallToolResultContentItem]; extract text parts
text = "\n".join(c.text for c in resp.content if c.type == "text")
print("Model replied:", text)
if __name__ == "__main__":
asyncio.run(main())
Rust
use anyhow::Result;
use rust_mcp_sdk::{
mcp_client::client_runtime,
schema::{
CallToolRequestParams, ClientCapabilities, CreateMessageRequest,
Implementation, InitializeRequestParams, Message, LATEST_PROTOCOL_VERSION,
},
ClientSseTransport, ClientSseTransportOptions,
};
struct Handler;
#[async_trait::async_trait]
impl rust_mcp_sdk::mcp_client::ClientHandler for Handler {}
#[tokio::main]
async fn main() -> Result<()> {
let transport = ClientSseTransport::new(
"http://localhost:4321/mcp",
ClientSseTransportOptions::default(),
)?;
let details = InitializeRequestParams {
capabilities: ClientCapabilities::default(),
client_info: Implementation { name: "mcp-client".into(), version: "0.1".into() },
protocol_version: LATEST_PROTOCOL_VERSION.into(),
};
let client = client_runtime::create_client(details, transport, Handler);
client.clone().start().await?;
let req = CreateMessageRequest {
model: "mistralai/Mistral-7B-Instruct-v0.3".into(),
messages: vec![Message::user("Explain Rust ownership.")],
..Default::default()
};
let result = client
.call_tool(CallToolRequestParams::new("chat", req.into()))
.await?;
println!("{}", result.content[0].as_text_content()?.text);
client.shut_down().await?;
Ok(())
}
HTTP
Call a tool:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "chat",
"arguments": {
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello, what’s the time?" }
],
"maxTokens": 50,
"temperature": 0.7
}
}
}'
Initialize:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {}
}'
List tools:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
Limitations & roadmap
The MCP support that ships with the current Mistral.rs release focuses on the happy path. A few niceties have not yet been implemented, and PRs are more than welcome:
- Streaming token responses (similar to the stream=true flag in the OpenAI API).
- An authentication layer – if you are exposing the MCP port publicly, run it behind a reverse proxy that handles auth (e.g. nginx + OIDC).
- Additional tools for other modalities such as vision or audio once the underlying crates stabilise.
If you would like to work on any of the above please open an issue first so the work can be coordinated.
MCP Configuration Reference
This page provides a complete reference for configuring the MCP client in mistral.rs.
Quick Start - Minimal Configuration
For simple use cases, you can now use a minimal configuration that leverages smart defaults:
{
"servers": [{
"name": "Hugging Face MCP Server",
"source": {
"type": "Http",
"url": "https://hf.co/mcp"
},
"bearer_token": "hf_xxx"
}]
}
This automatically provides:
- UUID-based server ID: Unique identifier generated automatically
- Enabled by default: Server is active without explicit enabled: true
- UUID-based tool prefix: Prevents naming conflicts automatically
- No timeouts: Tools and connections don’t timeout by default
- Sequential execution: Only 1 concurrent tool call to prevent overwhelming servers
- Auto-registration: Tools are automatically discovered and registered
Configuration Structure
McpClientConfig
The top-level configuration for the MCP client:
{
"servers": [...], // Array of MCP server configurations
"auto_register_tools": true, // Automatically register discovered tools (default: true)
"tool_timeout_secs": null, // Timeout for individual tool calls, null = no timeout (default: null)
"max_concurrent_calls": 1 // Maximum concurrent tool executions (default: 1)
}
McpServerConfig
Configuration for each MCP server:
{
"id": "unique_id", // Unique identifier (default: UUID if not specified)
"name": "Display Name", // Human-readable name
"source": {...}, // Transport configuration (see below)
"enabled": true, // Enable/disable this server (default: true)
"tool_prefix": "mcp_abc123", // Prefix for tool names (default: UUID-based if not specified)
"resources": ["pattern"], // Optional resource patterns
"bearer_token": "token" // Optional authentication token
}
Transport Source Configuration
HTTP Transport
{
"type": "Http",
"url": "https://api.example.com/mcp",
"timeout_secs": null, // Optional, null = no timeout (default)
"headers": { // Optional custom headers
"X-API-Version": "v1",
"User-Agent": "mistral-rs/0.6.0"
}
}
WebSocket Transport
{
"type": "WebSocket",
"url": "wss://realtime.example.com/mcp",
"timeout_secs": null, // Optional, null = no timeout (default)
"headers": { // Optional WebSocket headers
"Origin": "https://mistral.rs",
"Sec-WebSocket-Protocol": "mcp"
}
}
Process Transport
{
"type": "Process",
"command": "mcp-server-filesystem",
"args": ["--root", "/tmp"], // Command arguments
"work_dir": "/home/user", // Optional working directory
"env": { // Optional environment variables
"MCP_LOG_LEVEL": "info"
}
}
Field Reference
McpClientConfig Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
servers | Array | Yes | - | List of MCP server configurations |
auto_register_tools | Boolean | No | true | Automatically discover and register tools at startup |
tool_timeout_secs | Integer | No | null | Timeout in seconds for individual tool calls (null = no timeout) |
max_concurrent_calls | Integer | No | 1 | Maximum number of concurrent tool executions |
McpServerConfig Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
id | String | No | UUID | Unique identifier for the server (UUID generated if not provided) |
name | String | Yes | - | Human-readable server name |
source | Object | Yes | - | Transport configuration |
enabled | Boolean | No | true | Whether to connect to this server |
tool_prefix | String | No | UUID-based | Prefix to add to all tool names (UUID-based if not provided) |
resources | Array | No | None | Resource URI patterns to subscribe to |
bearer_token | String | No | None | Bearer token for authentication |
Transport Source Fields
HTTP Source
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
type | String | Yes | - | Must be “Http” |
url | String | Yes | - | HTTP/HTTPS URL of the MCP server |
timeout_secs | Integer | No | null | Request timeout in seconds (null = no timeout) |
headers | Object | No | None | Additional HTTP headers |
WebSocket Source
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
type | String | Yes | - | Must be “WebSocket” |
url | String | Yes | - | WS/WSS URL of the MCP server |
timeout_secs | Integer | No | null | Connection timeout in seconds (null = no timeout) |
headers | Object | No | None | WebSocket handshake headers |
Process Source
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
type | String | Yes | - | Must be “Process” |
command | String | Yes | - | Executable command to run |
args | Array | No | [] | Command line arguments |
work_dir | String | No | Current dir | Working directory |
env | Object | No | None | Environment variables |
Authentication
Bearer Token
The bearer_token field is automatically added as an Authorization: Bearer <token> header for HTTP and WebSocket connections.
{
"bearer_token": "hf_AbCdEfGhIjKlMnOpQrStUvWxYz"
}
Custom Headers
For other authentication schemes, use the headers field:
{
"source": {
"type": "Http",
"url": "https://api.example.com/mcp",
"headers": {
"X-API-Key": "your-api-key",
"X-Client-ID": "your-client-id"
}
}
}
Tool Naming
Without Prefix
Tools are registered with their original names:
- MCP tool: search → Registered as: search
With Prefix
When tool_prefix is set, all tools from that server get prefixed:
- MCP tool: search with prefix web → Registered as: web_search
This prevents conflicts when multiple servers provide tools with the same name.
Resource Patterns
The resources field accepts glob-like patterns:
{
"resources": [
"file://**/*.txt", // All .txt files
"file://data/**", // Everything under data/
"db://users/*", // All user records
"api://v1/metrics" // Specific endpoint
]
}
Environment Variables
Using Environment Variables in Configuration
While JSON doesn’t support environment variables directly, you can use them when building configurations programmatically:
#![allow(unused)]
fn main() {
McpServerConfig {
bearer_token: std::env::var("HF_TOKEN").ok(),
source: McpServerSource::Http {
url: std::env::var("MCP_SERVER_URL")
.unwrap_or_else(|_| "https://hf.co/mcp".to_string()),
// ...
},
// ...
}
}
import os
McpServerConfigPy(
bearer_token=os.getenv("HF_TOKEN"),
source=McpServerSourcePy.Http(
url=os.getenv("MCP_SERVER_URL", "https://hf.co/mcp")
)
)
MCP-Related Environment Variables
| Variable | Description |
|---|---|
MCP_CONFIG_PATH | Path to MCP configuration file |
MCP_LOG_LEVEL | Logging level for MCP operations |
MCP_POOL_SIZE | Connection pool size for HTTP/WebSocket |
Validation Rules
- Unique Server IDs: All server id values must be unique
- Valid URLs: HTTP URLs must start with http:// or https://
- Valid WebSocket URLs: Must start with ws:// or wss://
- Executable Commands: Process commands must be executable
- Tool Name Conflicts: Use tool_prefix to avoid conflicts
Example Configurations
Single Server (Hugging Face) - Minimal
{
"servers": [{
"name": "Hugging Face MCP Server",
"source": {
"type": "Http",
"url": "https://hf.co/mcp"
},
"bearer_token": "hf_xxx"
}]
}
Single Server (Hugging Face) - Full Configuration
{
"servers": [{
"id": "hf",
"name": "Hugging Face MCP",
"source": {
"type": "Http",
"url": "https://hf.co/mcp",
"timeout_secs": 30
},
"enabled": true,
"tool_prefix": "hf",
"bearer_token": "hf_xxx"
}],
"auto_register_tools": true,
"tool_timeout_secs": 30,
"max_concurrent_calls": 5
}
Multi-Server Setup
{
"servers": [
{
"id": "hf",
"name": "Hugging Face",
"source": {"type": "Http", "url": "https://hf.co/mcp"},
"tool_prefix": "hf",
"bearer_token": "hf_xxx"
},
{
"id": "github",
"name": "GitHub API",
"source": {"type": "Http", "url": "https://api.github.com/mcp"},
"tool_prefix": "gh",
"bearer_token": "ghp_xxx"
},
{
"id": "local_fs",
"name": "Filesystem",
"source": {
"type": "Process",
"command": "mcp-server-filesystem",
"args": ["--root", "/data", "--readonly"]
},
"tool_prefix": "fs"
}
],
"auto_register_tools": true,
"tool_timeout_secs": 30,
"max_concurrent_calls": 10
}
MCP Transport Types
mistral.rs supports three transport types for connecting to MCP servers, each optimized for different use cases.
HTTP Transport
Best for public APIs, RESTful services, and servers behind load balancers.
Configuration
{
"source": {
"type": "Http",
"url": "https://api.example.com/mcp",
"timeout_secs": 30,
"headers": {
"X-API-Version": "v1",
"User-Agent": "mistral-rs/0.6.0"
}
},
"bearer_token": "your-api-token"
}
Features
- Server-Sent Events (SSE) support for streaming responses
- Custom headers for API versioning or client identification
- Bearer token authentication (added as Authorization: Bearer <token>)
- Configurable timeouts
- Standard HTTP semantics
Example: Hugging Face MCP
#![allow(unused)]
fn main() {
McpServerSource::Http {
url: "https://hf.co/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
}
}
WebSocket Transport
Best for real-time applications, bidirectional communication, and low-latency requirements.
Configuration
{
"source": {
"type": "WebSocket",
"url": "wss://realtime.example.com/mcp",
"timeout_secs": 60,
"headers": {
"Origin": "https://mistral.rs",
"Sec-WebSocket-Protocol": "mcp"
}
},
"bearer_token": "your-websocket-token"
}
Features
- Persistent connections reduce handshake overhead
- Server-initiated notifications
- Lower latency for frequent tool calls
- Automatic reconnection handling
- WebSocket-specific headers support
Example: Real-time Data Feed
#![allow(unused)]
fn main() {
McpServerSource::WebSocket {
url: "wss://data.example.com/mcp".to_string(),
timeout_secs: Some(60),
headers: Some(headers),
}
}
Process Transport
Best for local tools, development servers, and sandboxed environments.
Configuration
{
"source": {
"type": "Process",
"command": "mcp-server-filesystem",
"args": ["--root", "/tmp", "--readonly"],
"work_dir": "/home/user/workspace",
"env": {
"MCP_LOG_LEVEL": "info",
"MCP_TIMEOUT": "30"
}
}
}
Features
- No network overhead
- Process isolation for security
- Direct stdin/stdout communication
- Environment variable configuration
- Working directory control
- No authentication needed (process inherits permissions)
Example: Filesystem Server
#![allow(unused)]
fn main() {
McpServerSource::Process {
command: "mcp-server-filesystem".to_string(),
args: vec!["--root".to_string(), "/tmp".to_string()],
work_dir: None,
env: None,
}
}
Transport Selection Guide
| Use Case | Recommended Transport | Why |
|---|---|---|
| Public APIs | HTTP | Standard auth, caching, load balancing |
| Local tools | Process | No network, process isolation |
| Real-time data | WebSocket | Low latency, server push |
| Corporate proxies | HTTP | Proxy support, standard ports |
| Development | Process | Easy debugging, no network setup |
| Interactive apps | WebSocket | Bidirectional, persistent connection |
Security Considerations
HTTP
- Always use HTTPS in production
- Bearer tokens transmitted with each request
- Consider token rotation strategies
WebSocket
- Use WSS (WebSocket Secure) in production
- Bearer token sent during handshake
- Connection persists with authenticated state
Process
- Inherits user permissions
- Sandboxing via work_dir and env
- No network exposure
Performance Tips
- HTTP: Enable keep-alive, use connection pooling
- WebSocket: Reuse connections, handle reconnection gracefully
- Process: Minimize startup time, use long-running processes
Error Handling
All transports implement automatic retry with exponential backoff:
- Initial retry: 1 second
- Max retry: 60 seconds
- Max attempts: 5
Custom retry behavior can be configured per server.
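As a rough illustration of that schedule (1 s initial delay, doubling up to 60 s, at most 5 attempts) – and not the library's internal implementation – a generic async retry wrapper could look like the following sketch (assumes the tokio crate):
use std::time::Duration;
// Illustrative retry helper mirroring the schedule described above.
// Not part of mistral.rs; wrap your own fallible async operation with it.
async fn retry_with_backoff<T, E, F, Fut>(mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_secs(1);   // initial retry: 1 second
    let max_delay = Duration::from_secs(60);  // max retry: 60 seconds
    let max_attempts = 5;                     // max attempts: 5
    for attempt in 1..=max_attempts {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt == max_attempts => return Err(err),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
    unreachable!("the loop always returns within max_attempts")
}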
Advanced MCP Usage
This guide covers advanced MCP client configurations and usage patterns.
Multi-Server Configuration
Connect to multiple MCP servers simultaneously to access different tool sets:
#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
servers: vec![
// Hugging Face for ML tools
McpServerConfig {
id: "hf_server".to_string(),
name: "Hugging Face MCP".to_string(),
source: McpServerSource::Http {
url: "https://hf.co/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
},
enabled: true,
tool_prefix: Some("hf".to_string()),
resources: None,
bearer_token: Some("hf_xxx".to_string()),
},
// Local filesystem access
McpServerConfig {
id: "fs_server".to_string(),
name: "Filesystem MCP".to_string(),
source: McpServerSource::Process {
command: "mcp-server-filesystem".to_string(),
args: vec!["--root".to_string(), "/data".to_string()],
work_dir: None,
env: None,
},
enabled: true,
tool_prefix: Some("fs".to_string()),
resources: Some(vec!["file://**".to_string()]),
bearer_token: None,
},
// GitHub API access
McpServerConfig {
id: "github_server".to_string(),
name: "GitHub MCP".to_string(),
source: McpServerSource::Http {
url: "https://api.github.com/mcp".to_string(),
timeout_secs: Some(45),
headers: Some(HashMap::from([
("Accept".to_string(), "application/vnd.github.v3+json".to_string()),
])),
},
enabled: true,
tool_prefix: Some("gh".to_string()),
resources: None,
bearer_token: Some("ghp_xxx".to_string()),
},
],
auto_register_tools: true,
tool_timeout_secs: Some(30),
max_concurrent_calls: Some(10),
};
}
Tool Prefixing Strategy
When using multiple servers, tool prefixes prevent naming conflicts:
{
"servers": [
{
"id": "server1",
"tool_prefix": "s1",
// Tool "search" becomes "s1_search"
},
{
"id": "server2",
"tool_prefix": "s2",
// Tool "search" becomes "s2_search"
}
]
}
Custom Headers and Authentication
API Key in Headers
#![allow(unused)]
fn main() {
let mut headers = HashMap::new();
headers.insert("X-API-Key".to_string(), "your-api-key".to_string());
headers.insert("X-Client-Version".to_string(), "1.0.0".to_string());
McpServerSource::Http {
url: "https://api.example.com/mcp".to_string(),
timeout_secs: Some(30),
headers: Some(headers),
}
}
OAuth2 Bearer Token
#![allow(unused)]
fn main() {
McpServerConfig {
// ...
bearer_token: Some("your-oauth2-token".to_string()),
// Automatically added as: Authorization: Bearer your-oauth2-token
}
}
Resource Subscriptions
Subscribe to specific resource patterns from MCP servers:
#![allow(unused)]
fn main() {
McpServerConfig {
id: "data_server".to_string(),
// ...
resources: Some(vec![
"file://data/**/*.json".to_string(), // All JSON files in data/
"db://users/*".to_string(), // All user records
"api://v1/metrics".to_string(), // Specific API endpoint
]),
// ...
}
}
Concurrency and Rate Limiting
Global Concurrency Control
#![allow(unused)]
fn main() {
McpClientConfig {
// ...
max_concurrent_calls: Some(5), // Max 5 tools executing simultaneously
}
}
Per-Tool Timeouts
#![allow(unused)]
fn main() {
McpClientConfig {
// ...
tool_timeout_secs: Some(30), // Each tool call times out after 30s
}
}
Custom Rate Limiting
# Python example with custom rate limiting
import time
from collections import deque
class RateLimitedMcpRunner:
def __init__(self, runner, max_calls_per_minute=60):
self.runner = runner
self.max_calls = max_calls_per_minute
self.call_times = deque()
def send_chat_completion_request(self, request):
# Remove calls older than 1 minute
now = time.time()
while self.call_times and self.call_times[0] < now - 60:
self.call_times.popleft()
# Check rate limit
if len(self.call_times) >= self.max_calls:
sleep_time = 60 - (now - self.call_times[0])
time.sleep(sleep_time)
# Make the call
self.call_times.append(now)
return self.runner.send_chat_completion_request(request)
Environment-Specific Configuration
Development vs Production
#![allow(unused)]
fn main() {
let mcp_config = if cfg!(debug_assertions) {
McpClientConfig {
servers: vec![/* development servers */],
tool_timeout_secs: Some(60), // Longer timeouts for debugging
max_concurrent_calls: Some(1), // Sequential execution for debugging
// ...
}
} else {
McpClientConfig {
servers: vec![/* production servers */],
tool_timeout_secs: Some(10), // Strict timeouts
max_concurrent_calls: Some(20), // Higher concurrency
// ...
}
};
}
Environment Variables
#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
servers: vec![
McpServerConfig {
// ...
bearer_token: std::env::var("HF_TOKEN").ok(),
source: McpServerSource::Http {
url: std::env::var("MCP_SERVER_URL")
.unwrap_or_else(|_| "https://hf.co/mcp".to_string()),
// ...
},
// ...
},
],
// ...
};
}
Error Handling and Fallbacks
Graceful Degradation
#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
servers: vec![
// Primary server
McpServerConfig {
id: "primary".to_string(),
enabled: true,
// ...
},
// Fallback server
McpServerConfig {
id: "fallback".to_string(),
enabled: check_primary_health().is_err(),
// ...
},
],
// ...
};
}
Tool-Specific Error Handling
# Handle specific tool errors
try:
response = runner.send_chat_completion_request(request)
except Exception as e:
if "tool_timeout" in str(e):
print("Tool execution timed out, trying with longer timeout...")
# Retry with extended timeout
elif "tool_not_found" in str(e):
print("Tool not available, falling back to built-in response...")
# Fallback logic
Monitoring and Debugging
Enable Debug Logging
#![allow(unused)]
fn main() {
std::env::set_var("RUST_LOG", "mistralrs_mcp=debug");
env_logger::init();
}
Tool Call Inspection
#![allow(unused)]
fn main() {
let response = model.send_chat_request(messages).await?;
// Check if tools were called
if let Some(tool_calls) = &response.choices[0].message.tool_calls {
for call in tool_calls {
println!("Tool: {}", call.function.name);
println!("Args: {}", call.function.arguments);
println!("ID: {}", call.id);
}
}
}
Performance Optimization
Connection Pooling
HTTP and WebSocket transports automatically use connection pooling. Configure pool size:
#![allow(unused)]
fn main() {
// Set via environment variable
std::env::set_var("MCP_POOL_SIZE", "10");
}
Caching Tool Responses
from functools import lru_cache
import json
@lru_cache(maxsize=100)
def cached_tool_call(tool_name, args_json):
args = json.loads(args_json)
# Tool execution logic
return result
# Use with MCP tools that have deterministic outputs
Security Best Practices
- Token Rotation: Implement automatic token refresh for long-running applications
- Least Privilege: Only enable required tools and resources
- Audit Logging: Log all tool calls for security monitoring
- Network Isolation: Use Process transport for sensitive local operations
- Input Validation: MCP servers should validate all tool inputs
Configuration Reference
This document covers environment variables and server configuration for mistral.rs.
Runtime Environment Variables
| Variable | Description |
|---|---|
MISTRALRS_DEBUG=1 | Enable debug mode: outputs tensor info files for GGUF/GGML models, increases logging verbosity |
MISTRALRS_NO_MMAP=1 | Disable memory-mapped file loading, forcing all tensor data into memory |
MISTRALRS_NO_MLA=1 | Disable MLA (Multi-head Latent Attention) optimization for DeepSeek V2/V3 and GLM-4.7-Flash |
MISTRALRS_ISQ_SINGLETHREAD=1 | Force ISQ (In-Situ Quantization) to run single-threaded |
MCP_CONFIG_PATH | Fallback path for MCP client configuration (used if --mcp-config not provided) |
KEEP_ALIVE_INTERVAL | SSE keep-alive interval in milliseconds (default: 10000) |
HF_HUB_CACHE | Override Hugging Face Hub cache directory |
Build-Time Environment Variables
| Variable | Description |
|---|---|
MISTRALRS_METAL_PRECOMPILE=0 | Skip Metal kernel precompilation (useful for CI) |
NVCC_CCBIN | Set CUDA compiler path |
CUDA_NVCC_FLAGS=-fPIE | Required on some Linux distributions |
CUDA_COMPUTE_CAP | Override CUDA compute capability (e.g., “80” for RTX 3090) |
Server Defaults
When running the HTTP server with mistralrs serve, these defaults apply:
| Setting | Default Value |
|---|---|
| Server IP | 0.0.0.0 (all interfaces) |
| Max request body | 50 MB |
| Max running sequences | 16 |
| Prefix cache count | 16 |
| SSE keep-alive | 10 seconds |
| PagedAttention (CUDA) | Enabled |
| PagedAttention (Metal) | Disabled |
| PA GPU memory usage | 90% of free memory |
| PA block size | 32 tokens |
Multi-Node Distributed Configuration
For multi-node setups, configure the head node and workers using environment variables.
Head Node
| Variable | Description |
|---|---|
MISTRALRS_MN_GLOBAL_WORLD_SIZE | Total number of devices across all nodes |
MISTRALRS_MN_HEAD_NUM_WORKERS | Number of worker nodes |
MISTRALRS_MN_HEAD_PORT | Port for head node communication |
Worker Nodes
| Variable | Description |
|---|---|
MISTRALRS_MN_WORKER_SERVER_ADDR | Address of head server to connect to |
MISTRALRS_MN_WORKER_ID | This worker’s ID |
MISTRALRS_MN_LOCAL_WORLD_SIZE | Number of GPUs on this node |
MISTRALRS_NO_NCCL=1 | Disable NCCL (use alternative backend) |
See Also
- CLI Reference - Command-line options
- CLI TOML Configuration - File-based configuration
- Distributed Inference - Multi-node setup guide
- PagedAttention - Memory management options
Engine Internals
This document describes internal engine behaviors in mistral.rs.
Overview
The mistral.rs engine manages model inference through a background thread pool. Each loaded model runs in its own engine thread, which handles request queuing, batching, and execution.
Warmup Run
When a text or vision model is loaded in a multi-threaded runtime, mistral.rs automatically performs a warmup (“dummy”) run:
- Sends a short completion request (“hello” with max 1 token) to initialize CUDA kernels and caches
- Logs “Beginning dummy run.” when starting and “Dummy run completed in Xs.” when finished
- Helps ensure more consistent performance for the first real user request
- Only runs for text and vision models (not diffusion/speech)
This warmup ensures that CUDA kernel compilation and memory allocation happens during model loading rather than during the first user request.
Automatic Engine Recovery
If the inference engine thread dies unexpectedly (e.g., due to a panic), mistral.rs can automatically recover:
- Detects dead engine threads when sending requests
- Automatically reboots the engine using saved configuration
- Logs “Engine {model_id} is dead, rebooting” followed by “Successfully rebooted engine {model_id}”
- Preserves all original configuration including KV cache settings, prefix cache, and tool callbacks
This ensures high availability without manual intervention.
Thread Model
Each model loaded in mistral.rs runs in its own dedicated engine thread:
- Main Thread: Handles HTTP requests, CLI interaction, and dispatches work to engine threads
- Engine Threads: Each loaded model has a dedicated thread for inference
- Background Workers: Tokenization and other preprocessing can run in parallel
For multi-model setups, each model gets its own engine thread, allowing true parallel inference across different models.
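Because requests to different models do not share an engine thread, they can also be awaited concurrently from the Rust SDK. The sketch below is illustrative only; it reuses the "gemma-vision" and "qwen-text" aliases from the Multi-Model Support examples and assumes tokio and anyhow.
use mistralrs::{Model, TextMessages, TextMessageRole};
// Illustrative only: `model` is assumed to be a multi-model instance built with
// MultiModelBuilder as shown in the Multi-Model Support section.
async fn parallel_requests(model: &Model) -> anyhow::Result<()> {
    let a = TextMessages::new()
        .add_message(TextMessageRole::User, "Describe Rust in one sentence.");
    let b = TextMessages::new()
        .add_message(TextMessageRole::User, "Describe Python in one sentence.");
    // Each request is executed by its model's own engine thread, so the two
    // futures make progress in parallel.
    let (resp_a, resp_b) = tokio::try_join!(
        model.send_chat_request_with_model(a, Some("gemma-vision")),
        model.send_chat_request_with_model(b, Some("qwen-text")),
    )?;
    println!("{}", resp_a.choices[0].message.content.as_ref().unwrap());
    println!("{}", resp_b.choices[0].message.content.as_ref().unwrap());
    Ok(())
}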
See Also
- Multi-Model Support - Load and manage multiple models
- Configuration - Environment variables affecting engine behavior
- PagedAttention - Memory management for high throughput