Introduction

Quick Links
| I want to… | Go to… |
|---|---|
| Install mistral.rs | Installation Guide |
| Understand cargo features | Cargo Features |
| Run a model | CLI Reference |
| Use the HTTP API | HTTP Server |
| Fix an error | Troubleshooting |
| Configure environment | Configuration |
| Check model support | Supported Models |
Getting Started
- Installation Guide - Install mistral.rs on your system
- Cargo Features - Complete cargo features reference
- CLI Reference - Complete CLI command reference
- CLI TOML Configuration - Configure via TOML files
- Troubleshooting - Common issues and solutions
SDKs & APIs
- Python SDK - Python package documentation
- Python Installation - Python SDK installation guide
- Rust SDK - Rust crate documentation
- HTTP Server - OpenAI-compatible HTTP API
- OpenResponses API - Stateful conversation API
Models
By Category
- Supported Models - Complete model list and compatibility
- Vision Models - Vision model overview
- Image Generation - Diffusion models
- Embeddings - Embedding model overview
Model-Specific Guides
Click to expand model guides
Text Models:
- DeepSeek V2 | DeepSeek V3
- Gemma 2 | Gemma 3 | Gemma 3n
- GLM4 | GLM-4.7-Flash | GLM-4.7
- Qwen 3 | SmolLM3 | GPT-OSS
Vision Models:
- Idefics 2 | Idefics 3
- LLaVA | Llama 3.2 Vision | Llama 4
- MiniCPM-O 2.6 | Mistral 3
- Phi 3.5 MoE | Phi 3.5 Vision | Phi 4 Multimodal
- Qwen 2-VL | Qwen 3 VL
Other Models:
Quantization & Optimization
- Quantization Overview - All supported quantization methods
- ISQ (In-Situ Quantization) - Quantize models at load time
- UQFF Format - Pre-quantized model format | Layout
- Topology - Per-layer quantization and device mapping
- Importance Matrix - Improve ISQ accuracy
Adapters & Model Customization
- Adapter Models - LoRA and X-LoRA support
- LoRA/X-LoRA Examples
- Non-Granular Scalings - X-LoRA optimization
- AnyMoE - Create MoE models from dense models
- MatFormer - Dynamic model sizing
Performance & Hardware
- Device Mapping - Multi-GPU and CPU offloading
- PagedAttention - Efficient KV cache management
- Speculative Decoding - Accelerate generation with draft models
- Flash Attention - Accelerated attention
- MLA - Multi-head Latent Attention
- Distributed Inference
Features
- Tool Calling - Function calling support
- Web Search - Integrated web search
- Chat Templates - Template customization
- Sampling Options - Generation parameters
- TOML Selector - Model selection syntax
- Multi-Model Support - Load multiple models
MCP (Model Context Protocol)
- MCP Client - Connect to external tools
- MCP Server - Serve models over MCP
- MCP Configuration
- MCP Transports
- MCP Advanced Usage
Reference
- Configuration - Environment variables and server defaults
- Engine Internals - Engine behaviors and recovery
- Supported Models - Complete compatibility tables
Contributing
See the main README for contribution guidelines.
Installation Guide
Quick Install (Recommended)
The install script automatically detects your hardware (CUDA, Metal, MKL) and builds with optimal features.
Linux/macOS:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
Prerequisites
1. Install required packages:
   - OpenSSL (Ubuntu): sudo apt install libssl-dev
   - pkg-config (Linux only): sudo apt install pkg-config
2. Install Rust from https://rustup.rs/:
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
   source $HOME/.cargo/env
3. (Optional) Set up HuggingFace authentication:
   mistralrs login
   Or use huggingface-cli login as documented here.
Supported Accelerators
| Accelerator | Feature Flag | Additional Flags |
|---|---|---|
| NVIDIA GPUs (CUDA) | cuda | flash-attn, flash-attn-v3, cudnn |
| Apple Silicon GPU (Metal) | metal | |
| CPU (Intel) | mkl | |
| CPU (Apple Accelerate) | accelerate | |
| Generic CPU (ARM/AVX) | none | ARM NEON / AVX enabled by default |
Note for Linux users: The metal feature is macOS-only. Use --features "cuda flash-attn cudnn" for NVIDIA GPUs or --features mkl for Intel CPUs instead of --all-features.
Feature Detection
Determine which features to enable based on your hardware:
| Hardware | Features |
|---|---|
| NVIDIA GPU (Ampere+, compute capability >= 8.0) | cuda cudnn flash-attn |
| NVIDIA GPU (Hopper, compute capability 9.0) | cuda cudnn flash-attn flash-attn-v3 |
| NVIDIA GPU (older) | cuda cudnn |
| Apple Silicon (macOS) | metal accelerate |
| Intel CPU with MKL | mkl |
| CPU only | (no features needed) |
Install from crates.io
cargo install mistralrs-cli --features "<your-features>"
Example:
cargo install mistralrs-cli --features "cuda flash-attn cudnn"
Build from Source
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo install --path mistralrs-cli --features "<your-features>"
Example:
cargo build --release --features "cuda flash-attn cudnn"
Docker
Docker images are available for quick deployment:
docker pull ghcr.io/ericlbuehler/mistral.rs:latest
docker run --gpus all -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
serve -m Qwen/Qwen3-4B
Docker images on GitHub Container Registry
Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
Python SDK
Install the Python package:
pip install mistralrs-cuda # For NVIDIA GPUs
pip install mistralrs-metal # For Apple Silicon
pip install mistralrs-mkl # For Intel CPUs
pip install mistralrs # CPU-only
Verify Installation
After installation, verify everything works:
# Check CLI is installed
mistralrs --help
# Run system diagnostics
mistralrs doctor
# Test with a small model
mistralrs run -m Qwen/Qwen3-0.6B
Getting Models
From Hugging Face Hub (Default)
Models download automatically from Hugging Face Hub:
mistralrs run -m meta-llama/Llama-3.2-3B-Instruct
For gated models, authenticate first:
mistralrs login
# Or: mistralrs run --token-source env:HF_TOKEN -m <model>
From Local Files
Pass a path to a downloaded model:
mistralrs run -m /path/to/model
Running GGUF Models
mistralrs run --format gguf -m author/model-repo -f model-quant.gguf
Specify tokenizer if needed:
mistralrs run --format gguf -m author/model-repo -f file.gguf -t author/official-tokenizer
Next Steps
- CLI Reference - All commands and options
- HTTP API - Run as an OpenAI-compatible server
- Python SDK - Python package documentation
- Troubleshooting - Common issues and solutions
Cargo Features Reference
This document provides a complete reference for all cargo features available in mistral.rs.
Quick Reference
| Feature | Description | Platform | Requires |
|---|---|---|---|
cuda | NVIDIA GPU acceleration | Linux, Windows | CUDA toolkit |
cudnn | NVIDIA cuDNN backend | Linux, Windows | cuda, cuDNN |
flash-attn | FlashAttention V2 | Linux, Windows | cuda, CC >= 8.0 |
flash-attn-v3 | FlashAttention V3 | Linux, Windows | cuda, CC >= 9.0 |
metal | Apple GPU acceleration | macOS | - |
accelerate | Apple CPU acceleration | macOS | - |
mkl | Intel MKL acceleration | Linux, Windows | Intel MKL |
nccl | Multi-GPU (NVIDIA NCCL) | Linux | cuda, NCCL |
ring | Multi-GPU/node (TCP ring) | All | - |
GPU Acceleration Features
cuda
Enables NVIDIA GPU acceleration via CUDA. This is the primary feature for running on NVIDIA GPUs.
Requirements:
- NVIDIA GPU
- CUDA toolkit installed
- Linux or Windows (WSL supported)
Usage:
cargo build --release --features cuda
cargo install mistralrs-cli --features cuda
What it enables:
- GPU tensor operations via CUDA
- PagedAttention on CUDA devices
- Quantized inference on GPU
cudnn
Enables NVIDIA cuDNN for optimized neural network primitives. Provides faster convolutions and other operations.
Requirements:
- cuda feature
- cuDNN library installed
Usage:
cargo build --release --features "cuda cudnn"
flash-attn
Enables FlashAttention V2 for faster attention computation. Significantly reduces memory usage and improves throughput.
Requirements:
- cuda feature (automatically enabled)
- GPU with compute capability >= 8.0 (Ampere or newer)
Compatible GPUs:
| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Ampere | 8.0, 8.6 | RTX 30 series, A100, A40 |
| Ada Lovelace | 8.9 | RTX 40 series, L40S |
| Blackwell | 10.0, 12.0 | RTX 50 series |
Usage:
cargo build --release --features "cuda flash-attn cudnn"
Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.
flash-attn-v3
Enables FlashAttention V3 for Hopper architecture GPUs. Provides additional performance improvements over V2 on supported hardware.
Requirements:
- cuda feature (automatically enabled)
- GPU with compute capability >= 9.0 (Hopper)
Compatible GPUs:
| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Hopper | 9.0 | H100, H800 |
Usage:
cargo build --release --features "cuda flash-attn-v3 cudnn"
Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.
metal
Enables Apple Metal GPU acceleration for macOS devices.
Requirements:
- macOS with Apple Silicon or AMD GPU
- macOS only (not available on Linux)
Usage:
cargo build --release --features metal
What it enables:
- GPU tensor operations via Metal
- PagedAttention on Metal devices (opt-in via --paged-attn)
- Quantized inference on Apple GPUs
Note: PagedAttention is disabled by default on Metal. Enable it with the --paged-attn flag.
CPU Acceleration Features
accelerate
Enables Apple’s Accelerate framework for optimized CPU operations on macOS.
Requirements:
- macOS
Usage:
cargo build --release --features accelerate
# Or combined with Metal:
cargo build --release --features "metal accelerate"
mkl
Enables Intel Math Kernel Library (MKL) for optimized CPU operations.
Requirements:
- Intel MKL installed
- Intel CPU recommended (works on AMD but Intel-optimized)
Usage:
cargo build --release --features mkl
Distributed Inference Features
nccl
Enables multi-GPU distributed inference using NVIDIA NCCL (NVIDIA Collective Communications Library). Implements tensor parallelism for splitting large models across multiple GPUs.
Requirements:
- cuda feature (automatically enabled)
- Multiple NVIDIA GPUs
- NCCL library
- World size must be a power of 2 (1, 2, 4, 8, etc.)
Usage:
cargo build --release --features "cuda nccl"
# Run with specific GPU count
MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct
Environment Variables:
| Variable | Description |
|---|---|
MISTRALRS_MN_LOCAL_WORLD_SIZE | Number of GPUs to use (defaults to all) |
MISTRALRS_NO_NCCL=1 | Disable NCCL and use device mapping instead |
Multi-node setup requires additional environment variables. See NCCL documentation for details.
Note: When NCCL is enabled, automatic device mapping is disabled.
ring
Enables distributed tensor-parallel inference using a TCP-based ring topology. Works across multiple machines without requiring NCCL.
Requirements:
- World size must be a power of 2 (2, 4, 8, etc.)
- TCP ports must be open between nodes
Usage:
cargo build --release --features ring
# Configure via JSON file
export RING_CONFIG=path/to/ring_config.json
mistralrs serve -m model-id
Configuration:
Create a JSON configuration file for each process:
{
"master_ip": "0.0.0.0",
"master_port": 1234,
"port": 12345,
"right_port": 12346,
"rank": 0,
"world_size": 2
}
| Field | Description |
|---|---|
master_ip | IP address for master node |
master_port | Port for master node |
port | Local port for incoming connections |
right_port | Port of right neighbor in ring |
right_ip | IP of right neighbor (optional, defaults to localhost) |
rank | Process rank (0 to world_size-1) |
world_size | Total number of processes (must be power of 2) |
See Ring documentation for detailed setup instructions.
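The per-rank config files differ only in port, right_port, and rank. As an illustrative sketch (assuming all ranks run on one host and that the right neighbor of rank i is rank (i+1) mod world_size; the port numbers here are hypothetical), the configs could be generated like this:
import json

world_size = 2                       # must be a power of 2
base_port = 12345                    # hypothetical first listen port
master_ip, master_port = "0.0.0.0", 1234

for rank in range(world_size):
    config = {
        "master_ip": master_ip,
        "master_port": master_port,
        "port": base_port + rank,                           # this process listens here
        "right_port": base_port + (rank + 1) % world_size,  # right neighbor's listen port
        "rank": rank,
        "world_size": world_size,
    }
    with open(f"ring_config_rank{rank}.json", "w") as f:
        json.dump(config, f, indent=2)
Each process is then started with RING_CONFIG pointing at its own file, as shown above.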
Feature Combinations
Recommended Combinations by Hardware
| Hardware | Recommended Features |
|---|---|
| NVIDIA Ampere+ (RTX 30/40, A100) | cuda cudnn flash-attn |
| NVIDIA Hopper (H100) | cuda cudnn flash-attn-v3 |
| NVIDIA older GPUs | cuda cudnn |
| Apple Silicon | metal accelerate |
| Intel CPU | mkl |
| Generic CPU | (no features needed) |
| Multi-GPU NVIDIA | cuda cudnn flash-attn nccl |
| Multi-node/cross-platform | ring (plus GPU features) |
Installation Examples
# NVIDIA GPU with all optimizations
cargo install mistralrs-cli --features "cuda cudnn flash-attn"
# Apple Silicon
cargo install mistralrs-cli --features "metal accelerate"
# Intel CPU
cargo install mistralrs-cli --features "mkl"
# Multi-GPU NVIDIA setup
cargo install mistralrs-cli --features "cuda cudnn flash-attn nccl"
# Build from source with CUDA
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo build --release --features "cuda cudnn flash-attn"
Internal Features
These features are primarily for library development and are not typically used directly:
| Feature | Description |
|---|---|
pyo3_macros | Python bindings support (used by mistralrs-pyo3) |
utoipa | OpenAPI documentation generation |
Python Package Features
The Python SDK is distributed as separate packages with features pre-configured:
| Package | Equivalent Features |
|---|---|
mistralrs-cuda | cuda cudnn flash-attn |
mistralrs-metal | metal accelerate |
mistralrs-mkl | mkl |
mistralrs | CPU only |
pip install mistralrs-cuda # NVIDIA GPUs
pip install mistralrs-metal # Apple Silicon
pip install mistralrs-mkl # Intel CPUs
pip install mistralrs # Generic CPU
Troubleshooting
Diagnosing Issues
Use mistralrs doctor to diagnose your system configuration and verify features are working correctly:
mistralrs doctor
This command checks:
- Detected hardware (GPUs, CPU features)
- Installed libraries (CUDA, cuDNN, etc.)
- Feature compatibility
- Common configuration issues
Feature not working
1. Run mistralrs doctor to check system configuration
2. Verify the feature is enabled in your build: cargo build --release --features "your-features" -v
3. Check hardware compatibility (especially for flash-attn)
4. Ensure required libraries are installed (CUDA, cuDNN, MKL, etc.)
Conflicting features
- flash-attn and flash-attn-v3 are mutually exclusive
- metal is macOS-only; don’t use with cuda
- nccl requires cuda
Build errors
- CUDA not found: Ensure the CUDA toolkit is installed and nvcc is in PATH
- MKL not found: Install Intel oneAPI or standalone MKL
- Metal errors on Linux: Remove the metal feature (macOS only)
See Troubleshooting for more solutions.
mistralrs CLI Reference
This is the comprehensive CLI reference for mistralrs. The CLI provides commands for interactive mode, HTTP server, builtin UI, quantization, and system diagnostics.
Table of Contents
- Commands
- run: run model in interactive mode
- serve: start HTTP/MCP server and (optionally) the UI
- from-config: run from a TOML configuration file
- quantize: generate UQFF quantized model file
- tune: recommend quantization + device mapping for a model
- doctor: run system diagnostics and environment checks
- login: authenticate with HuggingFace Hub
- cache: manage the HuggingFace model cache
- bench: run performance benchmarks
- completions: generate shell completions
- Model Types
- Features
- Global Options
- Interactive Commands
Commands
run - Interactive Mode
Start a model in interactive mode for conversational use.
mistralrs run [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified. This allows a shorter syntax.
Examples:
# Run a text model interactively (shorthand - auto type is implied)
mistralrs run -m Qwen/Qwen3-4B
# Explicit auto type (equivalent to above)
mistralrs run auto -m Qwen/Qwen3-4B
# Run with thinking mode enabled
mistralrs run -m Qwen/Qwen3-4B --enable-thinking
# Run a vision model
mistralrs run -m google/gemma-3-4b-it
Options:
| Option | Description |
|---|---|
--enable-thinking | Enable thinking mode for models that support it |
The run command also accepts all runtime options.
serve - HTTP Server
Start an HTTP server with OpenAI-compatible API endpoints.
mistralrs serve [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified.
Examples:
# Start server on default port 1234 (shorthand)
mistralrs serve -m Qwen/Qwen3-4B
# Explicit auto type (equivalent to above)
mistralrs serve auto -m Qwen/Qwen3-4B
# Start server with web UI
mistralrs serve -m Qwen/Qwen3-4B --ui
# Start server on custom port
mistralrs serve -m Qwen/Qwen3-4B -p 3000
# Start server with MCP support
mistralrs serve -m Qwen/Qwen3-4B --mcp-port 8081
Server Options:
| Option | Default | Description |
|---|---|---|
-p, --port <PORT> | 1234 | HTTP server port |
--host <HOST> | 0.0.0.0 | Bind address |
--ui | disabled | Serve built-in web UI at /ui |
--mcp-port <PORT> | none | MCP protocol server port |
--mcp-config <PATH> | none | MCP client configuration file |
The serve command also accepts all runtime options.
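Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using Python's requests package (assuming the default port 1234 and the standard /v1/chat/completions route):
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-4B",  # the model id you passed to serve
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])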
quantize - UQFF Generation
Generate a UQFF (Unified Quantized File Format) file from a model.
mistralrs quantize [MODEL_TYPE] -m <MODEL_ID> --isq <LEVEL> -o <OUTPUT>
Examples:
# Quantize a text model to 4-bit
mistralrs quantize -m Qwen/Qwen3-4B --isq 4 -o qwen3-4b-q4.uqff
# Quantize with Q4_K format
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-q4k.uqff
# Quantize a vision model
mistralrs quantize -m google/gemma-3-4b-it --isq 4 -o gemma3-4b-q4.uqff
# Quantize with imatrix for better quality
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k --imatrix imatrix.dat -o qwen3-4b-q4k.uqff
Quantize Options:
| Option | Required | Description |
|---|---|---|
-m, --model-id <ID> | Yes | Model ID or local path |
--isq <LEVEL> | Yes | Quantization level (see ISQ Quantization) |
-o, --output <PATH> | Yes | Output UQFF file path |
--isq-organization <TYPE> | No | ISQ organization strategy: default or moqe |
--imatrix <PATH> | No | imatrix file for enhanced quantization |
--calibration-file <PATH> | No | Calibration file for imatrix generation |
tune - Recommendations
Get quantization and device mapping recommendations for a model. The tune command analyzes your hardware and shows all quantization options with their estimated memory usage, context room, and quality trade-offs.
mistralrs tune [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified, which supports all model types. See details.
Examples:
# Get balanced recommendations (shorthand)
mistralrs tune -m Qwen/Qwen3-4B
# Get quality-focused recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile quality
# Get fast inference recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile fast
# Output as JSON
mistralrs tune -m Qwen/Qwen3-4B --json
# Generate a TOML config file with recommendations
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml
Example Output (CUDA):
Tuning Analysis
===============
Model: Qwen/Qwen3-4B
Profile: Balanced
Backend: cuda
Total VRAM: 24.0 GB
Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant │ Est. Size │ VRAM % │ Context Room │ Quality │ Status │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB │ 35% │ 48k │ Baseline │ ✅ Fits │
│ Q8_0 │ 4.50 GB │ 19% │ 96k │ Near-lossless │ 🚀 Recommended │
│ Q6K │ 3.70 GB │ 15% │ 128k (max) │ Good │ ✅ Fits │
│ Q5K │ 3.20 GB │ 13% │ 128k (max) │ Good │ ✅ Fits │
│ Q4K │ 2.60 GB │ 11% │ 128k (max) │ Acceptable │ ✅ Fits │
│ Q3K │ 2.00 GB │ 8% │ 128k (max) │ Degraded │ ✅ Fits │
│ Q2K │ 1.50 GB │ 6% │ 128k (max) │ Degraded │ ✅ Fits │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘
Recommended Command
-------------------
mistralrs serve -m Qwen/Qwen3-4B --isq q8_0
[INFO] PagedAttention is available (mode: auto)
Example Output (Metal):
On macOS with Metal, the command recommends Apple Format Quantization (AFQ) types:
Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant │ Est. Size │ VRAM % │ Context Room │ Quality │ Status │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB │ 53% │ 24k │ Baseline │ ✅ Fits │
│ AFQ8 │ 4.50 GB │ 28% │ 56k │ Near-lossless │ 🚀 Recommended │
│ AFQ6 │ 3.70 GB │ 23% │ 64k │ Good │ ✅ Fits │
│ AFQ4 │ 2.60 GB │ 16% │ 128k (max) │ Acceptable │ ✅ Fits │
│ AFQ3 │ 2.00 GB │ 13% │ 128k (max) │ Degraded │ ✅ Fits │
│ AFQ2 │ 1.50 GB │ 9% │ 128k (max) │ Degraded │ ✅ Fits │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘
Status Legend:
- 🚀 Recommended: Best option for your profile and hardware
- ✅ Fits: Model fits entirely in GPU memory
- ⚠️ Hybrid: Model requires CPU offloading (slower due to PCIe bottleneck)
- ❌ Too Large: Model doesn’t fit even with CPU offload
Tune Options:
| Option | Default | Description |
|---|---|---|
--profile <PROFILE> | balanced | Tuning profile: quality, balanced, or fast |
--json | disabled | Output JSON instead of human-readable text |
--emit-config <PATH> | none | Emit a TOML config file with recommended settings |
doctor - System Diagnostics
Run comprehensive system diagnostics and environment checks. The doctor command helps identify configuration issues and validates your system is ready for inference.
mistralrs doctor [OPTIONS]
Examples:
# Run diagnostics
mistralrs doctor
# Output as JSON
mistralrs doctor --json
Checks Performed:
- CPU Extensions: AVX, AVX2, AVX-512, FMA support (x86 only; ARM shows NEON)
- Binary/Hardware Match: Validates CUDA/Metal features match detected hardware
- GPU Compute Capability: Reports compute version and Flash Attention v2/v3 compatibility
- Flash Attention Features: Warns if hardware supports FA but binary doesn’t have it enabled
- Hugging Face Connectivity: Tests connection and token validity using a gated model
- HF Cache: Verifies cache directory is writable
- Disk Space: Checks available storage
Options:
| Option | Description |
|---|---|
--json | Output JSON instead of human-readable text |
login - HuggingFace Authentication
Authenticate with HuggingFace Hub by saving your token to the local cache.
mistralrs login [OPTIONS]
Examples:
# Interactive login (prompts for token)
mistralrs login
# Provide token directly
mistralrs login --token hf_xxxxxxxxxxxxx
The token is saved to the standard HuggingFace cache location:
- Linux/macOS: ~/.cache/huggingface/token
- Windows: C:\Users\<user>\.cache\huggingface\token
If the HF_HOME environment variable is set, the token is saved to $HF_HOME/token.
Options:
| Option | Description |
|---|---|
--token <TOKEN> | Provide token directly (non-interactive) |
cache - Model Management
Manage the HuggingFace model cache. List cached models or delete specific models.
mistralrs cache <SUBCOMMAND>
Subcommands:
cache list
List all cached models with their sizes and last used times.
mistralrs cache list
Example output:
HuggingFace Model Cache
-----------------------
┌──────────────────────────┬──────────┬─────────────┐
│ Model │ Size │ Last Used │
├──────────────────────────┼──────────┼─────────────┤
│ Qwen/Qwen3-4B │ 8.5 GB │ today │
│ google/gemma-3-4b-it │ 6.2 GB │ 2 days ago │
│ meta-llama/Llama-3.2-3B │ 5.8 GB │ 1 week ago │
└──────────────────────────┴──────────┴─────────────┘
Total: 3 models, 20.5 GB
Cache directory: /home/user/.cache/huggingface/hub
cache delete
Delete a specific model from the cache.
mistralrs cache delete -m <MODEL_ID>
Examples:
# Delete a specific model
mistralrs cache delete -m Qwen/Qwen3-4B
# Delete a model with organization
mistralrs cache delete -m meta-llama/Llama-3.2-3B
bench - Performance Benchmarking
Run performance benchmarks to measure prefill and decode speeds.
mistralrs bench [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
Note: MODEL_TYPE is optional and defaults to auto if not specified.
Examples:
# Run default benchmark (512 prompt tokens, 128 generated tokens, 3 iterations)
mistralrs bench -m Qwen/Qwen3-4B
# Custom prompt and generation lengths
mistralrs bench -m Qwen/Qwen3-4B --prompt-len 1024 --gen-len 256
# More iterations for better statistics
mistralrs bench -m Qwen/Qwen3-4B --iterations 10
# With ISQ quantization
mistralrs bench -m Qwen/Qwen3-4B --isq q4k
Example output:
Benchmark Results
=================
Model: Qwen/Qwen3-4B
Iterations: 3
┌────────────────────────┬─────────────────┬─────────────────┐
│ Test │ T/s │ Latency │
├────────────────────────┼─────────────────┼─────────────────┤
│ Prefill (512 tokens) │ 2847.3 ± 45.2 │ 179.82 ms (TTFT)│
│ Decode (128 tokens) │ 87.4 ± 2.1 │ 11.44 ms/T │
└────────────────────────┴─────────────────┴─────────────────┘
- T/s: Tokens per second (throughput)
- Latency: For prefill, shows TTFT (Time To First Token) in milliseconds. For decode, shows ms per token.
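The two columns are directly related: in the sample above, TTFT ≈ 512 tokens ÷ 2847.3 T/s ≈ 0.180 s ≈ 179.8 ms, and decode latency ≈ 1000 ÷ 87.4 ≈ 11.4 ms per token.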
Options:
| Option | Default | Description |
|---|---|---|
--prompt-len <N> | 512 | Number of tokens in prompt (prefill test) |
--gen-len <N> | 128 | Number of tokens to generate (decode test) |
--iterations <N> | 3 | Number of benchmark iterations |
--warmup <N> | 1 | Number of warmup runs (discarded) |
The bench command also accepts all model loading options (ISQ, device mapping, etc.).
from-config - TOML Configuration
Run the CLI from a TOML configuration file. This is the recommended way to run multiple models simultaneously, including models of different types (e.g., text + vision + embedding).
See CLI_CONFIG.md for full TOML configuration format details.
mistralrs from-config --file <PATH>
Example:
mistralrs from-config --file config.toml
Multi-model example (config.toml):
command = "serve"
[server]
port = 1234
ui = true
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
[[models]]
kind = "vision"
model_id = "google/gemma-3-4b-it"
[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"
completions - Shell Completions
Generate shell completions for your shell.
mistralrs completions <SHELL>
Examples:
# Generate bash completions
mistralrs completions bash > ~/.local/share/bash-completion/completions/mistralrs
# Generate zsh completions
mistralrs completions zsh > ~/.zfunc/_mistralrs
# Generate fish completions
mistralrs completions fish > ~/.config/fish/completions/mistralrs.fish
Supported Shells: bash, zsh, fish, elvish, powershell
Model Types
auto
Auto-detect model type. This is the recommended option for most models and is used by default when you omit the explicit model type.
mistralrs run -m Qwen/Qwen3-4B
mistralrs serve -m Qwen/Qwen3-4B
The auto type supports text, vision, and other model types through automatic detection.
text
Explicit text generation model configuration.
mistralrs run text -m Qwen/Qwen3-4B
mistralrs serve text -m Qwen/Qwen3-4B
vision
Vision-language models that can process images and text.
mistralrs run vision -m google/gemma-3-4b-it
mistralrs serve vision -m google/gemma-3-4b-it
Vision Options:
| Option | Description |
|---|---|
--max-edge <SIZE> | Maximum edge length for image resizing (aspect ratio preserved) |
--max-num-images <N> | Maximum number of images per request |
--max-image-length <SIZE> | Maximum image dimension for device mapping |
diffusion
Image generation models using diffusion.
mistralrs run diffusion -m black-forest-labs/FLUX.1-schnell
mistralrs serve diffusion -m black-forest-labs/FLUX.1-schnell
speech
Speech synthesis models.
mistralrs run speech -m nari-labs/Dia-1.6B
mistralrs serve speech -m nari-labs/Dia-1.6B
embedding
Text embedding models. These do not support interactive mode but can be used with the HTTP server.
mistralrs serve embedding -m google/embeddinggemma-300m
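With the server running, embeddings can be requested over HTTP. A minimal sketch with Python's requests package (assuming the default port 1234 and an OpenAI-style /v1/embeddings route):
import requests

resp = requests.post(
    "http://localhost:1234/v1/embeddings",
    json={
        "model": "google/embeddinggemma-300m",  # the embedding model you served
        "input": ["task: query | text: superconductors"],
    },
)
print(len(resp.json()["data"][0]["embedding"]))  # embedding vector length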
Features
ISQ Quantization
In-situ quantization (ISQ) reduces model memory usage by quantizing weights at load time. See details about ISQ here.
Usage:
# Simple bit-width quantization
mistralrs run -m Qwen/Qwen3-4B --isq 4
mistralrs run -m Qwen/Qwen3-4B --isq 8
# GGML-style quantization
mistralrs run -m Qwen/Qwen3-4B --isq q4_0
mistralrs run -m Qwen/Qwen3-4B --isq q4_1
mistralrs run -m Qwen/Qwen3-4B --isq q4k
mistralrs run -m Qwen/Qwen3-4B --isq q5k
mistralrs run -m Qwen/Qwen3-4B --isq q6k
ISQ Organization:
# Use MOQE organization for potentially better quality
mistralrs run -m Qwen/Qwen3-4B --isq q4k --isq-organization moqe
UQFF Files
UQFF (Unified Quantized File Format) provides pre-quantized model files for faster loading.
Generate a UQFF file:
mistralrs quantize auto -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-q4k.uqff
Load from UQFF:
mistralrs run -m Qwen/Qwen3-4B --from-uqff qwen3-4b-q4k.uqff
Multiple UQFF files (semicolon-separated):
mistralrs run -m Qwen/Qwen3-4B --from-uqff "part1.uqff;part2.uqff"
PagedAttention
PagedAttention enables efficient memory management for the KV cache. It is automatically enabled on CUDA and disabled on Metal/CPU by default.
Control PagedAttention:
# Auto mode (default): enabled on CUDA, disabled on Metal/CPU
mistralrs serve -m Qwen/Qwen3-4B --paged-attn auto
# Force enable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn on
# Force disable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn off
Memory allocation options (mutually exclusive):
# Allocate for specific context length (recommended)
mistralrs serve -m Qwen/Qwen3-4B --pa-context-len 8192
# Allocate specific GPU memory in MB
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-mb 4096
# Allocate fraction of GPU memory (0.0-1.0)
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-fraction 0.8
Additional options:
| Option | Description |
|---|---|
--pa-block-size <SIZE> | Tokens per block (default: 32 on CUDA) |
--pa-cache-type <TYPE> | KV cache quantization type (default: auto) |
Device Mapping
Control how model layers are distributed across devices.
Automatic mapping:
# Use defaults (automatic)
mistralrs run -m Qwen/Qwen3-4B
Manual layer assignment:
# Assign 10 layers to GPU 0, 20 layers to GPU 1
mistralrs run -m Qwen/Qwen3-4B -n "0:10;1:20"
# Equivalent long form
mistralrs run -m Qwen/Qwen3-4B --device-layers "0:10;1:20"
CPU-only execution:
mistralrs run -m Qwen/Qwen3-4B --cpu
Topology file:
mistralrs run -m Qwen/Qwen3-4B --topology topology.yaml
Custom HuggingFace cache:
mistralrs run -m Qwen/Qwen3-4B --hf-cache /path/to/cache
Device mapping options:
| Option | Default | Description |
|---|---|---|
-n, --device-layers <MAPPING> | auto | Device layer mapping (format: ORD:NUM;...) |
--topology <PATH> | none | Topology YAML file for device mapping |
--hf-cache <PATH> | none | Custom HuggingFace cache directory |
--cpu | disabled | Force CPU-only execution |
--max-seq-len <LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <SIZE> | 1 | Max batch size for automatic device mapping |
LoRA and X-LoRA
Apply LoRA or X-LoRA adapters to models.
LoRA:
# Single LoRA adapter
mistralrs run -m Qwen/Qwen3-4B --lora my-lora-adapter
# Multiple LoRA adapters (semicolon-separated)
mistralrs run -m Qwen/Qwen3-4B --lora "adapter1;adapter2"
X-LoRA:
# X-LoRA adapter with ordering file
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json
# With target non-granular index
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json --tgt-non-granular-index 2
Chat Templates
Override the model’s default chat template.
Use a template file:
# JSON template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.json
# Jinja template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.jinja
Explicit Jinja override:
mistralrs run -m Qwen/Qwen3-4B --jinja-explicit custom.jinja
Web Search
Enable web search capabilities (requires an embedding model).
# Enable search with default embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search
# Specify embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search --search-embedding-model embedding-gemma
Thinking Mode
Enable thinking/reasoning mode for models that support it (like DeepSeek, Qwen3).
mistralrs run -m Qwen/Qwen3-4B --enable-thinking
In interactive mode, thinking content is displayed in gray text before the final response.
Global Options
These options apply to all commands.
| Option | Default | Description |
|---|---|---|
--seed <SEED> | none | Random seed for reproducibility |
-l, --log <PATH> | none | Log all requests and responses to file |
--token-source <SOURCE> | cache | HuggingFace authentication token source |
-V, --version | N/A | Print version information and exit |
-h, --help | N/A | Print help message (use with any subcommand) |
Token source formats:
- cache - Use cached HuggingFace token (default)
- literal:<token> - Use literal token value
- env:<var> - Read token from environment variable
- path:<file> - Read token from file
- none - No authentication
Examples:
# Set random seed
mistralrs run -m Qwen/Qwen3-4B --seed 42
# Enable logging
mistralrs run -m Qwen/Qwen3-4B --log requests.log
# Use token from environment variable
mistralrs run -m meta-llama/Llama-3.2-3B-Instruct --token-source env:HF_TOKEN
Runtime Options
These options are available for both run and serve commands.
| Option | Default | Description |
|---|---|---|
--max-seqs <N> | 32 | Maximum concurrent sequences |
--no-kv-cache | disabled | Disable KV cache entirely |
--prefix-cache-n <N> | 16 | Number of prefix caches to hold (0 to disable) |
-c, --chat-template <PATH> | none | Custom chat template file (.json or .jinja) |
-j, --jinja-explicit <PATH> | none | Explicit JINJA template override |
--enable-search | disabled | Enable web search |
--search-embedding-model <MODEL> | none | Embedding model for search |
Model Source Options
These options are common across model types.
| Option | Description |
|---|---|
-m, --model-id <ID> | HuggingFace model ID or local path (required) |
-t, --tokenizer <PATH> | Path to local tokenizer.json file |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) |
--dtype <TYPE> | Model data type (default: auto) |
Format Options
For loading quantized models.
| Option | Description |
|---|---|
--format <FORMAT> | Model format: plain, gguf, or ggml (auto-detected) |
-f, --quantized-file <FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated) |
--tok-model-id <ID> | Model ID for tokenizer when using quantized format |
--gqa <VALUE> | GQA value for GGML models (default: 1) |
Examples:
# Load a GGUF model
mistralrs run -m Qwen/Qwen3-4B --format gguf -f model.gguf
# Multiple GGUF files
mistralrs run -m Qwen/Qwen3-4B --format gguf -f "model-part1.gguf;model-part2.gguf"
Interactive Commands
When running in interactive mode (mistralrs run), the following commands are available:
| Command | Description |
|---|---|
\help | Display help message |
\exit | Quit interactive mode |
\system <message> | Add a system message without running the model |
\clear | Clear the chat history |
\temperature <float> | Set sampling temperature (0.0 to 2.0) |
\topk <int> | Set top-k sampling value (>0) |
\topp <float> | Set top-p sampling value (0.0 to 1.0) |
Examples:
> \system Always respond as a pirate.
> \temperature 0.7
> \topk 50
> Hello!
Ahoy there, matey! What brings ye to these waters?
> \clear
> \exit
Vision Model Interactive Mode:
For vision models, you can include images in your prompts by specifying file paths or URLs:
> Describe this image: /path/to/image.jpg
> Compare these images: image1.png image2.png
> Describe the image and transcribe the audio: photo.jpg recording.mp3
Note: The CLI automatically detects paths to supported image and audio files within your prompt. You do not need special syntax; simply paste the absolute or relative path to the file.
Supported image formats: PNG, JPEG, BMP, GIF, WebP
Supported audio formats: WAV, MP3, FLAC, OGG
mistralrs-cli TOML Config
mistralrs-cli can run entirely from a single TOML configuration file. This config supports multiple models and mirrors the CLI options.
Usage
mistralrs from-config --file path/to/config.toml
Quick Example
command = "serve"
[server]
port = 1234
ui = true
[runtime]
max_seqs = 32
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
[models.quantization]
in_situ_quant = "q4k"
Complete Reference
Top-Level Options
| Option | Commands | Description |
|---|---|---|
command | all | Required. Either "serve" or "run" |
enable_thinking | run | Enable thinking mode (default: false) |
default_model_id | serve | Default model ID for API requests (must match a model_id in [[models]]) |
[global] Section
Global options that apply to the entire run.
| Option | Default | Description |
|---|---|---|
seed | none | Random seed for reproducibility |
log | none | Log all requests/responses to this file path |
token_source | "cache" | HuggingFace auth: "cache", "none", "literal:<token>", "env:<var>", "path:<file>" |
[server] Section (serve only)
HTTP server configuration.
| Option | Default | Description |
|---|---|---|
port | 1234 | HTTP server port |
host | "0.0.0.0" | Bind address |
ui | false | Serve built-in web UI at /ui |
mcp_port | none | MCP protocol server port (enables MCP if set) |
mcp_config | none | MCP client configuration file path |
[runtime] Section
Runtime inference options.
| Option | Default | Description |
|---|---|---|
max_seqs | 32 | Maximum concurrent sequences |
no_kv_cache | false | Disable KV cache entirely |
prefix_cache_n | 16 | Number of prefix caches to hold (0 to disable) |
chat_template | none | Custom chat template file (.json or .jinja) |
jinja_explicit | none | Explicit JINJA template override |
enable_search | false | Enable web search |
search_embedding_model | none | Embedding model for search (e.g., "embedding-gemma") |
[paged_attn] Section
PagedAttention configuration.
| Option | Default | Description |
|---|---|---|
mode | "auto" | "auto" (CUDA on, Metal off), "on", or "off" |
context_len | none | Allocate KV cache for this context length |
memory_mb | none | GPU memory to allocate in MB (conflicts with context_len) |
memory_fraction | none | GPU memory utilization 0.0-1.0 (conflicts with above) |
block_size | 32 | Tokens per block |
cache_type | "auto" | KV cache type |
Note: If none of context_len, memory_mb, or memory_fraction is specified, PagedAttention defaults to 90% of available VRAM. These three options are mutually exclusive.
[[models]] Section
Define one or more models. Each [[models]] entry creates a new model.
Top-Level Model Options
| Option | Required | Description |
|---|---|---|
kind | yes | Model type: "auto", "text", "vision", "diffusion", "speech", "embedding" |
model_id | yes | HuggingFace model ID or local path |
tokenizer | no | Path to local tokenizer.json |
arch | no | Model architecture (auto-detected if not specified) |
dtype | "auto" | Data type: "auto", "f16", "bf16", "f32" |
chat_template | no | Per-model chat template override |
jinja_explicit | no | Per-model JINJA template override |
[models.format] - Model Format
| Option | Default | Description |
|---|---|---|
format | auto | "plain" (safetensors), "gguf", or "ggml" |
quantized_file | none | Quantized filename(s) for GGUF/GGML (semicolon-separated) |
tok_model_id | none | Model ID for tokenizer when using quantized format |
gqa | 1 | GQA value for GGML models |
[models.adapter] - LoRA/X-LoRA
| Option | Description |
|---|---|
lora | LoRA adapter ID(s), semicolon-separated |
xlora | X-LoRA adapter ID (conflicts with lora) |
xlora_order | X-LoRA ordering JSON file (requires xlora) |
tgt_non_granular_index | Target non-granular index for X-LoRA |
[models.quantization] - ISQ/UQFF
| Option | Description |
|---|---|
in_situ_quant | ISQ level: "4", "8", "q4_0", "q4k", "q6k", etc. |
from_uqff | UQFF file(s) to load (semicolon-separated) |
isq_organization | ISQ strategy: "default" or "moqe" |
imatrix | imatrix file for enhanced quantization |
calibration_file | Calibration file for imatrix generation |
[models.device] - Device Mapping
| Option | Default | Description |
|---|---|---|
cpu | false | Force CPU-only (must be consistent across all models) |
device_layers | auto | Layer mapping as a list of "ORD:NUM" entries, e.g., ["0:10", "1:20"] |
topology | none | Topology YAML file |
hf_cache | none | Custom HuggingFace cache directory |
max_seq_len | 4096 | Max sequence length for auto device mapping |
max_batch_size | 1 | Max batch size for auto device mapping |
[models.vision] - Vision Options
| Option | Description |
|---|---|
max_edge | Maximum edge length for image resizing |
max_num_images | Maximum images per request |
max_image_length | Maximum image dimension for device mapping |
Full Examples
Multi-Model Server with UI
command = "serve"
[global]
seed = 42
[server]
host = "0.0.0.0"
port = 1234
ui = true
[runtime]
max_seqs = 32
enable_search = true
search_embedding_model = "embedding-gemma"
[paged_attn]
mode = "auto"
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
dtype = "auto"
[models.quantization]
in_situ_quant = "q4k"
[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"
[models.vision]
max_num_images = 4
[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"
Interactive Mode with Thinking
command = "run"
enable_thinking = true
[runtime]
max_seqs = 16
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
GGUF Model
command = "serve"
[server]
port = 1234
[[models]]
kind = "text"
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
[models.format]
format = "gguf"
quantized_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
Device Layer Mapping
command = "serve"
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.1-70B-Instruct"
[models.device]
device_layers = ["0:40", "1:40"]
[models.quantization]
in_situ_quant = "q4k"
Notes
- cpu must be consistent across all models if specified
- default_model_id (serve only) must match a model_id in [[models]]
- search_embedding_model requires enable_search = true
Troubleshooting
Common issues and solutions for mistral.rs.
Debug Mode
Enable debug mode for more information:
MISTRALRS_DEBUG=1 mistralrs run -m <model>
Debug mode causes:
- If loading a GGUF or GGML model, outputs a file containing the names, shapes, and types of each tensor:
mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt
- Increased logging verbosity
System Diagnostics
Run the built-in diagnostics tool:
mistralrs doctor
This checks your system configuration and reports any issues.
Common Issues
CUDA Issues
Setting the CUDA compiler path:
- Set the NVCC_CCBIN environment variable during build
Error: recompile with -fPIE:
- Some Linux distributions require compiling with -fPIE
- Set during build: CUDA_NVCC_FLAGS=-fPIE cargo build --release --features cuda
Error: CUDA_ERROR_NOT_FOUND or symbol not found:
- For non-quantized models, specify the data type to load and run in
- Use one of f32, f16, bf16, or auto (auto chooses based on device)
- Example: mistralrs run -m <model> --dtype auto
Minimum CUDA compute capability:
- The minimum supported CUDA compute cap is 5.3
- Set a specific compute cap with:
CUDA_COMPUTE_CAP=80 cargo build --release --features cuda
Metal Issues (macOS)
Metal not found (error: unable to find utility “metal”):
1. Install Xcode: xcode-select --install
2. Set the active developer directory: sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
error: cannot execute tool ‘metal’ due to missing Metal toolchain
- Install Metal Toolchain:
xcodebuild -downloadComponent MetalToolchain
Disabling Metal kernel precompilation:
- By default, Metal kernels are precompiled during build time for better performance
- To skip precompilation (useful for CI or when Metal is not needed):
MISTRALRS_METAL_PRECOMPILE=0 cargo build --release --features metal
Memory Issues
Disabling mmap loading:
- Set MISTRALRS_NO_MMAP=1 to disable memory-mapped file loading
- Forces all tensor data into memory
- Useful if you’re seeing mmap-related errors
Out of memory errors:
- Try using quantization: --isq q4k or --isq q8_0
- Use device mapping to offload layers: -n "0:16;cpu:16"
- Reduce context length with PagedAttention: --pa-context-len 4096
Model Loading Issues
Model type not auto-detected:
- If auto-detection fails, please raise an issue
- You can manually specify the architecture if needed
Chat template issues:
- Templates are usually auto-detected
- Override with:
-c /path/to/template.jinja - See Chat Templates for details
Getting Help
If you’re still stuck:
- Discord - Community support
- Matrix - Alternative chat
- GitHub Issues - Bug reports and feature requests
When reporting issues, please include:
- Output of mistralrs doctor
- Command you ran
- Hardware (GPU model, OS)
mistralrs Python SDK
Documentation for the mistralrs Python package.
Installation: See PYTHON_INSTALLATION.md for installation instructions.
Table of contents
- Full API reference: here
- Model configuration (Which enum): here
- MCP Client Configuration: here
- Example: here
- Embeddings example: here
Which
Each *_model_id may be a HF hub repo or a local path. For quantized GGUF models, a list is accepted if multiple files must be specified.
Architecture for plain models
If you do not specify the architecture, an attempt will be made to use the model’s config. If this fails, please raise an issue.
Mistral, Gemma, Mixtral, Llama, Phi2, Phi3, Qwen2, Gemma2, GLM4, Starcoder2, Phi3_5MoE, DeepseekV2, DeepseekV3, Qwen3, Qwen3Moe, SmolLm3, GraniteMoeHybrid, GptOss
ISQ Organization
- Default
- MoQE: if applicable, only quantize MoE experts. https://arxiv.org/abs/2310.02410
Architecture for vision models
Phi3V, Idefics2, LLaVaNext, LLaVa, VLlama, Qwen2VL, Idefics3, MiniCpmO, Phi4MM, Qwen2_5VL, Gemma3, Mistral3, Llama4, Gemma3n, Qwen3VL
Architecture for diffusion models
Flux, FluxOffloaded
Architecture for speech models
Dia
Architecture for embedding models
EmbeddingGemma, Qwen3Embedding
Note: from_uqff specifies a UQFF path to load from. If provided, this takes precedence over applying ISQ. Specify multiple files using a semicolon delimiter (;).
Note: enable_thinking enables thinking for models that support it.
Note: truncate_sequence=True trims prompts that would otherwise exceed the model’s maximum context length. Leave it False to receive a validation error instead.
class Which(Enum):
@dataclass
class Plain:
model_id: str
arch: Architecture | None = None
tokenizer_json: str | None = None
topology: str | None = None
organization: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
calibration_file: str | None = None
imatrix: str | None = None
hf_cache_path: str | None = None
@dataclass
class XLora:
xlora_model_id: str
order: str
arch: Architecture | None = None
model_id: str | None = None
tokenizer_json: str | None = None
tgt_non_granular_index: int | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
hf_cache_path: str | None = None
@dataclass
class Lora:
adapter_model_id: str
arch: Architecture | None = None
model_id: str | None = None
tokenizer_json: str | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
hf_cache_path: str | None = None
@dataclass
class GGUF:
quantized_model_id: str
quantized_filename: str | list[str]
tok_model_id: str | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class XLoraGGUF:
quantized_model_id: str
quantized_filename: str | list[str]
xlora_model_id: str
order: str
tok_model_id: str | None = None
tgt_non_granular_index: int | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class LoraGGUF:
quantized_model_id: str
quantized_filename: str | list[str]
adapters_model_id: str
order: str
tok_model_id: str | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class GGML:
quantized_model_id: str
quantized_filename: str
tok_model_id: str | None = None
tokenizer_json: str | None = None
gqa: int | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class XLoraGGML:
quantized_model_id: str
quantized_filename: str
xlora_model_id: str
order: str
tok_model_id: str | None = None
tgt_non_granular_index: int | None = None
tokenizer_json: str | None = None
gqa: int | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class LoraGGML:
quantized_model_id: str
quantized_filename: str
adapters_model_id: str
order: str
tok_model_id: str | None = None
tokenizer_json: str | None = None
topology: str | None = None
dtype: ModelDType = ModelDType.Auto
auto_map_params: TextAutoMapParams | None = (None,)
@dataclass
class Embedding:
model_id: str
arch: EmbeddingArchitecture | None = None
tokenizer_json: str | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
hf_cache_path: str | None = None
@dataclass
class VisionPlain:
model_id: str
arch: VisionArchitecture
tokenizer_json: str | None = None
topology: str | None = None
from_uqff: str | list[str] | None = None
write_uqff: str | None = None
dtype: ModelDType = ModelDType.Auto
max_edge: int | None = None
auto_map_params: VisionAutoMapParams | None = (None,)
calibration_file: str | None = None
imatrix: str | None = None
hf_cache_path: str | None = None
@dataclass
class DiffusionPlain:
model_id: str
arch: DiffusionArchitecture
dtype: ModelDType = ModelDType.Auto
@dataclass
class Speech:
model_id: str
arch: DiffusionArchitecture
dac_model_id: str | None = None
dtype: ModelDType = ModelDType.Auto
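For a plain (safetensors) text model, a minimal Runner might look like the following sketch, combining Which.Plain with in-situ quantization (mirroring the GGUF example later in this document):
import mistralrs

runner = mistralrs.Runner(
    which=mistralrs.Which.Plain(
        model_id="Qwen/Qwen3-4B",
    ),
    in_situ_quant="Q4K",  # optional; a UQFF file passed via from_uqff takes precedence over ISQ
)
res = runner.send_chat_completion_request(
    mistralrs.ChatCompletionRequest(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
)
print(res.choices[0].message.content)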
Multi-model Support
The mistralrs Python SDK supports running multiple models using the Runner class with the model_id parameter. All request methods accept an optional model_id to target a specific model. When model_id is None or omitted, the default model is used. If aliases are configured (for example via the server config or Rust MultiModelBuilder), list_models() will return those aliases and you can pass them in requests; canonical pipeline names remain accepted.
Basic Usage with model_id
import mistralrs
# Create a Runner with a vision model (Gemma 3 4B)
runner = mistralrs.Runner(
which=mistralrs.Which.VisionPlain(
model_id="google/gemma-3-4b-it",
arch=mistralrs.VisionArchitecture.Gemma3,
),
in_situ_quant="Q4K",
)
# List available models (model IDs are registered IDs, aliases if configured)
models = runner.list_models()
print(f"Available models: {models}") # ["google/gemma-3-4b-it"]
# Send request to specific model using model_id parameter
response = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=100
),
model_id="google/gemma-3-4b-it" # Target specific model
)
# Send request without model_id (uses default model)
response = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=100
)
)
Multi-model Management
# List available models
models = runner.list_models()
print(f"Available models: {models}")
# Get/set default model
default_model = runner.get_default_model_id()
print(f"Default model: {default_model}")
# Change default model (model must be loaded)
runner.set_default_model_id("google/gemma-3-4b-it")
# List models with their status
models_with_status = runner.list_models_with_status()
for model_id, status in models_with_status:
print(f"{model_id}: {status}") # status is "loaded", "unloaded", or "reloading"
Model Unloading and Reloading
You can unload models to free memory and reload them on demand:
model_id = "google/gemma-3-4b-it"
# Check if model is loaded
is_loaded = runner.is_model_loaded(model_id)
print(f"Model loaded: {is_loaded}")
# List models with their status
models_with_status = runner.list_models_with_status()
for mid, status in models_with_status:
print(f"{mid}: {status}")
# Unload a model to free memory (preserves configuration for reload)
runner.unload_model(model_id)
# Check status after unload
is_loaded = runner.is_model_loaded(model_id)
print(f"Model loaded after unload: {is_loaded}") # False
# Manually reload a model
runner.reload_model(model_id)
# Auto-reload: sending a request to an unloaded model will reload it automatically
response = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}]
),
model_id=model_id # Will auto-reload if unloaded
)
Request Methods with model_id
All request methods accept an optional model_id parameter:
# Chat completion
response = runner.send_chat_completion_request(request, model_id="model-id")
# Completion
response = runner.send_completion_request(request, model_id="model-id")
# Embeddings
embeddings = runner.send_embedding_request(request, model_id="model-id")
# Image generation
image = runner.generate_image(prompt, response_format, model_id="model-id")
# Audio generation
audio = runner.generate_audio(prompt, model_id="model-id")
# Tokenization
tokens = runner.tokenize_text(text, add_special_tokens=True, model_id="model-id")
text = runner.detokenize_text(tokens, skip_special_tokens=True, model_id="model-id")
When model_id is None or omitted, the default model is used.
Server Configuration
For server-based multi-model deployment, see the multi-model documentation.
MCP Client
The mistralrs Python SDK now supports Model Context Protocol (MCP) clients, enabling AI assistants to connect to and interact with external tools and resources through standardized server interfaces.
MCP Server Configuration
Configure MCP servers using McpServerConfigPy:
# HTTP-based MCP server with Bearer token authentication
http_server = mistralrs.McpServerConfigPy(
id="web_search",
name="Web Search MCP",
source=mistralrs.McpServerSourcePy.Http(
url="https://api.example.com/mcp",
timeout_secs=30,
headers={"X-API-Version": "v1"} # Optional additional headers
),
enabled=True,
tool_prefix="web", # Prefixes tool names to avoid conflicts
resources=None,
bearer_token="your-api-token" # Automatically added as Authorization header
)
# Process-based MCP server for local tools
process_server = mistralrs.McpServerConfigPy(
id="filesystem",
name="Filesystem MCP",
source=mistralrs.McpServerSourcePy.Process(
command="mcp-server-filesystem",
args=["--root", "/tmp"],
work_dir=None,
env={"MCP_LOG_LEVEL": "debug"} # Optional environment variables
),
enabled=True,
tool_prefix="fs",
resources=["file://**"], # Resource patterns this client is interested in
bearer_token=None # Process servers typically don't need authentication
)
# WebSocket-based MCP server for real-time communication
websocket_server = mistralrs.McpServerConfigPy(
id="realtime_data",
name="Real-time Data MCP",
source=mistralrs.McpServerSourcePy.WebSocket(
url="wss://realtime.example.com/mcp",
timeout_secs=60,
headers=None
),
enabled=True,
tool_prefix="rt",
resources=None,
bearer_token="websocket-token" # WebSocket Bearer token support
)
MCP Client Configuration
Configure the MCP client using McpClientConfigPy:
mcp_config = mistralrs.McpClientConfigPy(
servers=[http_server, process_server, websocket_server],
auto_register_tools=True, # Automatically discover and register tools
tool_timeout_secs=30, # Timeout for individual tool calls
max_concurrent_calls=5 # Maximum concurrent tool calls across all servers
)
Integration with Runner
Pass the MCP client configuration to the Runner:
runner = mistralrs.Runner(
which=mistralrs.Which.GGUF(
tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
),
mcp_client_config=mcp_config # MCP tools automatically registered
)
When auto_register_tools=True, the MCP client will:
- Connect to all enabled MCP servers
- Discover available tools from each server
- Register them for automatic tool calling with appropriate prefixes
- Make them available during model conversations
MCP Transport Types
-
HTTP Transport: Best for public APIs, RESTful services, servers behind load balancers. Supports SSE (Server-Sent Events) and standard HTTP semantics.
-
Process Transport: Best for local tools, development servers, sandboxed environments. Provides process isolation with no network overhead.
-
WebSocket Transport: Best for interactive applications, real-time data, low-latency requirements. Supports persistent connections and server-initiated notifications.
Authentication
- Bearer Tokens: Automatically added as
Authorization: Bearer <token>header for HTTP and WebSocket connections - Custom Headers: Additional headers can be specified for API keys, versioning, etc.
- Process Servers: Typically don’t require authentication as they run locally
Example
from mistralrs import Runner, Which, ChatCompletionRequest
runner = Runner(
which=Which.GGUF(
tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
)
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[{"role":"user", "content":"Tell me a story about the Rust type system."}],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Embeddings example
from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which
runner = Runner(
which=Which.Embedding(
model_id="google/embeddinggemma-300m",
arch=EmbeddingArchitecture.EmbeddingGemma,
)
)
embeddings = runner.send_embedding_request(
EmbeddingRequest(
input=[
"task: query | text: superconductors",
"task: query | text: graphene",
],
truncate_sequence=True,
)
)
print(len(embeddings), len(embeddings[0]))
# Swap the model_id and arch below to load Qwen/Qwen3-Embedding-0.6B instead:
# Runner(
# which=Which.Embedding(
# model_id="Qwen/Qwen3-Embedding-0.6B",
# arch=EmbeddingArchitecture.Qwen3Embedding,
# )
# )
Python SDK Installation
Quick Install from PyPI (Recommended)
Pre-built wheels are available for common platforms. Choose the package that matches your hardware:
| Hardware | Install Command |
|---|---|
| Recommended (auto-optimized) | pip install mistralrs |
| NVIDIA GPUs (CUDA) | pip install mistralrs-cuda |
| Apple Silicon (Metal) | pip install mistralrs-metal |
| Apple Accelerate | pip install mistralrs-accelerate |
| Intel CPUs (MKL) | pip install mistralrs-mkl |
Platform-Specific Optimizations
The mistralrs base package includes platform-specific optimizations:
- macOS Apple Silicon: Metal GPU support built-in
- Linux/Windows x86_64: Intel MKL optimizations built-in
- Linux aarch64: CPU-only (use mistralrs-cuda for GPU support)
All packages install the mistralrs Python module. The package suffix controls which accelerator features are enabled.
Supported Platforms
| Package | Linux x86_64 | Linux aarch64 | Windows x86_64 | macOS aarch64 |
|---|---|---|---|---|
| mistralrs | MKL | CPU | MKL | Metal |
| mistralrs-cuda | CUDA | CUDA | CUDA | - |
| mistralrs-metal | - | - | - | Metal |
| mistralrs-accelerate | - | - | - | Accelerate |
| mistralrs-mkl | MKL | - | MKL | - |
Python version: 3.10+ (wheels use abi3 for forward compatibility)
Windows Requirements
It is recommended to use WSL2 on Windows machines.
On Windows, additional runtime dependencies may be required:
- CUDA packages: Install the NVIDIA CUDA Toolkit and ensure the bin directory is in your PATH
- MKL packages: Install the Intel oneAPI Math Kernel Library runtime
# Example: Install with CUDA support
pip install mistralrs-cuda -v
Build from Source
Building from source gives you access to the latest features and allows customization of build options.
Prerequisites
- Install system packages:

  Ubuntu/Debian:
  sudo apt install libssl-dev pkg-config

  macOS:
  brew install openssl pkg-config

- Install Rust from https://rustup.rs/:

  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source $HOME/.cargo/env

- (Optional) Set up HuggingFace authentication for gated models:

  mkdir -p ~/.cache/huggingface
  echo "YOUR_HF_TOKEN" > ~/.cache/huggingface/token

  Or use huggingface-cli login.
Build Steps
- Clone the repository:

  git clone https://github.com/EricLBuehler/mistral.rs.git
  cd mistral.rs/mistralrs-pyo3

- Create and activate a virtual environment:

  python -m venv .venv
  source .venv/bin/activate   # Linux/macOS
  # or: .venv\Scripts\activate  # Windows

- Install maturin (Rust + Python build tool):

  pip install maturin[patchelf]

- Build and install:

  maturin develop -r --features <your-features>
Feature Flags
| Feature | Description |
|---|---|
| cuda | NVIDIA GPU support |
| flash-attn | Flash Attention (CUDA, Ampere+) |
| flash-attn-v3 | Flash Attention v3 (CUDA, Hopper) |
| cudnn | cuDNN optimizations |
| metal | Apple Silicon GPU (macOS only) |
| accelerate | Apple Accelerate framework |
| mkl | Intel MKL |
Example with CUDA and Flash Attention:
maturin develop -r --features "cuda flash-attn cudnn"
Verify Installation
import mistralrs
print(mistralrs.__version__)
Quick test:
from mistralrs import Runner, Which, ChatCompletionRequest
runner = Runner(
which=Which.Plain(model_id="Qwen/Qwen3-0.6B"),
)
response = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=50,
)
)
print(response.choices[0].message.content)
Next Steps
- SDK Documentation - Full SDK reference
- Examples - Python examples
- Cookbook - Interactive tutorial
HTTP server
Mistral.rs provides a lightweight OpenAI API compatible HTTP server based on axum. The request and response formats are supersets of the OpenAI API.
The API consists of the following endpoints. They can be viewed in your browser interactively by going to http://localhost:<port>/docs.
ℹ️ Besides the HTTP endpoints described below, mistralrs serve can also expose the same functionality via the MCP protocol. Enable it with --mcp-port <port> and see MCP/server.md for details.
Additional object keys
To support additional features, we have extended the completion and chat completion request objects. Both have the same keys added:
- top_k: int|null. If non-null, it is only relevant if positive.
- grammar: {"type": "regex" | "lark" | "json_schema" | "llguidance", "value": string} or null. Grammar to use. This is mutually exclusive with the OpenAI-compatible response_format.
- min_p: float|null. If non-null, it is only relevant if 1 >= min_p >= 0.
- enable_thinking: bool, default false. Enable thinking for models that support it.
- truncate_sequence: bool|null. When true, requests that exceed the model context length will be truncated instead of rejected; otherwise the server returns a validation error. Embedding requests truncate tokens at the end of the prompt, while chat/completion requests truncate tokens at the start of the prompt.
- repetition_penalty: float|null. Penalty for repeating tokens. This is distinct from frequency_penalty and presence_penalty: it applies a direct multiplicative penalty to repeated token logits.
- web_search_options: object|null. Enable web search integration (see WEB_SEARCH.md). Contains optional fields: search_context_size ("low", "medium", "high"), user_location (object with location info), search_description (override search tool description), extract_description (override extraction tool description).
- reasoning_effort: string|null. For Harmony-format models (like GPT-OSS), controls the depth of reasoning: "low", "medium", or "high".
- dry_multiplier: float|null. DRY (Don't Repeat Yourself) sampling multiplier. Controls the strength of the anti-repetition penalty.
- dry_base: float|null. DRY sampling base value.
- dry_allowed_length: int|null. DRY sampling allowed length before penalty applies.
- dry_sequence_breakers: array of strings|null. Tokens that reset the DRY penalty sequence.
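As an illustration (not an exhaustive reference), these extra keys can be passed from the Python openai client via its extra_body parameter; this sketch assumes a server running locally on port 1234 as in the examples below, and the values are arbitrary:

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

# Standard OpenAI fields go in as usual; mistral.rs-specific keys ride along in extra_body.
completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Summarize Rust lifetimes in one sentence."}],
    max_tokens=128,
    extra_body={
        "top_k": 40,                # only relevant if positive
        "min_p": 0.05,              # must satisfy 0 <= min_p <= 1
        "repetition_penalty": 1.1,  # multiplicative penalty on repeated tokens
        "enable_thinking": False,   # for models that support hybrid reasoning
    },
)
print(completion.choices[0].message.content)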
Response Extensions
The response objects include additional fields beyond the standard OpenAI API:
Harmony Mode Responses
For models using Harmony format (like GPT-OSS), responses may include additional reasoning content:
reasoning_content:string|null. Chain-of-thought reasoning from Harmony-format models. This field contains the model’s internal analysis and commentary that led to the final response. It is separate from the maincontentfield.
When streaming, reasoning_content appears in the delta object alongside content.
Example response:
{
"choices": [{
"message": {
"role": "assistant",
"content": "The answer is 42.",
"reasoning_content": "Let me analyze this step by step..."
}
}]
}
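When streaming from a Harmony-format model such as GPT-OSS, the extension field can be read defensively from each chunk's delta. A minimal sketch, assuming the Python openai client and a local server on port 1234:

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 6 * 7?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is a mistral.rs extension, so read it defensively.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"[reasoning] {reasoning}", end="")
    if delta.content:
        print(delta.content, end="")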
Model Parameter Validation
Mistral.rs validates that the model parameter in API requests matches the model that was actually loaded by the server. This ensures requests are processed by the correct model and prevents confusion.
Behavior:
- If the model parameter matches the loaded model name, the request proceeds normally
- If the model parameter doesn't match, the request fails with an error message indicating the mismatch
- The special model name "default" can be used to bypass this validation entirely
Examples:
- ✅ Request with "model": "meta-llama/Llama-3.2-3B-Instruct" when meta-llama/Llama-3.2-3B-Instruct is loaded → succeeds
- ❌ Request with "model": "gpt-4" when mistral-7b-instruct is loaded → fails
- ✅ Request with "model": "default" regardless of loaded model → always succeeds
Usage: Use "default" in the model field when you need to satisfy API clients that require a model parameter but don’t need to specify a particular model. This is demonstrated in all the examples below.
POST: /v1/chat/completions
Process an OpenAI compatible request, returning an OpenAI compatible response when finished. Please find the official OpenAI API documentation here. To control the interval at which keep-alive messages are sent, set the KEEP_ALIVE_INTERVAL environment variable to the desired time in ms.
To send a request with the Python openai library:
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1", # "http://<Your api-server IP>:port"
api_key = "EMPTY"
)
completion = client.chat.completions.create(
model="default",
messages=[
{"role": "system", "content": "You are Mistral.rs, an AI assistant."},
{"role": "user", "content": "Write a story about Rust error handling."}
]
)
print(completion.choices[0].message)
Or with curl:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"messages": [
{
"role": "system",
"content": "You are Mistral.rs, an AI assistant."
},
{
"role": "user",
"content": "Write a story about Rust error handling."
}
]
}'
A streaming request can also be created by setting "stream": true in the request JSON. Please see this guide.
ℹ️ Requests whose prompt exceeds the model's maximum context length now fail unless you opt in to truncation. Set "truncate_sequence": true to drop the oldest prompt tokens while reserving room (equal to max_tokens when provided, otherwise one token) for generation. Specifically, tokens from the front of the prompt are dropped.
GET: /v1/models
Returns the running models.
Example with curl:
curl http://localhost:<port>/v1/models
GET: / or /health
Returns the server health.
Example with curl:
curl http://localhost:<port>/health
GET: /docs
Returns OpenAPI API docs via SwaggerUI.
Example with curl:
curl http://localhost:<port>/docs
POST: /v1/completions
Process an OpenAI compatible completions request, returning an OpenAI compatible response when finished. Please find the official OpenAI API documentation here.
Completions-specific parameters
In addition to the common parameters listed above, the completions endpoint supports:
- best_of: int|null. Generate best_of completions server-side and return the best one (the one with the highest log probability per token). When used with n, best_of must be greater than n.
- echo: bool, default false. Echo back the prompt in addition to the completion.
To send a request with the Python openai library:
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1", # "http://<Your api-server IP>:port"
api_key = "EMPTY"
)
completion = client.completions.create(
model="default",
prompt="What is Rust?",
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
print(completion.choices[0].text)
Or with curl:
curl http://localhost:1234/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"prompt": "What is Rust?"
}'
ℹ️ The truncate_sequence flag behaves the same way for the completions endpoint: keep it false (default) to receive a validation error, or set it to true to trim the prompt automatically.
POST: /v1/embeddings
Serve an embedding model (for example, EmbeddingGemma) to enable this endpoint:
mistralrs serve -m google/embeddinggemma-300m
In multi-model mode, include an Embedding entry in your selector config to expose it alongside chat models.
Create vector embeddings via the OpenAI-compatible endpoint. Supported request fields:
- input: a single string, an array of strings, an array of token IDs ([123, 456]), or a batch of token arrays ([[...], [...]]).
- encoding_format: "float" (default) returns arrays of f32; "base64" returns Base64 strings.
- dimensions: currently unsupported; providing it yields a validation error.
- truncate_sequence: bool, default false. Set to true to clip over-length prompts instead of receiving a validation error.
ℹ️ Requests whose prompt exceeds the model’s maximum context length now fail unless you opt in to truncation. Embedding requests truncate tokens from the end of the prompt.
Example (Python openai client):
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1",
api_key="EMPTY",
)
result = client.embeddings.create(
model="default",
input=[
"Embeddings capture semantic relationships between texts.",
"What is graphene?",
],
    extra_body={"truncate_sequence": True},
)
for item in result.data:
print(item.index, len(item.embedding))
Example with curl:
curl http://localhost:1234/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"input": ["graphene conductivity", "superconductor basics"],
"encoding_format": "base64",
"truncate_sequence": false
}'
Responses follow the OpenAI schema: object: "list", data[*].embedding containing either float arrays or Base64 strings depending on encoding_format, and a usage block (prompt_tokens, total_tokens). At present those counters report 0 because token accounting for embeddings is not yet implemented.
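If you request encoding_format "base64", each embedding arrives as a Base64 string. A minimal client-side decoding sketch, assuming the bytes are little-endian f32 values (the same values the "float" format would return):

import base64
import struct

def decode_embedding(b64_embedding: str) -> list[float]:
    """Decode a Base64-encoded embedding into a list of f32 values."""
    raw = base64.b64decode(b64_embedding)
    count = len(raw) // 4  # 4 bytes per f32
    return list(struct.unpack(f"<{count}f", raw))

# Example: decode the first item of a base64-format embeddings response.
# vector = decode_embedding(result.data[0].embedding)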
POST: /v1/images/generations
Generate images using diffusion models (like FLUX). First, serve a diffusion model:
mistralrs serve -m black-forest-labs/FLUX.1-schnell
Supported request fields:
- model: Model identifier (use "default" to bypass validation)
- prompt: Text description of the image to generate
- n: Number of images to generate (default: 1)
- response_format: "url" or "b64_json" (default: "url")
- height: Image height in pixels (default: 720)
- width: Image width in pixels (default: 1280)
Example with Python:
import openai
import base64
client = openai.OpenAI(
base_url="http://localhost:1234/v1",
api_key="EMPTY",
)
response = client.images.generate(
model="default",
prompt="A majestic snow-covered mountain at sunset",
n=1,
response_format="b64_json",
size="1280x720", # width x height
)
# Save the generated image
image_data = base64.b64decode(response.data[0].b64_json)
with open("output.png", "wb") as f:
f.write(image_data)
Example with curl:
curl http://localhost:1234/v1/images/generations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"prompt": "A majestic snow-covered mountain at sunset",
"n": 1,
"response_format": "b64_json",
"height": 720,
"width": 1280
}'
POST: /v1/audio/speech
Generate speech from text using speech models (like Dia). First, serve a speech model:
mistralrs serve -m nari-labs/Dia-1.6B
Supported request fields:
- model: Model identifier (use "default" to bypass validation)
- input: Text to convert to speech. For Dia models, use speaker tags like [S1] and [S2] to control multiple voices
- response_format: "wav" or "pcm" (only these formats are supported)
Note: The voice and instructions fields from the OpenAI API are currently ignored.
Example with Python:
import requests
response = requests.post(
"http://localhost:1234/v1/audio/speech",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer EMPTY",
},
json={
"model": "default",
"input": "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!",
"response_format": "wav",
},
)
# Save the audio file
with open("output.wav", "wb") as f:
f.write(response.content)
Example with curl:
curl http://localhost:1234/v1/audio/speech \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"input": "[S1] Dia is an open weights text to dialogue model. [S2] Try it now!",
"response_format": "wav"
}' \
--output output.wav
The response is raw audio data with the appropriate Content-Type header (audio/wav for WAV format, audio/pcm for PCM format).
POST: /v1/responses
Create a response using the OpenAI-compatible Responses API. Please find the official OpenAI API documentation here.
To send a request with the Python openai library:
import openai
client = openai.OpenAI(
base_url="http://localhost:1234/v1",
api_key = "EMPTY"
)
# First turn
resp1 = client.responses.create(
model="default",
input="Apples are delicious!"
)
print(resp1.output_text)
# Follow-up - no need to resend the first message
resp2 = client.responses.create(
model="default",
previous_response_id=resp1.id,
input="Can you eat them?"
)
print(resp2.output_text)
Or with curl:
curl http://localhost:1234/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"input": "Tell me about Rust programming"
}'
# Follow-up using previous_response_id
curl http://localhost:1234/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"previous_response_id": "resp_12345-uuid-here",
"input": "What makes it memory safe?"
}'
The API also supports multimodal inputs (images, audio) and streaming responses by setting "stream": true in the request JSON.
ℹ️ The Responses API forwards truncate_sequence to underlying chat completions. Enable it if you want over-length conversations to be truncated rather than rejected.
GET: /v1/responses/{response_id}
Retrieve a previously created response by its ID.
Example with curl:
curl http://localhost:1234/v1/responses/resp_12345-uuid-here \
-H "Authorization: Bearer EMPTY"
DELETE: /v1/responses/{response_id}
Delete a stored response and its associated conversation history.
Example with curl:
curl -X DELETE http://localhost:1234/v1/responses/resp_12345-uuid-here \
-H "Authorization: Bearer EMPTY"
POST: /re_isq
Reapply ISQ to the model if possible. Pass a JSON object with the key ggml_type mapped to a string specifying the quantization level.
Example with curl:
curl http://localhost:<port>/re_isq -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{"ggml_type":"4"}'
Model Management Endpoints
These endpoints allow dynamic management of loaded models, enabling you to free memory by unloading models and reload them on demand.
POST: /v1/models/unload
Unload a model from memory while preserving its configuration for later reload. The model can be reloaded manually or will auto-reload when a request is sent to it.
Request body:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "unloaded"
}
Example with curl:
curl -X POST http://localhost:1234/v1/models/unload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
POST: /v1/models/reload
Manually reload a previously unloaded model. This is also triggered automatically when a request is sent to an unloaded model.
Request body:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Example with curl:
curl -X POST http://localhost:1234/v1/models/reload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
POST: /v1/models/status
Get the current status of a specific model.
Request body:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Example with curl:
curl -X POST http://localhost:1234/v1/models/status \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Status Values
The status field in responses can be one of:
| Status | Description |
|---|---|
| loaded | Model is loaded and ready to serve requests |
| unloaded | Model is unloaded but can be reloaded |
| reloading | Model is currently being reloaded |
| not_found | Model ID not recognized |
| no_loader_config | Model cannot be reloaded (missing loader configuration) |
| internal_error | An internal error occurred (check error field for details) |
When an error occurs, the response may include an error field with additional details:
{
"model_id": "unknown-model",
"status": "not_found",
"error": null
}
Auto-Reload Behavior
When a request (e.g., chat completion) is sent to an unloaded model, the model will automatically reload before processing the request. This enables a “lazy loading” pattern where models are only loaded when needed, helping manage GPU memory efficiently.
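For example, the lazy-loading pattern can be exercised end-to-end with plain HTTP calls. A sketch using the requests library, reusing the server and model ID from the examples above:

import requests

BASE = "http://localhost:1234"
MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# Free memory while the model is idle.
requests.post(f"{BASE}/v1/models/unload", json={"model_id": MODEL}).raise_for_status()

# The next request to the unloaded model triggers an automatic reload first.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello again!"}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])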
Models List with Status
The /v1/models endpoint includes a status field for each model:
curl http://localhost:1234/v1/models
Response:
{
"object": "list",
"data": [
{
"id": "default",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "meta-llama/Llama-3.2-3B-Instruct",
"object": "model",
"created": 1234567890,
"owned_by": "local",
"status": "loaded"
}
]
}
OpenResponses API
mistral.rs supports the OpenResponses API specification.
Endpoints
- POST /v1/responses - Create a response
- GET /v1/responses/{id} - Retrieve a response
- DELETE /v1/responses/{id} - Delete a response
- POST /v1/responses/{id}/cancel - Cancel a background response
Unsupported Parameters
The following parameters are accepted for API compatibility but will return errors if set to non-default values:
| Parameter | Behavior |
|---|---|
| parallel_tool_calls | Only true or omitted is supported; false returns an error |
| max_tool_calls | Not supported; setting any value returns an error |
mistral.rs Extensions
These additional parameters are available beyond the spec:
- stop - Stop sequences
- repetition_penalty - Token repetition penalty
- top_k - Top-k sampling
- grammar - Constrained generation grammar
- min_p - Min-p sampling
- dry_multiplier, dry_base, dry_allowed_length, dry_sequence_breakers - DRY sampling
- web_search_options - Web search integration
See HTTP.md for usage examples.
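As a quick sketch, these extension parameters can be supplied through the Python openai client's extra_body (values here are illustrative; the server is assumed on port 1234):

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

resp = client.responses.create(
    model="default",
    input="Give me three facts about graphene.",
    extra_body={
        "top_k": 40,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
        "stop": ["\n\n"],
    },
)
print(resp.output_text)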
Supported Models
Complete reference for model support in mistral.rs.
Model Categories
Text Models
- Granite 4.0
- SmolLM 3
- DeepSeek V3
- GPT-OSS
- DeepSeek V2
- Qwen 3 MoE
- Phi 3.5 MoE
- Qwen 3
- GLM 4
- GLM-4.7-Flash
- GLM-4.7 (MoE)
- Gemma 2
- Qwen 2
- Starcoder 2
- Phi 3
- Mixtral
- Phi 2
- Gemma
- Llama
- Mistral
Vision Models
- Qwen 3-VL
- Gemma 3n
- Llama 4
- Gemma 3
- Mistral 3
- Phi 4 multimodal
- Qwen 2.5-VL
- MiniCPM-O
- Llama 3.2 Vision
- Qwen 2-VL
- Idefics 3
- Idefics 2
- LLaVA Next
- LLaVA
- Phi 3V
Speech Models
- Dia
Image Generation Models
- FLUX
Embedding Models
- Embedding Gemma
- Qwen 3 Embedding
Supported GGUF Architectures
Plain:
- llama
- phi2
- phi3
- starcoder2
- qwen2
- qwen3
With adapters:
- llama
- phi3
Quantization Support
| Model | GGUF | GGML | ISQ |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | ||
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | ✅ | |
| Phi 3 | ✅ | ✅ | |
| Phi 3.5 MoE | ✅ | ||
| Qwen 2.5 | ✅ | ||
| Phi 3 Vision | ✅ | ||
| Idefics 2 | ✅ | ||
| Gemma 2 | ✅ | ||
| GLM4 | ✅ | ||
| GLM-4.7-Flash (MoE) | ✅ | ||
| GLM-4.7 (MoE) | ✅ | ||
| Starcoder 2 | ✅ | ✅ | |
| LLaVa Next | ✅ | ||
| LLaVa | ✅ | ||
| Llama 3.2 Vision | ✅ | ||
| Qwen2-VL | ✅ | ||
| Idefics 3 | ✅ | ||
| Deepseek V2 | ✅ | ||
| Deepseek V3 | ✅ | ||
| MiniCPM-O 2.6 | ✅ | ||
| Qwen2.5-VL | ✅ | ||
| Gemma 3 | ✅ | ||
| Mistral 3 | ✅ | ||
| Llama 4 | ✅ | ||
| Qwen 3 | ✅ | ✅ | |
| SmolLM3 | ✅ | ||
| Dia 1.6b | ✅ | ||
| Gemma 3n | ✅ | ||
| Qwen 3 VL | ✅ | ||
| Granite 4.0 | ✅ | ||
| GPT-OSS | ✅ |
Device Mapping Support
| Model category | Supported |
|---|---|
| Plain | ✅ |
| GGUF | ✅ |
| GGML | |
| Vision Plain | ✅ |
X-LoRA and LoRA Support
| Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | ||
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | ||
| Phi 3 | ✅ | ✅ | |
| Phi 3.5 MoE | |||
| Qwen 2.5 | |||
| Phi 3 Vision | |||
| Idefics 2 | |||
| Gemma 2 | ✅ | ||
| GLM4 | ✅ | ||
| GLM-4.7-Flash (MoE) | |||
| GLM-4.7 (MoE) | |||
| Starcoder 2 | ✅ | ||
| LLaVa Next | |||
| LLaVa | |||
| Qwen2-VL | |||
| Idefics 3 | |||
| Deepseek V2 | |||
| Deepseek V3 | |||
| MiniCPM-O 2.6 | |||
| Qwen2.5-VL | |||
| Gemma 3 | |||
| Mistral 3 | |||
| Llama 4 | |||
| Qwen 3 | |||
| SmolLM3 | ✅ | ||
| Gemma 3n | |||
| Qwen 3 VL | |||
| Granite 4.0 | |||
| GPT-OSS |
AnyMoE Support
| Model | AnyMoE |
|---|---|
| Mistral 7B | ✅ |
| Gemma | ✅ |
| Llama | ✅ |
| Mixtral | |
| Phi 2 | ✅ |
| Phi 3 | ✅ |
| Phi 3.5 MoE | |
| Qwen 2.5 | ✅ |
| Phi 3 Vision | |
| Idefics 2 | |
| Gemma 2 | ✅ |
| GLM-4.7-Flash (MoE) | |
| GLM-4.7 (MoE) | |
| Starcoder 2 | ✅ |
| LLaVa Next | ✅ |
| LLaVa | ✅ |
| Llama 3.2 Vision | |
| Qwen2-VL | |
| Idefics 3 | ✅ |
| Deepseek V2 | |
| Deepseek V3 | |
| MiniCPM-O 2.6 | |
| Qwen2.5-VL | |
| Gemma 3 | ✅ |
| Mistral 3 | ✅ |
| Llama 4 | |
| Qwen 3 | |
| SmolLM3 | ✅ |
| Gemma 3n | |
| Qwen 3 VL | |
| Granite 4.0 | |
| GPT-OSS |
Using Derivative Models
Model type is auto-detected. Use flags for quantized models and adapters:
| Model Type | Required Arguments |
|---|---|
| Plain | -m <model-id> |
| GGUF Quantized | -m <model-id> --format gguf -f <file> |
| ISQ Quantized | -m <model-id> --isq <level> |
| UQFF Quantized | -m <model-id> --from-uqff <file> |
| LoRA | -m <model-id> --lora <adapter> |
| X-LoRA | -m <model-id> --xlora <adapter> --xlora-order <file> |
Example: Zephyr GGUF model
mistralrs serve -p 1234 --log output.txt --format gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
Chat Templates and Tokenizer
Mistral.rs will attempt to automatically load a chat template and tokenizer. This provides flexibility across models and ensures accurate chat templating. However, this behavior can be customized.
Vision model support in mistral.rs
Mistral.rs supports various modalities of models, including vision models. Vision models take images and text as input and have the capability to reason over both.
Please see docs for the following model types:
- Phi 3 Vision: PHI3V.md
- Idefics2: IDEFICS2.md
- LLaVA and LLaVANext: LLAVA.md
- Llama 3.2 Vision: VLLAMA.md
- Qwen2-VL: QWEN2VL.md
- Idefics 3 and Smol VLM: IDEFICS3.md
- Phi 4 Multimodal: PHI4MM.md
Note for the Python and HTTP APIs: We follow the OpenAI specification for structuring the image messages and allow both base64 encoded images as well as a URL/path to the image. There are many examples of this, see this Python example.
Image generation model support in mistral.rs
Mistral.rs supports various modalities of models, including image generation models. Image generation models take text as input and generate images.
Please see docs for the following model types:
- FLUX.1 FLUX.md
Embeddings Overview
Mistral.rs can load embedding models alongside chat, vision, diffusion, and speech workloads. Embedding models produce dense vector representations that you can use for similarity search, clustering, reranking, and other semantic tasks.
Supported models
| Model | Notes | Documentation |
|---|---|---|
| EmbeddingGemma | Google’s multilingual embedding model. | EMBEDDINGGEMMA.md |
| Qwen3 Embedding | Qwen’s general-purpose embedding encoder. | QWEN3_EMBEDDING.md |
Have another embedding model you would like supported? Open an issue with the model ID and configuration.
Usage overview
- Choose a model from the table above.
- Load it through one of our APIs:
- CLI/HTTP
- Python
- Rust
Detailed examples for each model live in their dedicated documentation pages.
DeepSeek V2: deepseek-ai/DeepSeek-V2-Lite
DeepSeek V2 is a mixture of experts (MoE) model featuring “Multi-head Latent Attention”.
- Context length of 32k tokens (Lite model), 128k tokens (full model)
- 64 routed experts (Lite model), 160 routed experts (full model)
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-V2-Lite
Note
This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-V2-Lite --isq-organization moqe
HTTP API
mistralrs serve --isq 4 -p 1234 -m deepseek-ai/DeepSeek-V2-Lite
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="deepseek-ai/DeepSeek-V2-Lite",
arch=Architecture.DeepseekV2,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-V2-Lite")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
DeepSeek V3: deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1
DeepSeek V3 is a mixture of experts (MoE) model.
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1
Note
The non-distill versions of the DeepSeek R1 models share the DeepSeek V3 architecture.
Note
This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1 --isq-organization moqe
Running the distill models
The various distillation models can be run out of the box.
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
HTTP API
mistralrs serve --isq 4 -p 1234 -m deepseek-ai/DeepSeek-R1
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="deepseek-ai/DeepSeek-R1",
arch=Architecture.DeepseekV3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-R1")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Gemma 2 Model
See the Gemma 2 model Collection
The Gemma 2 models are a family of text-to-text decoder-only LLMs. As such, the methods to use them are the same as with all other text-to-text LLMs supported by mistral.rs.
HTTP API
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="google/gemma-2-9b-it",
arch=Architecture.Gemma2,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Gemma 3 Model: google/gemma-3-4b-it
The Gemma 3 model is a family of multimodal (text+vision) models with 128k context length. The collection can be found here, with model sizes ranging from 4B to 27B.
We support the Gemma 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is this?
Output:
The image shows Mount Washington in New Hampshire, USA. It's a prominent peak in the White Mountains, known for its extreme weather conditions and being the highest peak in the Northeastern United States. The image captures it covered in snow with a dramatic sky above. The structures at the summit are communication towers.
The winding path visible on the mountain slopes appears to be part of the Mount Washington Auto Road, a historic road that allows vehicles to drive to the summit.
- Start the server
mistralrs serve vision -p 1234 -m google/gemma-3-12b-it
- Send a request
from openai import OpenAI
import httpx
import textwrap
import json
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Gemma 3 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("google/gemma-3-12b-it")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3-12b-it",
arch=VisionArchitecture.Gemma3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Gemma 3n Model: google/gemma-3n-E4B-it
Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio, and generating text outputs. These models support over 140 spoken languages.
The Gemma 3n Model has support in the Rust, Python, and HTTP APIs. Additionally, the Gemma 3n Model supports ISQ for increased performance.
- Full multimodal support: mistral.rs supports text, audio, and vision inputs to Gemma 3n!
- 🪆 MatFormer support: mistral.rs supports dynamically resizing the Gemma 3n model with the MatFormer architecture! Gemma 3n implements MatFormer, which allows one model to be resized dynamically to tune performance on resource-constrained systems. You can access this feature using the matformer_config_path (example config) and matformer_slice_name arguments throughout the APIs.
- Prequantized UQFF models are also available.
Using MatFormer with Gemma 3n
MatFormer allows you to dynamically adjust the model size based on your resource constraints. The Gemma 3n model comes with several pre-configured slices that offer different performance/resource trade-offs.
You can read more about MatFormer in mistral.rs here.
Available Slices
The default configuration file (matformer_configs/gemma3n.csv) includes:
- Main model (3.98B params, 35 layers) - Full model with best performance
- Config for official E2B Model (1.91B params, 30 layers) - Balanced performance/efficiency
- Various intermediate configurations from E1.96B to E3.79B with different layer and FFN configurations
Command Line Example
# Run with the E2.49B slice for balanced performance/efficiency
mistralrs run vision -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"
Python SDK Example
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
# Use the E2.49B slice for balanced performance/efficiency
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
matformer_config_path="matformer_configs/gemma3n.csv",
matformer_slice_name="Config for E2.49B (block-level)",
),
)
# The model will use 35 layers with mixed FFN dimensions (4096 for early layers, 8192 for middle)
# This results in ~37% parameter reduction while maintaining better performance than E2B
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="ignore",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What do you see in this image?",
},
],
}
],
max_tokens=100,
)
)
print(res.choices[0].message.content)
Rust SDK Example
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
use std::path::PathBuf;
#[tokio::main]
async fn main() -> Result<()> {
// Build model with MatFormer E2.49B configuration
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_isq(IsqType::Q4K)
.with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
.with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"Describe this image briefly.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
println!("Using E2.49B slice: 35 layers, 2.49B effective params");
Ok(())
}
Choosing the Right Slice
- Resource-constrained environments: Use “Config for official E2B Model” (1.91B params)
- Balanced performance: Try E2.49B to E2.98B configurations (block-level configs offer better balance)
- Maximum quality: Use “Main model” (3.98B params) or omit MatFormer configuration entirely
The slice selection allows you to:
- Reduce memory usage proportionally to the parameter count
- Speed up inference roughly linearly with the number of layers
- Maintain acceptable quality for many use cases with smaller slices
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
Please describe this image in detail.
Output:
The image captures a breathtaking, wide-angle view of a majestic mountain covered in a blanket of snow. The mountain dominates the frame, its peak reaching towards a partly cloudy sky. The snow cover is uneven, with patches of exposed dark rock and textured snow formations creating a visually interesting surface.
A winding, snow-covered path or road snakes its way up the mountainside, appearing as a bright white line against the darker slopes. This path draws the eye upwards towards the summit, where a few structures, possibly communication towers or observation points, are visible.
The lower slopes of the mountain are covered in a dense forest of evergreen trees, their dark green hues contrasting beautifully with the white snow. The forest extends down into a valley, hinting at a wider landscape beyond the frame.
The sky above is a mix of pale blue and soft grey clouds, with some darker, more dramatic cloud formations near the top of the mountain. The lighting suggests it might be early morning or late afternoon, casting subtle shadows across the mountain's surface and highlighting its contours.
The overall impression is one of grandeur, tranquility, and the raw beauty of a winter landscape. The scale of the mountain is impressive, and the winding path invites a sense of exploration and adventure.
- Start the server
mistralrs serve vision -p 1234 -m google/gemma-3n-E4B-it
# Or with MatFormer for balanced performance:
mistralrs serve vision -p 1234 -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="ignore",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "Please describe this image in detail.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Gemma 3n model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"Please describe the image in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="ignore",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "Please describe this image in detail.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
OpenAI HTTP API
Audio is delivered with the audio_url content-type that mirrors OpenAI's official specification:
{
"role": "user",
"content": [
{
"type": "audio_url",
"audio_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg" }
},
{
"type": "image_url",
"image_url": { "url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg" }
},
{
"type": "text",
"text": "Describe what is happening in this clip in as much detail as possible."
}
]
}
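For example, with the Python openai client pointed at a local mistral.rs server (a sketch; the server is assumed to be running a Gemma 3n vision model on port 1234):

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg"},
                },
                {
                    "type": "text",
                    "text": "Describe what is happening in this clip in as much detail as possible.",
                },
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)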
Rust SDK
use anyhow::Result;
use mistralrs::{AudioInput, IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let audio_bytes = reqwest::blocking::get(
"https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg",
)?
.bytes()?
.to_vec();
let audio = AudioInput::from_bytes(&audio_bytes)?;
let image_bytes = reqwest::blocking::get(
"https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg",
)?
.bytes()?
.to_vec();
let image = image::load_from_memory(&image_bytes)?;
let messages = VisionMessages::new()
.add_multimodal_message(
TextMessageRole::User,
"Describe in detail what is happening.",
vec![image],
vec![audio],
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
Ok(())
}
With this, you now have a single-call pipeline that fuses sound, vision, and text – all running locally through mistral.rs! 🔥
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
GLM4 Model
GLM4 is a series of open, multilingual, and multimodal large language models. The text-to-text LLM backbones in GLM4 are supported by mistral.rs.
HTTP API
import openai
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="THUDM/GLM-4-9B-0414",
arch=Architecture.GLM4,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
GLM-4.7-Flash (MoE): zai-org/GLM-4.7-Flash
GLM-4.7-Flash is a mixture of experts (MoE) model from the GLM family with MLA (Multi-head Latent Attention) architecture.
HTTP API
Start the server:
mistralrs serve --isq 4 -p 1234 -m zai-org/GLM-4.7-Flash
Send requests using an OpenAI-compatible client:
import openai
client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="zai-org/GLM-4.7-Flash",
arch=Architecture.GLM4MoeLite,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("zai-org/GLM-4.7-Flash")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
GLM-4.7 (MoE): zai-org/GLM-4.7
GLM-4.7 is a mixture of experts (MoE) model from the GLM family with standard GQA attention and partial RoPE.
HTTP API
Start the server:
mistralrs serve --isq 4 -p 1234 -m zai-org/GLM-4.7
Send requests using an OpenAI-compatible client:
import openai
client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="zai-org/GLM-4.7",
arch=Architecture.GLM4Moe,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("zai-org/GLM-4.7")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
GPT-OSS
GPT-OSS is a Mixture of Experts (MoE) language model with specialized attention mechanisms and efficient quantization. Key features include:
- MXFP4 quantized MoE experts for efficient inference
- Per-head attention sinks for improved attention patterns
- YARN RoPE scaling for extended context
- Hybrid cache supporting both full and sliding window attention
mistralrs run -m openai/gpt-oss-20b
Note: GPT-OSS MoE experts are pre-quantized in MXFP4 format. ISQ can be applied to attention layers only.
Note: PagedAttention is not supported for GPT-OSS due to custom attention with sinks.
HTTP API
You can find a more detailed example here.
mistralrs serve -p 1234 -m openai/gpt-oss-20b
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
You can find a more detailed example here.
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="openai/gpt-oss-20b",
arch=Architecture.GptOss,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find a more detailed example here.
use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("openai/gpt-oss-20b")
.with_logging()
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Technical Details
MXFP4 Quantization
GPT-OSS MoE experts use MXFP4 (4-bit microscaling floating point) quantization for compact and efficient storage:
gate_up_proj: Packed experts with MXFP4 weightsdown_proj: Packed experts with MXFP4 weights- Scales stored at 1 byte per 32 elements
Attention with Sinks
The model uses per-head attention sinks that are added to attention logits before softmax, helping to regularize attention patterns. This custom attention mechanism is incompatible with PagedAttention.
ISQ Support
In-situ quantization (ISQ) can be applied to attention projection layers:
- q_proj, k_proj, v_proj, o_proj
- lm_head
MoE expert layers are already MXFP4 quantized and excluded from ISQ.
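For example, requesting ISQ through the Python SDK only affects those layers, since the MXFP4 experts are skipped. A minimal sketch, reusing the Runner/Which API from the Python SDK example above (the in_situ_quant value mirrors the other guides):

from mistralrs import Runner, Which, Architecture

# ISQ quantizes only the q/k/v/o projections and lm_head here;
# the MoE experts keep their original MXFP4 weights.
runner = Runner(
    which=Which.Plain(
        model_id="openai/gpt-oss-20b",
        arch=Architecture.GptOss,
    ),
    in_situ_quant="4",
)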
Qwen 3: collection
The Qwen 3 family is a collection of hybrid reasoning MoE and non-MoE models ranging from 0.6B to 235B parameters.
mistralrs run --isq 4 -m Qwen/Qwen3-8B
mistralrs run --isq 4 -m Qwen/Qwen3-30B-A3B
Note: mistral.rs can load all FP8 pre-quantized versions natively! Simply replace the model ID.
Note: tool calling support is fully implemented for the Qwen 3 models, including agentic web search.
Enabling thinking
The Qwen 3 models are hybrid reasoning models which can be controlled at inference time. By default, reasoning is enabled. To control this dynamically, it is recommended to append /no_think or /think to your prompt. Alternatively, you can specify the enable_thinking flag as shown in the API-specific examples.
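As a quick illustration of the prompt-based route (a minimal sketch against the OpenAI-compatible server started as in the HTTP example below; only the /no_think suffix matters here):

import openai

client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Appending /no_think to the user prompt disables reasoning for this turn.
completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Summarize Rust's borrow checker in one sentence. /no_think"}
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)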
HTTP API
You can find a more detailed example demonstrating enabling/disabling thinking here.
mistralrs serve --isq 4 -p 1234 -m Qwen/Qwen3-8B
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
# enable_thinking=False,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="Qwen/Qwen3-8B",
arch=Architecture.Qwen3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
# enable_thinking=False,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("Qwen/Qwen3-8B")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
// .enable_thinking(false)
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
SmolLM3: HuggingFaceTB/SmolLM3-3B
SmolLM3 is a 3B-parameter, long-context, hybrid reasoning language model. It supports six languages and advanced reasoning, and as a fully open model it offers strong performance at the 3B–4B scale.
Default, easiest:
mistralrs run --isq 8 -m HuggingFaceTB/SmolLM3-3B
UQFF prequantized:
mistralrs run -m EricB/SmolLM3-3B-UQFF --from-uqff smollm33b-q4k-0.uqff
Note: tool calling support is fully implemented for the SmolLM3 models, including agentic web search.
Check out prequantized UQFF SmolLM3 here: https://huggingface.co/EricB/SmolLM3-3B-UQFF
Enabling thinking
The SmolLM3 models are hybrid reasoning models which can be controlled at inference time. By default, reasoning is enabled. To control this dynamically, it is recommended to append /no_think or /think to your prompt. Alternatively, you can specify the enable_thinking flag as shown in the API-specific examples.
HTTP API
You can find a more detailed example demonstrating enabling/disabling thinking here.
mistralrs serve --isq 8 -p 1234 -m HuggingFaceTB/SmolLM3-3B
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
# enable_thinking=False,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="HuggingFaceTB/SmolLM3-3B",
arch=Architecture.SmolLm3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
# enable_thinking=False,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find a more detailed example demonstrating enabling/disabling thinking here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("HuggingFaceTB/SmolLM3-3B")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
// .enable_thinking(false)
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Idefics 2 Model: HuggingFaceM4/idefics2-8b-chatty
The Idefics 2 Model has support in the Rust, Python, and HTTP APIs. The Idefics 2 Model also supports ISQ for increased performance.
Note: Some of the examples use our Cephalo model series, but any compatible model ID can be used.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Prompt:
What is shown in this image?
Output:
The image depicts a group of orange ants climbing over a black pole. The ants are moving in the same direction, forming a line as they ascend the pole.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m HuggingFaceM4/idefics2-8b-chatty
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
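Since the linked base64 and local-file examples are not reproduced here, the snippet below sketches both routes for the request above; the file name is a placeholder and the data-URL form follows the usual OpenAI convention:

import base64
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# For a local image, the plain file path can also be passed as the "url" value.
with open("photo.jpg", "rb") as f:  # placeholder path
    b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)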
Rust
You can find this example here.
This is a minimal example of running the Idefics 2 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(
"HuggingFaceM4/idefics2-8b-chatty",
)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_idefics_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="lamm-mit/Cephalo-Idefics-2-vision-8b-beta",
arch=VisionArchitecture.Idefics2,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Idefics 3 Vision: HuggingFaceM4/Idefics3-8B-Llama3
Mistral.rs supports the Idefics 3 vision model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. Check the Hugging Face text model config for more information or raise an issue.
Using the 🤗 Smol VLM models
Simply substitute the Idefics 3 model ID (HuggingFaceM4/Idefics3-8B-Llama3) with the Smol VLM one (HuggingFaceTB/SmolVLM-Instruct)!
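Through the Python SDK, for example, this is just the Idefics 3 loader from further below with the model ID swapped (a minimal sketch; the architecture is assumed to remain Idefics3):

from mistralrs import Runner, Which, VisionArchitecture

# Same loader as for Idefics 3; only the model ID changes for SmolVLM.
runner = Runner(
    which=Which.VisionPlain(
        model_id="HuggingFaceTB/SmolVLM-Instruct",
        arch=VisionArchitecture.Idefics3,
    ),
)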
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
- Start up interactive mode with the Idefics 3 model
mistralrs run vision --isq 4 -m HuggingFaceM4/Idefics3-8B-Llama3
- Ask a question
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image depicts a single, large, red rose in full bloom. The rose is positioned against a blurred background that suggests a natural setting, possibly outdoors. The petals of the rose are vividly red with a slight sheen, indicating that they are wet, likely from recent rainfall or dew. The petals are tightly packed and have a velvety texture, which is characteristic of roses. The edges of the petals are slightly curled and appear to be glistening with water droplets, enhancing the overall freshness and beauty of the flower.
The stem of the rose is visible and appears to be green, with a few small thorns scattered along its length. The stem is slender and supports the weight of the large, showy head of the rose. The leaves that accompany the stem are not fully visible in the image but are implied by the presence of the stem.
The background is out of focus, which helps to emphasize the rose as the main subject of the image. The blurred background suggests a natural environment, possibly a garden or a field, with hints of greenery and possibly other flowers or plants. The lighting in the image is natural, likely from sunlight, which casts soft shadows on the petals and adds depth to the scene.
The overall composition of the image focuses on the rose, making it the central point of interest. The wetness of the petals adds a dynamic element to the stillness of the flower, giving it a sense of life and vitality. This could symbolize themes of beauty, nature, and perhaps even passion or love.
In summary, this image captures a single red rose in full bloom with wet petals against a blurred natural background. The rose is the focal point, with its vibrant red color and glistening petals drawing attention. The natural lighting and out-of-focus background enhance the beauty and freshness of the flower.
- Continue the chat by passing another image.
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image depicts a single, large, red rose in full bloom. The rose is positioned against a blurred background that suggests a natural setting, possibly outdoors. The petals of the rose are vividly red with a slight sheen, indicating that they are wet, likely from recent rainfall or dew. The petals are tightly packed and have a velvety texture, which is characteristic of roses. The edges of the petals are slightly curled and appear to be glistening with water droplets, enhancing the overall freshness and beauty of the flower.
The stem of the rose is visible and appears to be green, with a few small thorns scattered along its length. The stem is slender and supports the weight of the large, showy head of the rose. The leaves that accompany the stem are not fully visible in the image but are implied by the presence of the stem.
The background is out of focus, which helps to emphasize the rose as the main subject of the image. The blurred background suggests a natural environment, possibly a garden or a field, with hints of greenery and possibly other flowers or plants. The lighting in the image is natural, likely from sunlight, which casts soft shadows on the petals and adds depth to the scene.
The overall composition of the image focuses on the rose, making it the central point of interest. The wetness of the petals adds a dynamic element to the stillness of the flower, giving it a sense of life and vitality. This could symbolize themes of beauty, nature, and perhaps even passion or love.
In summary, this image captures a single red rose in full bloom with wet petals against a blurred natural background. The rose is the focal point, with its vibrant red color and glistening petals drawing attention. The natural lighting and out-of-focus background enhance the beauty and freshness of the flower.
> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg What mountain is this?
The mountain is Mount Washington.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image depicts a majestic mountain landscape under a partly cloudy sky, characterized by its rugged and snow-covered peaks. The mountain is prominently featured in the center of the image, showcasing its expansive and undulating terrain. The summit of the mountain is capped with snow, indicating that it might be winter or early springtime.
The slopes of the mountain are steep and uneven, covered with patches of snow that appear to have been recently fallen or freshly groomed for skiing or other winter activities. There are visible ski trails descending from the summit down towards what seems to be a valley below, suggesting that this location could be a popular ski resort area.
In addition to the main peak, there are smaller hills and ridges surrounding it on both sides. These secondary peaks also have varying degrees of snow cover but appear less prominent than the central peak.
The sky above is mostly overcast with clouds covering most parts but allowing some sunlight to peek through in certain areas, casting soft shadows on parts of the mountainside. This lighting suggests that it might not be midday yet as there isn't an intense brightness typical for noon hours.
On closer inspection near one side of this grandeur scene stands tall trees without leaves; their bare branches starkly contrasting against both white snow and blue sky create an interesting... (cut off)
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m HuggingFaceM4/Idefics3-8B-Llama3
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "HuggingFaceM4/Idefics3-8B-Llama3";
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q8_0)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="HuggingFaceM4/Idefics3-8B-Llama3",
arch=VisionArchitecture.Idefics3,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
UQFF models
Coming soon!
LLaVA and LLaVANext Model: llava-hf model family
LLaVA and LLaVANext are multimodal models that can handle both text and vision inputs.
This implementation supports both LLaVA and LLaVANext (which adds multi-resolution image processing) and two LLM base models: Llama and Mistral. It is currently tested on:
- llava-hf/llava-v1.6-mistral-7b-hf
- llava-hf/llava-v1.6-vicuna-7b-hf
- llava-hf/llava-1.5-7b-hf
The LLaVA and LLaVANext models have support in the Rust, Python, and HTTP APIs. They also support ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image?
Output:
The image shows a steep, snow-covered hillside with a pine tree on the right side, close to the top. The landscape appears to be a mountainous area with winter conditions. There are no visible humans or permanent structures in the immediate vicinity that suggest this is a summer or recreational location. It's likely a cold, snowy day or season, and the slopes might be part of a mountainous region.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m llava-hf/llava-v1.6-mistral-7b-hf
# or for vicuna backend, specify the chat template:
mistralrs serve vision -p 1234 --isq 4 -c ./chat_templates/vicuna.json -m llava-hf/llava-v1.6-vicuna-7b-hf
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the LLaVA and LLaVANext model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(
"llava-hf/llava-v1.6-mistral-7b-hf",
)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_llava_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="llava-hf/llava-v1.6-mistral-7b-hf",
arch=VisionArchitecture.LLaVANext,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image?",
},
],
},
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Llama 3.2 Vision Model: meta-llama/Llama-3.2-11B-Vision-Instruct
Mistral.rs supports the Llama 3.2 vision model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: Some examples use the Cephalo Llama 3.2 model, a member of the Cephalo model collection. This model is a finetune of Llama 3.2 with enhanced capabilities for scientific images. To use the base Llama 3.2 Vision model, simply use the associated model ID.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. The text model has 40 layers.
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
https://github.com/user-attachments/assets/4d11c35c-9ea2-42b8-8cab-5f7e8e2ee9ff
- Start up interactive mode with the Llama 3.2 model
mistralrs run vision --isq 4 -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
- Say hello!
> Hello!
How can I assist you today?
- Pass the model an image and ask a question.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
- Continue the chat by passing another image.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg What mountain is this?
The image appears to be of Mount Washington, which is the highest peak in the Northeastern United States. It is located in the White Mountains of New Hampshire and is known for its extreme weather conditions, including high winds and low temperatures. The mountain's summit reaches an elevation of approximately 6,288 feet (1,917 meters) above sea level.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image shows Mount Washington, the highest peak in the Northeastern United States, located in the White Mountains of New Hampshire. The scene captures the mountain's rugged terrain and varied landscape features.
In the foreground, there are dense forests of coniferous trees, primarily spruce and fir, which are typical of the region's boreal forest ecosystem. The trees are densely packed, indicating a high level of vegetation cover and biodiversity.
Moving upwards, the image reveals rocky outcroppings and boulders scattered across the slope, indicating the mountain's geological history of glacial activity. The presence of these rocks suggests that the area was once covered by ice sheets during the last ice age, which carved out the landscape and left behind a mix of boulders and talus slopes.
In the mid-ground, the image shows a series of ridges and valleys, which are characteristic of the mountain's glacially sculpted terrain. These features were formed by the movement of ice sheets that carved out U-shaped valleys and left behind a series of rounded hills and ridges.
At the summit, there is a prominent observation tower or weather station, which is likely used for scientific research and weather monitoring. The structure is situated at an elevation of approximately 6,288 feet (1,917 meters) above sea level, making it one of the highest points in the region.
The image also captures the atmospheric conditions on Mount Washington, with clouds and mist visible in the background. The mountain's unique location in a region where cold Arctic air meets warm moist air from the Gulf Stream creates a unique microclimate known as the "Home Rule," where extreme weather conditions can occur.
Overall, the image showcases the diverse geological and ecological features of Mount Washington, highlighting its role as a significant natural landmark in the Northeastern United States.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k";
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.VLlama,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
UQFF models
UQFF is a quantized file format similar to GGUF based on ISQ. It removes the memory and compute requirements that come with ISQ by providing ready-made quantizations! The key advantage over GGUF is the flexibility to store multiple quantizations in one file.
We provide UQFF files (EricB/Llama-3.2-11B-Vision-Instruct-UQFF) for this Llama 3.2 Vision model.
You can use these UQFF files to easily use quantized versions of Llama 3.2 Vision.
For example:
mistralrs run -m meta-llama/Llama-3.2-11B-Vision-Instruct --from-uqff EricB/Llama-3.2-11B-Vision-Instruct-UQFF/llama-3.2-11b-vision-q4k.uqff
Llama 4 Series: meta-llama/Llama-4-Scout-17B-16E-Instruct
🚧 We are preparing a collection of UQFF quantized models! 🚧
The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences.
Architecture:
- Efficient inference: 17B activated parameters
- Very sparse: 1 activated expert for both Scout (of 16), and Maverick (of 128)
- RoPE enhancement: iRoPE enables high context-length functionality
Integration in mistral.rs:
- Tool calling + Automatic web search
- ISQ
- Rust, Python and HTTP APIs
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:
Credit
Prompt:
Please describe this image in detail.
Output:
The image presents a breathtaking mountain landscape, with a snow-capped peak dominating the scene. The mountain's rugged terrain is characterized by numerous ridges and valleys, while its summit is adorned with several structures that appear to be communication towers or antennas.
**Key Features:**
* **Mountain:** The mountain is the central focus of the image, showcasing a mix of snow-covered and bare areas.
* **Sky:** The sky above the mountain features a dramatic display of clouds, with dark grey clouds at the top gradually giving way to lighter blue skies towards the bottom.
* **Valley:** In the foreground, a valley stretches out, covered in trees that are mostly bare, suggesting a winter setting.
* **Lighting:** The lighting in the image is striking, with the sun casting a warm glow on the mountain's snow-covered slopes while leaving the surrounding areas in shadow.
**Overall Impression:**
The image exudes a sense of serenity and majesty, capturing the beauty of nature in a dramatic and awe-inspiring way. The contrast between the snow-covered mountain and the bare trees in the valley creates a visually appealing scene that invites the viewer to appreciate the natural world.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m meta-llama/Llama-4-Scout-17B-16E-Instruct
- Send a request
from openai import OpenAI
import httpx
import textwrap
import json
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "Please describe this image in detail.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Llama 4 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new(
"meta-llama/Llama-4-Scout-17B-16E-Instruct",
)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is this?",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
arch=VisionArchitecture.Llama4,
),
in_situ_quant="4",
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
MiniCPM-O 2.6 Model: openbmb/MiniCPM-o-2_6
Mistral.rs supports the MiniCPM-O 2.6 model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are coming soon.
Note
Only the vision portion of this model has been implemented. No audio features are supported yet.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
- Start up interactive mode with the MiniCPM-O 2.6 model
mistralrs run vision --isq 4 -m openbmb/MiniCPM-o-2_6
- Say hello!
> Hello!
How can I assist you today?
- Pass the model an image and ask a question.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image shows Mount Washington, the highest peak in the Northeastern United States, located in the White Mountains of New Hampshire. The scene captures the mountain's rugged terrain and varied landscape features.
In the foreground, there are dense forests of coniferous trees, primarily spruce and fir, which are typical of the region's boreal forest ecosystem. The trees are densely packed, indicating a high level of vegetation cover and biodiversity.
Moving upwards, the image reveals rocky outcroppings and boulders scattered across the slope, indicating the mountain's geological history of glacial activity. The presence of these rocks suggests that the area was once covered by ice sheets during the last ice age, which carved out the landscape and left behind a mix of boulders and talus slopes.
In the mid-ground, the image shows a series of ridges and valleys, which are characteristic of the mountain's glacially sculpted terrain. These features were formed by the movement of ice sheets that carved out U-shaped valleys and left behind a series of rounded hills and ridges.
At the summit, there is a prominent observation tower or weather station, which is likely used for scientific research and weather monitoring. The structure is situated at an elevation of approximately 6,288 feet (1,917 meters) above sea level, making it one of the highest points in the region.
The image also captures the atmospheric conditions on Mount Washington, with clouds and mist visible in the background. The mountain's unique location in a region where cold Arctic air meets warm moist air from the Gulf Stream creates a unique microclimate known as the "Home Rule," where extreme weather conditions can occur.
Overall, the image showcases the diverse geological and ecological features of Mount Washington, highlighting its role as a significant natural landmark in the Northeastern United States.
- Start the server
mistralrs serve vision -p 1234 --isq 4 -m openbmb/MiniCPM-o-2_6
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "openbmb/MiniCPM-o-2_6";
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "openbmb/MiniCPM-o-2_6"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.MiniCpmO,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Mistral Small 3.1 Model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
The Mistral Small 3.1 model is a multimodal (text+vision) model with a 128k context length, function calling, and strong visual understanding.
We support the Mistral 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Tool calling with Mistral Small 3.1
The Mistral Small 3.1 model itself does not ship with the correct Jinja chat template to enable tool calling. We provide a chat template for tool calling with Mistral Small 3.1, which you can use by specifying the jinja_explicit parameter in the various APIs. For example:
mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
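Once the server is running with that template, tool definitions go through the standard OpenAI tools field. A minimal sketch in which the get_weather function and its schema are hypothetical:

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Hypothetical tool; the model decides whether to emit a tool call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the weather in Boston right now?"}],
    tools=tools,
)
print(completion.choices[0].message.tool_calls)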
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:
Credit
Prompt:
What is this?
Output:
The image shows a close-up of a vibrant flower with pink petals and a central cluster of yellowish-brown stamens. This flower appears to be from the genus *Gazania*, commonly known as treasure flowers or gazanias. These flowers are known for their daisy-like appearance and bright colors.
Gazania flowers typically have ray florets (the petal-like structures) that can change color based on light conditions—often appearing more vibrant in direct sunlight. They are popular in gardens for their hardiness and ability to thrive in sunny locations with well-drained soil.
If there's anything specific about this flower or its care that interests you further, feel free to ask!
- Start the server
mistralrs serve vision -p 1234 -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
- Send a request
from openai import OpenAI
import httpx
import textwrap
import json
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/f/fd/Pink_flower.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Mistral 3 model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
arch=VisionArchitecture.Mistral3,
),
in_situ_quant="4"
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is this?",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Phi 3.5 MoE Model: microsoft/Phi-3.5-MoE-instruct
The Phi 3.5 MoE model is a 16x3.8B parameter decoder-only text-to-text mixture-of-experts LLM.
- Context length of 128k tokens
- Trained on 4.9T tokens
- 16 experts (16x3.8B parameters) with 6.6B active parameters
- Expect inference performance of a 7B model
About the MoE mechanism (see the sketch after this list):
- Compute the router gating logits
- From the router gating logits, select the top-2 experts and their associated weights
- The hidden state for each token in the sequence is computed by applying each selected expert to that token and weighting the result
- If multiple experts are selected for a token, this becomes a weighted sum
- The design is flexible: 2 or 1 experts can be selected, enabling dense or sparse gating
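A minimal numpy sketch of that top-2 routing (names and shapes are illustrative only, not the model's actual code):

import numpy as np

def moe_layer(hidden, router_w, experts, top_k=2):
    # hidden: (tokens, d); router_w: (d, n_experts); experts: list of callables (d,) -> (d,)
    logits = hidden @ router_w                         # router gating logits
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # indices of the top-k experts per token
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                   # weights over the selected experts
        for weight, e in zip(w, top[t]):               # weighted sum of expert outputs
            out[t] += weight * experts[e](hidden[t])
    return out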
mistralrs run --isq 4 -m microsoft/Phi-3.5-MoE-instruct
Note
This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:
mistralrs run --isq 4 -m microsoft/Phi-3.5-MoE-instruct --isq-organization moqe
HTTP API
mistralrs serve --isq 4 -p 1234 -m microsoft/Phi-3.5-MoE-instruct
import openai
client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
messages.append({"role": "system", "content": prompt})
while True:
prompt = input(">>> ")
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
model="default",
messages=messages,
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
messages.append({"role": "assistant", "content": resp})
Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="microsoft/Phi-3.5-MoE-instruct",
arch=Architecture.Phi3_5MoE,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("microsoft/Phi-3.5-MoE-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Phi 3 Vision Model: microsoft/Phi-3.5-vision-instruct
The Phi 3 Vision Model has support in the Rust, Python, and HTTP APIs. The Phi 3 Vision Model supports ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: The Phi 3 Vision model works best with a single image, although sending multiple images is supported.
Note: When sending multiple images, they will be resized to the minimum dimension in which all of them fit without cropping. Aspect ratio is not preserved in that case.
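Sending several images just means adding more image_url parts to the same user message; a minimal sketch of such a request (both URLs are placeholders):

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Two image parts in one user turn; they are resized together as described above.
completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/first.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/second.jpg"}},
                {"type": "text", "text": "Compare these two images."},
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)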
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
The image captures a breathtaking view of a mountain peak, bathed in the soft glow of sunlight. The peak, dusted with a layer of snow, stands tall against the backdrop of a clear blue sky. A trail, etched into the mountain's side by countless hikers before it, winds its way up to the summit. The trail's white color contrasts sharply with the surrounding landscape, drawing attention to its path and inviting exploration.
The perspective from which this photo is taken offers an expansive view of the mountain and its surroundings. It seems as if one could look down from this vantage point and see miles upon miles of untouched wilderness stretching out into the distance. The colors in the image are predominantly blue and white, reflecting both sky and snow-covered mountains respectively. However, there are also hints of green from trees dotting lower parts of mountainsides or valleys below them - adding another layer to this picturesque scene. This serene landscape evokes feelings of tranquility and adventure at once - an invitation to explore nature's grandeur while respecting its majesty at all times!
- Start the server
mistralrs serve vision -p 1234 -m microsoft/Phi-3.5-vision-instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Phi 3 Vision model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("microsoft/Phi-3.5-vision-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_phiv_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="microsoft/Phi-3.5-vision-instruct",
arch=VisionArchitecture.Phi3V,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/e/e7/ Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Phi 4 Multimodal Model: microsoft/Phi-4-multimodal-instruct
The Phi 4 Multimodal Model has support in the Rust, Python, and HTTP APIs. The Phi 4 Multimodal Model supports ISQ for increased performance.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: The Phi 4 Multimodal model works best with a single image, although sending multiple images is supported.
Note: when sending multiple images, they will be resized to the minimum dimension at which all of them fit without cropping. Aspect ratio is not preserved in that case.
Phi 4 Multimodal also supports audio inputs!
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Credit
Prompt:
What is shown in this image? Write a detailed response analyzing the scene.
Output:
A mountain with snow on it.
- Start the server
mistralrs serve vision -p 1234 -m microsoft/Phi-4-multimodal-instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
This is a minimal example of running the Phi 4 Multimodal model with a dummy image.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new("microsoft/Phi-4-multimodal-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is depicted here? Please describe the scene in detail.",
image,
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="microsoft/Phi-4-multimodal-instruct",
arch=VisionArchitecture.Phi4MM,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/e/e7/Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg"
},
},
{
"type": "text",
"text": "What is shown in this image? Write a detailed response analyzing the scene.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Audio input
Alongside vision, Phi 4 Multimodal in mistral.rs can accept audio as an additional modality. This unlocks fully-local pipelines such as text + speech + vision → text where the model can reason jointly over what it hears and what it sees.
mistral.rs automatically decodes the supplied audio (WAV/MP3/FLAC/OGG/… – anything Symphonia can handle) into 16-bit PCM.
OpenAI HTTP API
Audio is delivered with the audio_url content-type that mirrors OpenAIʼs official specification:
{
"role": "user",
"content": [
{
"type": "audio_url",
"audio_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg" }
},
{
"type": "image_url",
"image_url": { "url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg" }
},
{
"type": "text",
"text": "Describe what is happening in this clip in as much detail as possible."
}
]
}
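For reference, here is a minimal sketch of sending that payload with the openai Python client, assuming the server was started on port 1234 as in the earlier examples:
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg"},
                },
                {
                    "type": "text",
                    "text": "Describe what is happening in this clip in as much detail as possible.",
                },
            ],
        },
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)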
Rust SDK
use anyhow::Result;
use mistralrs::{AudioInput, IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new("microsoft/Phi-4-multimodal-instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let audio_bytes = reqwest::blocking::get(
"https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg",
)?
.bytes()?
.to_vec();
let audio = AudioInput::from_bytes(&audio_bytes)?;
let image_bytes = reqwest::blocking::get(
"https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg",
)?
.bytes()?
.to_vec();
let image = image::load_from_memory(&image_bytes)?;
let messages = VisionMessages::new()
.add_multimodal_message(
TextMessageRole::User,
"Describe in detail what is happening.",
vec![image],
vec![audio],
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
Ok(())
}
With this, you now have a single-call pipeline that fuses sound, vision, and text – all running locally through mistral.rs! 🔥
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Qwen 2 Vision Model: Qwen2-VL Collection
Mistral.rs supports the Qwen2-VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. The text model has 28 layers.
ToC
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
- Start up interactive mode with the Qwen2-VL model
mistralrs run vision -m Qwen/Qwen2-VL-2B-Instruct
- Say hello!
> Hello!
Hello! How can I assist you today?
- Pass the model an image and ask a question.
> Hello!
Hello! How can I assist you today?
> \image https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg What type of flower is this? Give some fun facts.
flowers are a type of flowering plant that produce flowers that are typically used for decoration, pollination, and reproduction. there are many different types of flowers, each with its own unique characteristics and uses. here are some fun facts about camellias:
* camellias are native to china and have been cultivated for over 2,000 years.
* camellias are known for their long blooming season, with some varieties blooming continuously for months.
* camellias come in a wide variety of colors, including red, pink, white, and yellow.
* camellias are also known for their fragrant blooms, which can be enjoyed by both humans and animals.
* camellias are often used in gardens and parks as a decorative element, and are also popular in landscaping and horticulture.
camellias are also known for their resilience and ability to thrive in a variety of conditions, making them a popular choice for gardeners and landscapers. they require well-draining soil and full sun or partial shade, and can be grown in containers or in the ground. overall, camellias are a beautiful and versatile flower that can add beauty and interest to any garden or landscape.
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
Image:

Prompt:
What type of flower is this? Give some fun facts.
Output:
flowers are a beautiful addition to any garden or outdoor space. They come in many different colors and shapes, and can be used for both decorative purposes and as sources of pollination for bees and other insects.
One fun fact about camellias is that they are native to Japan, but were introduced to Europe in the 17th century by Portuguese sailors who brought them back from their voyages around the world. Camellias have been popular as ornamental plants since then, with many varieties available for cultivation.
Camellias also have interesting cultural significance in Japan, where they are often associated with good fortune and prosperity. In Chinese culture, camellias symbolize longevity and immortality.
In conclusion, camellias are beautiful flowers that add color and interest to gardens or outdoor spaces. They come in many different colors and shapes, making them a popular choice for gardeners everywhere!
- Start the server
mistralrs serve vision -p 1234 -m Qwen/Qwen2-VL-2B-Instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
const MODEL_ID: &str = "Qwen/Qwen2-VL-2B-Instruct";
#[tokio::main]
async fn main() -> Result<()> {
let model =
VisionModelBuilder::new(MODEL_ID)
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What type of flower is this? Give some fun facts.",
image,
&model
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.Qwen2VL,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Qwen 3 Vision Model: Qwen3 VL Collection
The Qwen 3 VL models are the successors to the Qwen 2.5 VL models, featuring a diverse lineup with improved performance, flexible sizes, and reasoning-capable variants.
Note: Support for the MoE variants is not yet implemented. This is coming very soon!
Mistral.rs supports the Qwen 3 VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with lower memory requirements.
UQFF quantizations are also available.
The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- Base64 encoded string
The Rust SDK takes an image from the image crate.
Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters.
ToC
Interactive mode
Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.
Start up interactive mode with the Qwen3 VL model:
mistralrs run vision -m Qwen/Qwen3-VL-4B-Instruct
HTTP server
You can find this example here.
We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.
Note: The image_url may be either a path, URL, or a base64 encoded string.
- Start the server
mistralrs serve vision -p 1234 -m Qwen/Qwen3-VL-4B-Instruct
- Send a request
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
completion = client.chat.completions.create(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
},
],
max_tokens=256,
frequency_penalty=1.0,
top_p=0.1,
temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
Rust
You can find this example here.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = VisionModelBuilder::new("Qwen/Qwen3-VL-4B-Instruct")
.with_isq(IsqType::Q4K)
.with_logging()
.build()
.await?;
let bytes = match reqwest::blocking::get(
"https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
) {
Ok(http_resp) => http_resp.bytes()?.to_vec(),
Err(e) => anyhow::bail!(e),
};
let image = image::load_from_memory(&bytes)?;
let messages = VisionMessages::new().add_image_message(
TextMessageRole::User,
"What is this?",
vec![image],
&model,
)?;
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Python
You can find this example here.
This example demonstrates loading and sending a chat completion request with an image.
Note: the image_url may be either a path, URL, or a base64 encoded string.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture
MODEL_ID = "Qwen/Qwen3-VL-4B-Thinking"
runner = Runner(
which=Which.VisionPlain(
model_id=MODEL_ID,
arch=VisionArchitecture.Qwen3VL,
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
},
},
{
"type": "text",
"text": "What type of flower is this? Give some fun facts.",
},
],
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
- You can find an example of encoding the image via base64 here.
- You can find an example of loading an image locally here.
FLUX.1 Model: black-forest-labs/FLUX.1-schnell
The FLUX model is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions.
We support both the -schnell and -dev versions of the model.
Memory usage
The FLUX model itself is 12 billion parameters (~24GB), and the T5 XXL encoder model it uses requires ~9GB. We support loading the models fully onto the GPU, which allows much faster inference. If you do not have enough memory, try the offloaded (-offloaded or -Offloaded) model types. These will load the model on the CPU but perform computations on the GPU.
| Type | Memory requirement | Generation Time (s), A100 |
|---|---|---|
| Normal | ~33GB | 9.4 |
| Offloaded | ~4GB | 92.7 |
HTTP server
The OpenAI HTTP server provides a compatible way to easily use this implementation. As per the specification, output images can be returned as local paths to images or be encoded to base64.
mistralrs serve diffusion -p 1234 -m black-forest-labs/FLUX.1-schnell -a flux
After this, you can send requests via the HTTP server:
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
result = client.images.generate(
model="default",
prompt="A vibrant sunset in the mountains, 4k, high quality.",
n=1,
)
print(result.data[0].url)
Rust example
use std::time::Instant;
use anyhow::Result;
use mistralrs::{DiffusionLoaderType, DiffusionModelBuilder, ImageGenerationResponseFormat};
#[tokio::main]
async fn main() -> Result<()> {
let model = DiffusionModelBuilder::new(
"black-forest-labs/FLUX.1-schnell",
DiffusionLoaderType::FluxOffloaded,
)
.with_logging()
.build()
.await?;
let start = Instant::now();
let response = model
.generate_image(
"A vibrant sunset in the mountains, 4k, high quality.".to_string(),
ImageGenerationResponseFormat::Url,
)
.await?;
let finished = Instant::now();
println!(
"Done! Took {} s. Image saved at: {}",
finished.duration_since(start).as_secs_f32(),
response.data[0].url.as_ref().unwrap()
);
Ok(())
}
Python example
from mistralrs import (
Runner,
Which,
DiffusionArchitecture,
ImageGenerationResponseFormat,
)
runner = Runner(
which=Which.DiffusionPlain(
model_id="black-forest-labs/FLUX.1-schnell",
arch=DiffusionArchitecture.FluxOffloaded,
),
)
res = runner.generate_image(
"A vibrant sunset in the mountains, 4k, high quality.",
ImageGenerationResponseFormat.Url,
)
print(res.choices[0].url)
Dia 1.6b Model: nari-labs/Dia-1.6B
Dia is a 1.6B parameter text to speech model created by Nari Labs. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.
- Generate dialogue via the [S1] and [S2] tags
- Generate non-verbals like (laughs), (coughs), etc.
- The verbal tags below will be recognized, but might result in unexpected output: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
Note: voice cloning support is coming!
HTTP server
The OpenAI HTTP server provides a drop-in compatible way to easily use Dia locally!
Note: we only support pcm and wav outputs.
mistralrs serve speech -p 1234 -m nari-labs/Dia-1.6B -a dia
After this, you can send requests via the HTTP server:
from pathlib import Path
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
# text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github."
response = client.audio.speech.create(
model="default", voice="N/A", input=text_to_speak, response_format="wav"
)
output_path = Path("output.wav")
output_path.write_bytes(response.read())
print(f"WAV audio written to {output_path.resolve()}")
Rust example
use std::time::Instant;
use anyhow::Result;
use mistralrs::{speech_utils, SpeechLoaderType, SpeechModelBuilder};
#[tokio::main]
async fn main() -> Result<()> {
let model = SpeechModelBuilder::new("nari-labs/Dia-1.6B", SpeechLoaderType::Dia)
.with_logging()
.build()
.await?;
let start = Instant::now();
// let text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face.";
let text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github.";
let (pcm, rate, channels) = model.generate_speech(text_to_speak).await?;
let finished = Instant::now();
let mut output = std::fs::File::create("out.wav").unwrap();
speech_utils::write_pcm_as_wav(&mut output, &pcm, rate as u32, channels as u16).unwrap();
println!(
"Done! Took {} s. Audio saved at `out.wav`.",
finished.duration_since(start).as_secs_f32(),
);
Ok(())
}
Python example
from mistralrs import (
Runner,
Which,
SpeechLoaderType,
)
from pathlib import Path
import wave, struct
# text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github."
runner = Runner(
which=Which.Speech(
model_id="nari-labs/Dia-1.6B",
arch=SpeechLoaderType.Dia,
),
)
res = runner.generate_speech(text_to_speak)
print(res.choices[0].url)
pcm_data = res.pcm # list of floats between -1.0 and 1.0
output_path = Path("output.wav")
# convert floats to 16-bit PCM ints
pcm_ints = [int(max(-32768, min(32767, int(sample * 32767)))) for sample in pcm_data]
with wave.open(str(output_path), "wb") as wf:
wf.setnchannels(res.channels) # channel count reported by the model
wf.setsampwidth(2) # 2 bytes per sample (16-bit)
wf.setframerate(res.rate) # sample rate reported by the model
wf.writeframes(b"".join(struct.pack("<h", s) for s in pcm_ints))
print(f"WAV audio written to {output_path.resolve()}")
EmbeddingGemma
EmbeddingGemma was the first embedding model supported by mistral.rs. This guide walks through serving the model via the OpenAI-compatible HTTP server, running it from Python, and embedding text directly in Rust.
For a catalog of available embedding models and general usage tips, see EMBEDDINGS.md.
Prompt instructions
EmbeddingGemma can generate optimized embeddings for various use cases (such as document retrieval, question answering, and fact verification) or for specific input types (a query or a document) using prompts that are prepended to the input strings. A small formatting sketch follows the table below.
- Query prompts follow the form task: {task description} | query: , where the task description varies by use case; the default task description is search result.
- Document-style prompts follow the form title: {title | "none"} | text: , where the title is either none (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.
| Use Case (task type enum) | Descriptions | Recommended Prompt |
|---|---|---|
| Retrieval (Query) | Used to generate embeddings that are optimized for document search or information retrieval. | task: search result | query: {content} |
| Retrieval (Document) | Used to generate embeddings that are optimized for document search or information retrieval (document side). | title: {title | "none"} | text: {content} |
| Question Answering | Used to generate embeddings that are optimized for answering natural language questions. | task: question answering | query: {content} |
| Fact Verification | Used to generate embeddings that are optimized for verifying factual correctness. | task: fact checking | query: {content} |
| Classification | Used to generate embeddings that are optimized to classify texts according to preset labels. | task: classification | query: {content} |
| Clustering | Used to generate embeddings that are optimized to cluster texts based on their similarities. | task: clustering | query: {content} |
| Semantic Similarity | Used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases. | task: sentence similarity | query: {content} |
| Code Retrieval | Used to retrieve a code block based on a natural language query, such as sort an array or reverse a linked list. Embeddings of code blocks are computed using retrieval_document. | task: code retrieval | query: {content} |
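For reference, a small sketch of how these prompts can be assembled before sending them to the embeddings endpoint (the helper functions are illustrative, not part of mistral.rs):
def query_prompt(content, task="search result"):
    # Query-style prompt: "task: {task description} | query: {content}"
    return f"task: {task} | query: {content}"

def document_prompt(content, title="none"):
    # Document-style prompt: 'title: {title | "none"} | text: {content}'
    return f"title: {title} | text: {content}"

print(query_prompt("What is graphene?"))
# -> task: search result | query: What is graphene?
print(document_prompt("Graphene is a single layer of carbon atoms.", title="Graphene"))
# -> title: Graphene | text: Graphene is a single layer of carbon atoms.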
HTTP server
Launch the server in embedding mode to expose an OpenAI-compatible /v1/embeddings endpoint:
mistralrs serve -p 1234 -m google/embeddinggemma-300m
Once running, call the endpoint with an OpenAI client or raw curl:
curl http://localhost:1234/v1/embeddings \
-H "Authorization: Bearer EMPTY" \
-H "Content-Type: application/json" \
-d '{"model": "default", "input": ["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"]}'
An example with the OpenAI client can be found here.
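For reference, a minimal sketch of the same request using the official openai Python client (same placeholder API key and base URL as the other examples in these docs):
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
resp = client.embeddings.create(
    model="default",
    input=[
        "task: search result | query: What is graphene?",
        "task: search result | query: What is an apple?",
    ],
)
# One embedding vector per input string
print(len(resp.data), len(resp.data[0].embedding))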
By default the server registers the model as default. To expose it under a custom name or alongside chat
models, run in multi-model mode and assign an identifier in the selector configuration:
{
"embed-gemma": {
"Embedding": {
"model_id": "google/embeddinggemma-300m",
"arch": "embeddinggemma"
}
}
}
See docs/HTTP.md for the full request schema and response layout.
Python SDK
Instantiate Runner with the Which.Embedding selector and request EmbeddingGemma explicitly. The helper method
send_embedding_request returns batched embeddings as Python lists.
from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which
runner = Runner(
which=Which.Embedding(
model_id="google/embeddinggemma-300m",
arch=EmbeddingArchitecture.EmbeddingGemma,
)
)
request = EmbeddingRequest(
input=["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"],
truncate_sequence=True,
)
embeddings = runner.send_embedding_request(request)
print(len(embeddings), len(embeddings[0]))
Refer to this example for a complete runnable script.
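Because the embeddings come back as plain Python lists, downstream similarity scoring needs no extra dependencies. For example, a simple cosine similarity between the two vectors returned above:
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings[0], embeddings[1]))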
Rust SDK
Use the EmbeddingModelBuilder helper from the mistralrs crate to create the model and submit an
EmbeddingRequest:
use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};
#[tokio::main]
async fn main() -> Result<()> {
let model = EmbeddingModelBuilder::new("google/embeddinggemma-300m")
.with_logging()
.build()
.await?;
let embeddings = model
.generate_embeddings(
EmbeddingRequest::builder()
.add_prompt("task: search result | query: What is graphene?")
)
.await?;
println!("Returned {} vectors", embeddings.len());
Ok(())
}
This example lives here, and can be run with:
cargo run --package mistralrs --example embedding_gemma
Qwen3 Embedding
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks.
For a catalog of all embedding backends, see EMBEDDINGS.md.
HTTP server
Serve the model with the OpenAI-compatible endpoint enabled:
mistralrs serve -p 1234 -m Qwen/Qwen3-Embedding-0.6B
Call the endpoint via curl or the OpenAI SDK:
curl http://localhost:1234/v1/embeddings \
-H "Authorization: Bearer EMPTY" \
-H "Content-Type: application/json" \
-d '{"model": "default", "input": ["Graphene conductivity", "Explain superconductors in simple terms."]}'
An example with the OpenAI client can be found here.
To expose the model alongside chat models, register it in your selector configuration using the
qwen3embedding architecture tag:
{
"embed-qwen3": {
"Embedding": {
"model_id": "Qwen/Qwen3-Embedding-0.6B",
"arch": "qwen3embedding"
}
}
}
See docs/HTTP.md for the full request schema.
Python SDK
Instantiate Runner with the embedding selector and request Qwen3 explicitly. The output mirrors the
OpenAI embeddings array shape:
from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which
runner = Runner(
which=Which.Embedding(
model_id="Qwen/Qwen3-Embedding-0.6B",
arch=EmbeddingArchitecture.Qwen3Embedding,
)
)
request = EmbeddingRequest(
input=["Graphene conductivity", "Explain superconductors in simple terms."],
truncate_sequence=True,
)
embeddings = runner.send_embedding_request(request)
print(len(embeddings), len(embeddings[0]))
A ready-to-run version can be found at examples/python/qwen3_embedding.py.
Rust SDK
Use the EmbeddingModelBuilder helper just like with EmbeddingGemma. The example below mirrors the
repository sample:
use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};
#[tokio::main]
async fn main() -> Result<()> {
let model = EmbeddingModelBuilder::new("Qwen/Qwen3-Embedding-0.6B")
.with_logging()
.build()
.await?;
let embeddings = model
.generate_embeddings(
EmbeddingRequest::builder()
.add_prompt("What is graphene?")
.add_prompt("Explain superconductors in simple terms.")
)
.await?;
println!("Returned {} vectors", embeddings.len());
Ok(())
}
You can find the full example at mistralrs/examples/qwen3_embedding/main.rs.
Quantization in mistral.rs
Mistral.rs supports the following quantization methods:
- ⭐ ISQ (read more detail)
- Supported in all plain/vision and adapter models
- Works on all supported devices
- Automatic selection to use the fastest and most accurate method
- Supports:
- Q, K type GGUF quants
- AFQ
- HQQ
- FP8
- GGUF/GGML
- Q, K type
- Supported in GGUF/GGML and GGUF/GGML adapter models
- Supported in all plain/vision and adapter models
- Imatrix quantization is supported
- I quants coming!
- CPU, CUDA, Metal (all supported devices)
- 2, 3, 4, 5, 6, 8 bit
- GPTQ (convert with this script)
- Supported in all plain/vision and adapter models
- CUDA only
- 2, 3, 4, 8 bit
- Marlin kernel support in 4-bit and 8-bit.
- AWQ (convert with this script)
- Supported in all plain/vision and adapter models
- CUDA only
- 4 and 8 bit
- Marlin kernel support in 4-bit and 8-bit.
- HQQ
- Supported in all plain/vision and adapter models via ISQ
- 4, 8 bit
- CPU, CUDA, Metal (all supported devices)
- FP8
- Supported in all plain/vision and adapter models
- CPU, CUDA, Metal (all supported devices)
- BNB
- Supported in all plain/vision and adapter models
- bitsandbytes int8, fp4, nf4 support
- AFQ
- 2, 3, 4, 6, 8 bit
- 🔥 Designed to be fast on Metal!
- Only supported on Metal.
- MLX prequantized
- Supported in all plain/vision and adapter models
Using a GGUF quantized model
- Use the gguf (CLI) / GGUF (Python) model selector
- Provide the GGUF file
mistralrs run --format gguf -f my-gguf-file.gguf
Using ISQ
See the docs
mistralrs run --isq 4 -m microsoft/Phi-3-mini-4k-instruct
Using a GPTQ quantized model
- Provide the model ID for the GPTQ model
- Mistral.rs will automatically detect and use GPTQ quantization for plain and vision models!
- The Marlin kernel will automatically be used for 4-bit and 8-bit.
mistralrs run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit
You can create your own GPTQ model using [scripts/convert_to_gptq.py](../scripts/convert_to_gptq.py):
pip install gptqmodel transformers datasets
python3 scripts/convert_to_gptq.py --src path/to/model --dst output/model/path --bits 4
Using a MLX prequantized model (on Metal)
- Provide the model ID for the MLX prequantized model
- Mistral.rs will automatically detect and use quantization for plain and vision models!
- Specialized kernels will be used to accelerate inference!
mistralrs run -m mlx-community/Llama-3.8-1B-8bit
In situ quantization
In situ quantization works by quantizing models in place, with the chief benefit being a reduced memory footprint when running the model. This enables larger models to be run on devices that could not fit the full-precision weights, and may also increase inference performance.
Quick start: Just use --isq 4 (or 2, 3, 5, 6, 8) and mistral.rs will pick the best quantization for your hardware:
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct
An API is exposed on the Python and Rust SDKs which provides the ability to dynamically re-ISQ models at runtime.
To set the ISQ type for individual layers, use a model topology.
Note: 🔥 AFQ (affine) quantization is designed to be fast on Metal but is only supported on Metal.
Automatic ISQ (just use a number!)
Instead of specifying a quantization type like Q4K, you can just pass an integer (2, 3, 4, 5, 6, or 8) and mistral.rs will automatically select the best quantization method for your platform.
On Metal, this uses fast AFQ quantization (for 2, 3, 4, 6, or 8 bits). On other platforms, it falls back to Q/K quantization.
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct
ISQ quantization types
- AFQ2 (AFQ is only available on Metal)
- AFQ3
- AFQ4
- AFQ6
- AFQ8
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q8_1 (not available on CUDA)
- Q2K
- Q3K
- Q4K
- Q5K
- Q6K
- Q8K (not available on CUDA)
- HQQ4
- HQQ8
- FP8
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct
When using ISQ, it will automatically load ISQ-able weights into CPU memory before applying ISQ. The ISQ application process moves the weights to device memory. This process is implemented to avoid memory spikes from loading the model in full precision.
For Mixture of Experts models, a method called MoQE can be applied to quantize only the MoE layers. This is configured via the ISQ “organization” parameter in all APIs. The following models support MoQE:
Accuracy
Accuracy of ISQ can be measured by the performance degradation versus the unquantized model. This is commonly measured with perplexity. Please see the perplexity example.
To improve the accuracy of a model with ISQ, use an imatrix file. These can be found online (for example, on Hugging Face), and should be passed with the --imatrix flag for plain models. This will increase the accuracy of the quantization significantly and bring the ISQ quantization up to par with the GGUF counterpart.
Check out the imatrix docs.
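For example, loading a downloaded imatrix alongside ISQ might look like the following (the .imatrix filename is a placeholder):
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct --imatrix path/to/model.imatrix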
Python Example
runner = Runner(
which=Which.Plain(
model_id="Qwen/Qwen3-0.6B",
),
in_situ_quant="4",
)
Rust Example
You can find this example here.
#![allow(unused)]
fn main() {
let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
}
Server example
mistralrs serve --port 1234 --isq 4 -m mistralai/Mistral-7B-Instruct-v0.1
Or with a specific quantization type:
mistralrs serve --port 1234 --isq Q4K -m mistralai/Mistral-7B-Instruct-v0.1
Universal Quantized File Format: UQFF
The uniquely powerful quantized file format.
- Flexible 🌀: Multiple quantization formats in one file format with one framework to run them all.
- Reliable 🔒: Compatibility ensured with embedded and checked semantic versioning information from day 1.
- Easy 🤗: Download UQFF models easily and quickly from Hugging Face, or use a local file.
- Customizable 🛠️: Make and publish your own UQFF files in minutes.
ToC
Motivation
UQFF builds on our ISQ feature by allowing serialization and deserialization for models.
While ISQ is a powerful feature enabling easy quantization of models, the key limitation has been the time required for requantization. While the process is relatively fast with parallelization and other techniques, multiple runs can make the experience slow.
Comparing UQFF to GGUF:
In contrast to GGUF, which only supports the GGUF quantizations, UQFF is designed with flexibility in mind. At its core, it extends the power and flexibility of ISQ. The ability to support multiple quantization types (more to come!) in one simple, easy-to-use file is a critical feature.
Additionally, users will no longer need to wait for GGUF support to begin using post-training quantized models. As we add new models and quantization schemes to mistral.rs, the feature set of UQFF will grow.
Support
The following quantization formats are supported in UQFF. They can, of course, be combined arbitrarily during UQFF generation or ISQ using a model topology. When loading a UQFF model, only the per-layer device mapping feature of the topology applies.
-
GGUF quantized:
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q8_1 (not available on CUDA)
- Q2K
- Q3K
- Q4K
- Q5K
- Q6K
- Q8K (not available on CUDA)
-
HQQ quantized:
- HQQ4
- HQQ8
-
FP8:
- FP8 E4M3 (4-bit exponent, 3-bit mantissa)
-
AFQ quantized (🔥 AFQ is fast on Metal):
- AFQ2
- AFQ3
- AFQ4
- AFQ6
- AFQ8
Loading a UQFF model
To load a UQFF model, specify the UQFF filename. The file is located based on the model ID and can be loaded locally or from Hugging Face. For example:
- phi3.5-mini-instruct-q4k.uqff
- ../UQFF/phi3.5-mini-instruct-q4k.uqff
You can find a collection of UQFF models here, which each include a simple command to get started.
Note: when loading a UQFF model, any ISQ setting will be ignored.
Running with the CLI
mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-f8e4m3.uqff
Using with the Rust SDK
Check out the following examples:
- Normal: uqff/main.rs
- Vision: uqff_vision/main.rs
Using the Python SDK
Modify the Which instantiation as follows:
Which.Plain(
model_id="EricB/Phi-3.5-mini-instruct-UQFF",
+ from_uqff="phi3.5-mini-instruct-q4k.uqff"
),
Using topology for device mapping with UQFF
When loading a UQFF model, the quantization is already baked in, so ISQ settings in the topology are ignored. However, device mapping from a topology file still applies. This is useful for splitting a pre-quantized model across multiple GPUs or offloading layers to CPU.
CLI example:
mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-q4k.uqff --topology device_map.yml
Topology file for device mapping only (device_map.yml):
0-16:
device: cuda[0]
16-32:
device: cuda[1]
Rust SDK example:
#![allow(unused)]
fn main() {
use mistralrs::{UqffTextModelBuilder, Topology, LayerTopology, Device};
let model = UqffTextModelBuilder::new(
"EricB/Phi-3.5-mini-instruct-UQFF",
vec!["phi3.5-mini-instruct-q4k.uqff".into()],
)
.into_inner()
.with_topology(
Topology::empty()
.with_range(0..16, LayerTopology { isq: None, device: Some(Device::Cuda(0)) })
.with_range(16..32, LayerTopology { isq: None, device: Some(Device::Cuda(1)) })
)
.build()
.await?;
}
Python SDK example:
runner = Runner(
which=Which.Plain(
model_id="EricB/Phi-3.5-mini-instruct-UQFF",
from_uqff="phi3.5-mini-instruct-q4k.uqff",
topology="device_map.yml",
),
)
Note: The isq field in topology entries is ignored when loading UQFF models since quantization is pre-applied.
Creating a UQFF model
Creating a UQFF model requires you to generate the UQFF file.
- This means specifying a local path to a file ending in .uqff, where your new UQFF model will be created.
- The quantization of a UQFF model is determined from the ISQ or model topology (see the topology docs for more details on how ISQ and the topology mix).
Along with the UQFF file, the generation process will also output several .json configuration files and residual.safetensors. All of these files are considered part of the UQFF model and should be kept together when uploading.
Note: Only the .uqff files are unique to the quantization level(s). If you are generating multiple UQFF files, it is OK for the other files to be overwritten.
After creating the UQFF file, you can upload the model to Hugging Face. To do this:
- Create a new model.
- Upload the UQFF file:
- With the web interface: guide here.
- With Git: steps here
- Locally, generate the model card file with this Python script.
- In the web interface, press the Create Model Card button and paste the generated model card.
⭐ Check out uqff_maker to make UQFF models with an easy CLI!
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq 4 -o phi3.5-mini-instruct-q4k.uqff
Upload with Git
To upload a UQFF model using Git, you will most likely need to set up Git LFS:
- Install git-lfs
- Run git lfs install
- (If the files are larger than 5GB) Run huggingface-cli lfs-enable-largefiles . (you will need to pip install huggingface_hub)
After this, you can use Git to track, commit, and push files.
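For example, a typical flow for tracking and pushing the UQFF artifacts (run inside your cloned model repository) looks roughly like:
git lfs track "*.uqff" "*.safetensors"
git add .
git commit -m "Add UQFF model files"
git push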
List of models
You can find a list of models in the Hugging Face model collection.
Have you created a UQFF model on Hugging Face? If so, please create an issue.
UQFF internal structure
The following describes the exact memory layout of UQFF tensors of version 0.1.0.
ToC
GGUF quantization
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (0) | u8 | little endian |
| Tensor data length in bytes | u32 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Quantized dtype | u32 | little endian |
| Num shape dims | u32 | little endian |
| Array quantized weight shape dims | u32 | little endian |
| Array quantized weight data | u8 | little endian |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
Unquantized layers
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (1) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Weight tensor data, see docs | See docs | See docs |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
FP8 layers
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (1) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Weight tensor data, see docs | See docs | See docs |
| Dequant W scalar | f32 | little endian |
| Dequant X scalar | f32 | little endian |
| Quant scalar | f32 | little endian |
| Quantization type | u32 | little endian |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
HQQ quantization
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (2) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Q weight, see docs | See docs | See docs |
| Array Q scale, see docs | See docs | See docs |
| Array Q zeroes, see docs | See docs | See docs |
| Dequant weight num shape dims | u32 | little endian |
| Array dequant weight shape dims | u32 | little endian |
| CFG bits | u8 | little endian |
| CFG group size | u32 | little endian |
| CFG axis | u8 | little endian |
| CFG optimization steps (0 means Option::None for now) | u32 | little endian |
| CFG round zeroes (boolean) | u8 | little endian |
| CFG channel wise (boolean) | u8 | little endian |
FP8 layers
| ID | Element type | Endianness |
|---|---|---|
| UQFF version | u32 | little endian |
| ISQ type (3) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| Array Weight tensor data, see docs | See docs | See docs |
| Dequant scale W | f32 | little endian |
| Dequant scale X | f32 | little endian |
| Quant scale | f32 | little endian |
| Layer dtype | u32 | little endian |
| [Optional] Array Bias tensor data, see docs | See docs | See docs |
Standard tensors
| ID | Element type | Endianness |
|---|---|---|
| Tensor data length in bytes | u32 | little endian |
| Tensor dtype | u32 | little endian |
| Num shape dims | u32 | little endian |
| Array shape dims | u32 | little endian |
| Array flattened (contiguous) tensor data | u8 | little endian |
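As an illustration only, the "Standard tensors" layout above can be read with a few unpack calls; the function below is a sketch based on the table and is not part of mistral.rs:
import struct

def read_standard_tensor_header(f):
    # Tensor data length in bytes, tensor dtype, and number of shape dims (all little-endian u32)
    data_len, dtype, n_dims = struct.unpack("<3I", f.read(12))
    # The shape dimensions follow, one u32 per dimension
    dims = list(struct.unpack(f"<{n_dims}I", f.read(4 * n_dims)))
    # The flattened (contiguous) tensor data, `data_len` bytes of u8, follows here
    return data_len, dtype, dims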
Model topology configuration
Quantization and device mapping in one file.
Note
Manual device mapping flags are deprecated in favor of automatic placement because it is easy to misconfigure them. Topology files remain the preferred way to express per-layer quantization, and you can still provide device overrides here when you truly need to. Those overrides win over the automatic mapper, so apply them sparingly. See the device mapping documentation for guidance.
Use a simple model topology to configure per-layer ISQ and device mapping with a single YAML file (examples here)!
To support per-layer mix of ISQ, Mistral.rs supports loading a model topology YAML file. This YAML file is formatted as follows:
- Top-level keys are either:
  - A range of layers (start-end), where start < end; start is inclusive and end is exclusive
  - A single layer number
- The topology for the range or layer:
  - An optional key isq, which maps to a single value that can be any ISQ type. If not specified, no ISQ is applied to this range of layers.
  - An optional key device, which maps to a single value, one of the following. If not specified, the default loading device will be used.
    - cpu
    - cuda[ORDINAL]
    - metal[ORDINAL]
Note that:
- The topology for the range is expanded to fill the range
- If ranges overlap, the range with the higher end layer takes precedence. When two ranges share the same end layer, the one that appears later in the topology file wins (see the example after this list).
- Any layers which are not covered will have no topology mapping. They will inherit any other ISQ setting (e.g. one set with --isq/in_situ_quant).
- If a layer is covered by the topology, the topology value will override any other ISQ setting (e.g. one set with --isq/in_situ_quant).
- The topology device mapping will override any other device mapping.
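For example, in the following topology the two ranges overlap on layers 4-7; per the overlap rule above, the 4-16 entry wins for those layers because it has the higher end layer, so layers 0-3 use Q4K and layers 4-15 use Q3K:
0-8:
  isq: Q4K
4-16:
  isq: Q3K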
Using topology with UQFF models
When loading a UQFF model, the quantization is already applied during UQFF creation. Therefore:
- ISQ settings in the topology are ignored - the pre-quantized weights are used as-is
- Device mapping still applies - you can split layers across GPUs or offload to CPU
This is useful for deploying pre-quantized models across multiple devices without re-quantizing.
Example topology for UQFF device mapping:
# Only device mapping is used; isq would be ignored
0-16:
device: cuda[0]
16-32:
device: cuda[1]
See the UQFF documentation for complete examples.
Regex selectors
Layer ranges are convenient when you know the numeric index, but you can also target weights by name. Keys wrapped in /.../ are interpreted as regular expressions that are matched against the fully qualified tensor name (for example, model.layers.3.attn.q_proj.weight). Regex selectors may override both isq and device.
'/attn\.q_proj$/':
isq: Q4K
'/ffn_.*\.weight$/':
isq: Q3K
Regex-based ISQ overrides are applied through the immediate ISQ system, so they quantize weights as they are loaded. Numeric layer ranges continue to be handled by the post-load topology pass. Regex selectors are evaluated top-to-bottom as they appear in the YAML file, so a selector that comes later in the file overrides earlier matches.
0-8:
isq: Q3K
device: cuda[0]
8-16:
isq: Q4K
device: cpu
16-24:
isq: Q6K
# Skip 24-28
28-32:
isq: Q8_0
device: cuda[0]
Model topologies may be applied to all model types.
CLI example
mistralrs run -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml
HTTP server example
mistralrs serve -p 1234 -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml
Rust example
Example here.
Python example
Example here.
Enhancing ISQ with an imatrix
Mistral.rs supports enhancing the performance of models quantized with ISQ by collecting an imatrix from calibration data. The following quantizations are supported with an imatrix:
Q2K, Q3K, Q4K, Q5K, Q6K
What is an imatrix? An imatrix (importance matrix) is generated from data collected during the execution of the model on calibration data. This data is used to enhance the performance of the model by enabling a weighted RMSE minimization when quantizing the tensor. For more information, see the original PR.
Using an imatrix causes the quantization process to take longer as the data must be collected, but there is no inference-time performance decrease.
Note: mistral.rs will automatically generate a .cimatrix file which can be used within mistral.rs as a replacement for a .imatrix file. The primary advantage is the in-situ generation within mistral.rs. The format is incompatible with llama.cpp.
To use this, simply specify the calibration data file in the various APIs as detailed below.
With the CLI
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct --calibration-file calibration_data/calibration_datav3_small.txt
With the Rust SDK
You can find this example here.
#![allow(unused)]
fn main() {
let model = TextModelBuilder::new("meta-llama/Llama-3.2-3B-Instruct")
.with_isq(IsqType::Q4K)
.with_calibration_file("calibration_data/calibration_datav3_small.txt".into())
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
.build()
.await?;
}
With the Python SDK
You can find this example here.
runner = Runner(
which=Which.Plain(
model_id="meta-llama/Llama-3.2-3B-Instruct",
calibration_file="calibration_data/calibration_datav3_small.txt"
),
in_situ_quant="4",
)
Adapter model support
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting an XLora* architecture, and LoRA support by selecting the Lora* architecture. For both X-LoRA and LoRA, an ordering file (see this section for preparing the ordering file) must be provided. The ordering file describes the ordering of layers and which adapters to use (and what order to use them in for X-LoRA).
When using an adapter model with a quantized base model, if the ordering file specifies unsupported layers you will receive an error.
Supported X-LoRA or LoRA quantized layers
Llama architecture:
- model.layers.{layer_idx}.self_attn.q_proj
- model.layers.{layer_idx}.self_attn.k_proj
- model.layers.{layer_idx}.self_attn.v_proj
- model.layers.{layer_idx}.self_attn.o_proj
- model.layers.{layer_idx}.mlp.up_proj
- model.layers.{layer_idx}.mlp.down_proj
- model.layers.{layer_idx}.mlp.gate_proj
- lm_head
Phi 3 architecture:
- model.layers.{layer_idx}.self_attn.qkv_proj
- model.layers.{layer_idx}.self_attn.o_proj
- model.layers.{layer_idx}.mlp.gate_up_proj
- model.layers.{layer_idx}.mlp.down_proj
- lm_head
Adapter ordering file
Preparing the X-LoRA/LoRA Ordering File
The X-LoRA/LoRA ordering file must be prepared before running inference with an X-LoRA model. However, it is easy to create with the provided scripts!
X-LoRA case
An ordering JSON file for X-LoRA contains 2 major parts.
- The adapter names
order- The order matters!
- Should be an array of strings which are the adapter names corresponding to the order the adapters were specified during training. For example, if the adapters were specified as a dictionary:
- The layer ordering
layers- Automatically generated and should not be manipulated as it controls the application of scalings.
adapters = {
"math": ...,
"reasoning": ...,
"biology": ...
}
The specified order would be ["math", "reasoning", "biology"].
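For illustration, the corresponding entry in the ordering JSON would then look like the snippet below (the other fields are placeholders; the full shape is shown in the preload_adapters example later in this document):
{
  "order": ["math", "reasoning", "biology"],
  "layers": { "...": "..." },
  "base_model_id": "..."
}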
We provide an ordering file which contains the ordering for the X-LoRA model associated with the paper and the Huggingface repository: https://huggingface.co/lamm-mit/x-lora.
LoRA case
An ordering JSON file for LoRA contains 2 major parts:
- The adapter names
order(optional):- The order does not matter
- Controls which adapters will be initially activated
- If this key is not specified, then no adapters will be activated initially
- Preload adapter section
preload_adapters(optional): see this section- Order does not matter
- Specifies the adapter name and the model ID to find them, which may be a local path.
Preparing the ordering file (LoRA or X-LoRA cases)
There are 2 scripts to prepare the ordering file and which work for both X-LoRA and LoRA. The ordering file is specific to each architecture and set of target modules. Therefore, if either are changed, it is necessary to create a new ordering file using the first option. If only the adapter order or adapters changed, then the second option should be used.
-
From scratch: No ordering file for the architecture and target modules
A script
create_ordering.pyis provided which prompts the user for the model ID, target modules, and adapter names. The user is prompted for an output file location, relative to the working directory. -
Create a new ordering file from an existing ordering file for an architecture and target modules
A script
set_names.pyis provided which prompts the user for the adapter names and the old ordering file. The user is prompted for an output file location, relative to the working directory.
Quantized X-LoRA or LoRA models
Mistral.rs supports running quantized models with X-LoRA or LoRA. The X-LoRA or LoRA adapter layers will not be quantized; only the base model is.
In the X-LoRA case, please note that using aggressive quantization (e.g., 4-bit) can distort the signal and prevent the classifier from acting properly. Therefore, it is better to use a milder quantization level such as 8-bit.
Avoiding the scaling pass with non-granular scalings
The X-LoRA implementation supports non-granular scalings. This caches the scalings after k completion tokens are generated and they will be used for the remaining passes avoiding the scaling pass. The number of tokens to generate before caching is defined by setting tgt_non_granular_index. Setting tgt_non_granular_index will restrict the maximum running sequences to 1.
Please see this page for more details and examples.
Adapter model dynamic adapter activation
We support dynamic adapter activation for LoRA models, allowing you to activate a set of adapters at runtime. There is a Python, Rust and HTTP API:
To use this feature, you should add a preload_adapters key to your ordering file:
{
"order": ["..."],
"layers": {"...": "123"},
"base_model_id": "...",
+ "preload_adapters": [{"name": "...", "adapter_model_id": "..."}] # New field here
}
This allows mistral.rs to preload the adapter and enable runtime activation.
Examples of LoRA and X-LoRA models
- X-LoRA with no quantization
To start an X-LoRA server exactly as presented in the paper:
mistralrs serve -p 1234 --xlora lamm-mit/x-lora --xlora-order orderings/xlora-paper-ordering.json
- LoRA with a model from GGUF
To start a LoRA server with adapters from the X-LoRA paper (you should modify the ordering file to use only one adapter, as the adapter static scalings are all 1 and so the signal will become distorted):
mistralrs serve -p 1234 --format gguf -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q8_0.gguf --lora lamm-mit/x-lora
Normally with a LoRA model you would use a custom ordering file. However, for this example we use the ordering from the X-LoRA paper because we are using the adapters from the X-LoRA paper.
X-LoRA non-granular scalings
A key limitation of the X-LoRA architecture is the need for 2 forward passes of the model per generation step. To trade off model performance for speed, mistral.rs allows the user to reduce the granularity of the scalings by caching them in a technique we call Non Granular Scalings.
How it works
For the first $k$ generation steps, the scalings are calculated normally for each token. However, for the rest of the tokens, it is cached and re-used. In this way, we are able to avoid the second forward pass and the performance is increased significantly. To maintain correctness, enabling non-granular scalings will restrict the engine to processing one sequence at a time.
How to use it
Command line
This can be enabled by passing --tgt-non-granular-index followed by $k$:
mistralrs serve -p 1234 --xlora lamm-mit/x-lora --xlora-order orderings/xlora-paper-ordering.json --tgt-non-granular-index 5
Python
Set the tgt_non_granular_index attribute to a non-None value in the Which selection:
from mistralrs import Runner, Which
runner = Runner(
which=Which.XLoraGGUF(
tok_model_id=None, # Automatically determine from ordering file
quantized_model_id="TheBloke/zephyr-7B-beta-GGUF",
quantized_filename="zephyr-7b-beta.Q4_0.gguf",
xlora_model_id="lamm-mit/x-lora",
order="orderings/xlora-paper-ordering.json",
tgt_non_granular_index=5,
)
)
...
Build a memory-efficient MoE model from anything, in seconds
AnyMoE is a technique for dynamically and efficiently creating MoE models. By providing a set of experts and a small pretraining dataset, you can create an MoE model locally!
It has the following features:
- Apply AnyMoE to any supported model
- Works with both plain and vision-plain model types
- Specify the layers to apply AnyMoE to for efficient training
Paper: https://arxiv.org/abs/2405.19076
https://github.com/EricLBuehler/mistral.rs/assets/65165915/33593903-d907-4c08-a0ac-d349d7bf33de
Note: By default, this has the capability to create a CSV loss log/image. When building from source (for Python or CLI), you may pass --no-default-features on the command line to disable this. This may be necessary if networking is unavailable.
Dataset
Currently, AnyMoE expects a JSON dataset with one top-level key, rows, which is an array of objects with keys prompt (string), expert (integer), and image_urls (optional array of strings). For example:
{
"rows": [
{
"prompt": "Discuss the impact of Renaissance art on modern aesthetics",
"expert": 0
},
{
"prompt": "Explain the significance of the theory of relativity in modern physics",
"expert": 1
}
]
}
For a vision model, image_urls may contain an array of image URLs/local paths or Base64 encoded images.
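As a concrete illustration, the snippet below writes a tiny dataset in this layout (a minimal sketch; the prompts and the output path are arbitrary):
import json
rows = [
    {"prompt": "Discuss the impact of Renaissance art on modern aesthetics", "expert": 0},
    {"prompt": "Explain the significance of the theory of relativity in modern physics", "expert": 1},
    # Vision models may additionally set "image_urls" per row, e.g.:
    # {"prompt": "Describe this image", "expert": 0, "image_urls": ["https://example.com/cat.png"]},
]
with open("examples/amoe.json", "w") as f:
    json.dump({"rows": rows}, f, indent=2)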
Experts
AnyMoE experts can be either fine-tuned models or LoRA adapter models. Only the mlp layers will be loaded from each. The experts must be homogeneous: all fine-tuned or all adapters. Additionally, you can specify which layers AnyMoE is applied to.
Note: When using LoRA adapter experts, it may not be necessary to set the layers where AnyMoE will be applied due to the lower memory usage.
Example of TOML selector with fine-tuned experts
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"
[anymoe]
dataset_json = "examples/amoe.json"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"
Example of TOML selector with LoRA adapter experts
[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"
[anymoe]
dataset_json = "examples/amoe.json"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]
Examples
CLI
CLI usage is via the TOML selector where you can also find docs on the required fields.
For example, to use the demo fine-tuned expert:
mistralrs from-config --file toml-selectors/anymoe.toml
To use the demo LoRA expert:
mistralrs from-config --file toml-selectors/anymoe_lora.toml
Python example
from mistralrs import (
Runner,
Which,
ChatCompletionRequest,
Architecture,
AnyMoeConfig,
AnyMoeExpertType,
)
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
anymoe_config=AnyMoeConfig(
hidden_size=4096,
dataset_json="examples/amoe.json",
prefix="model.layers",
mlp="mlp",
expert_type=AnyMoeExpertType.FineTuned(),
lr=1e-3,
epochs=100,
batch_size=4,
model_ids=["HuggingFaceH4/zephyr-7b-beta"],
),
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
AnyMoeConfig, AnyMoeExpertType, AnyMoeModelBuilder, IsqType, PagedAttentionMetaBuilder,
TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let text_builder = TextModelBuilder::new("mistralai/Mistral-7B-Instruct-v0.1")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?;
let model = AnyMoeModelBuilder::from_text_builder(
text_builder,
AnyMoeConfig {
hidden_size: 4096,
lr: 1e-3,
epochs: 100,
batch_size: 4,
expert_type: AnyMoeExpertType::LoraAdapter {
rank: 64,
alpha: 16.,
target_modules: vec!["gate_proj".to_string()],
},
gate_model_id: None, // Set this to Some("path/to/model/id") for the pretrained gating model id
training: true,
loss_csv_path: None,
},
"model.layers",
"mlp",
"examples/amoe.json",
vec!["HuggingFaceH4/zephyr-7b-beta"],
vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
)
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Matformer (Matryoshka Transformer) Support
Matformer allows you to dynamically resize transformer models at runtime, trading compute/memory for quality. This enables deploying the same model across devices with different resource constraints - from edge devices to powerful GPUs.
Quick Start
Command Line
# Run Gemma 3n with the E2.49B configuration (2.49B params instead of 3.98B)
mistralrs run -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"
Python
from mistralrs import Runner, Which, VisionArchitecture
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
matformer_config_path="matformer_configs/gemma3n.csv",
matformer_slice_name="Config for E2.49B (block-level)",
),
)
Rust
#![allow(unused)]
fn main() {
use mistralrs::VisionModelBuilder;
use std::path::PathBuf;
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
.with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
.build()
.await?;
}
How It Works
Matformer models are pre-trained with a special architecture that allows certain layers to be skipped at inference time while maintaining reasonable quality. When you select a “slice”:
- Layer Skipping: Specified layers are completely removed from computation
- FFN Resizing: Feed-forward network dimensions can be adjusted per layer
- Automatic Remapping: Remaining layers are renumbered sequentially
For example, the Gemma 3n E2.49B (block-level) slice:
- Keeps all 35 layers (no layer skipping)
- Uses mixed FFN dimensions: 8192 for layers 0-19, 16384 for layers 20-24, 8192 for layers 25-34
- Cuts parameters from 3.98B to 2.49B (~37% reduction)
- Maintains ~87% of the full model’s quality
Configuration Files
Matformer configurations are CSV files with these columns:
name,# Layers,# Effective Params (B),MMLU PT accuracy,FFN Hidden Dims,Layers Skipped
Main model,35,3.98,62.30%,"[16384, 16384, ...]",
Config for E2.49B (block-level),35,2.49,54.50%,"[8192, 8192, ..., 16384, 16384, ..., 8192, 8192, ...]",
- name: Slice identifier used in matformer_slice_name
- # Layers: Number of active layers after skipping
- # Effective Params (B): Approximate parameter count in billions
- MMLU PT accuracy: Benchmark score (informational)
- FFN Hidden Dims: List of FFN dimensions for each layer
- Layers Skipped: Which layers to remove (0-indexed)
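To see how such a file can be consumed, here is a minimal sketch that reads a configuration CSV and prints one slice. The path and slice name mirror the examples above; a real file must contain the full FFN dimension lists rather than "...":
import ast
import csv

slice_name = "Config for E2.49B (block-level)"
with open("matformer_configs/gemma3n.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["name"] != slice_name:
            continue
        ffn_dims = ast.literal_eval(row["FFN Hidden Dims"])
        skipped = ast.literal_eval(row["Layers Skipped"]) if row["Layers Skipped"].strip() else []
        print(f"{row['# Layers']} layers, {row['# Effective Params (B)']}B params, "
              f"{len(ffn_dims)} FFN dims, skipped layers: {skipped}")
        break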
Supported Models
Currently supported:
- Gemma 3n (google/gemma-3n-E4B-it) - Multimodal model with vision and audio
See matformer_configs/ for available configurations.
Performance Guide
Memory Usage
Memory scales approximately with parameter count:
- Full model (3.98B): ~8GB VRAM
- E2.49B slice: ~5GB VRAM
- E2B slice (1.91B): ~4GB VRAM
- Smaller slices: Proportionally less
Inference Speed
Speed improvement is roughly linear with layer count:
- 30 layers vs 35 layers = ~14% faster
- 20 layers vs 35 layers = ~43% faster
Quality Trade-offs
Example accuracy on MMLU benchmark:
- Full model: 62.3%
- E2.98B: 59.5% (-4.5%)
- E2.49B: 54.5% (-12.5%)
- E2B: 50.9% (-18.3%)
Choose based on your requirements:
- Maximum quality: Use full model (omit matformer args)
- Balanced: E2.49B to E2.98B configurations (block-level configs recommended)
- Resource-constrained: E2B configuration (1.91B params)
- Extreme efficiency: E1.96B configuration
Advanced Usage
With Quantization
Combine Matformer with ISQ for maximum efficiency:
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3n-E4B-it",
arch=VisionArchitecture.Gemma3n,
matformer_config_path="matformer_configs/gemma3n.csv",
matformer_slice_name="Config for E2.49B (block-level)",
),
in_situ_quant="Q4K" # 4-bit quantization
)
With Device Mapping
Matformer works seamlessly with automatic device mapping:
#![allow(unused)]
fn main() {
use mistralrs::{VisionModelBuilder, DeviceMapSetting, AutoDeviceMapParams};
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
.with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
.with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
.with_device_mapping(DeviceMapSetting::Auto(
AutoDeviceMapParams::default_vision()
))
.build()
.await?;
}
Only active layers are loaded to GPU, saving memory.
Creating Custom Configurations
To create your own Matformer configuration:
- Start with the full model as baseline
- Identify skippable layers:
- Middle layers (10-30) are often good candidates
- Avoid early layers (feature extraction) and late layers (final representations)
- Never skip special layers (KV-sharing, attention patterns)
- Test quality degradation at each configuration
- Create CSV file with your configurations
Example minimal configuration:
name,# Layers,# Effective Params (B),FFN Hidden Dims,Layers Skipped
Tiny,15,0.8,"[4096, 4096, ...]","[5,6,7,10,11,12,15,16,17,20,21,22,25,26,27,30,31,32,33,34]"
API Reference
Command Line Arguments
- --matformer-config-path PATH: Path to the CSV configuration file
- --matformer-slice-name NAME: Exact name of the slice from the CSV
Python Parameters
Which.VisionPlain(
model_id: str,
arch: VisionArchitecture,
matformer_config_path: str = None, # Path to CSV
matformer_slice_name: str = None, # Slice name
# ... other parameters
)
Rust Methods
#![allow(unused)]
fn main() {
// For VisionModelBuilder
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)
// For TextModelBuilder (when supported)
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)
}
Troubleshooting
Common Issues
“Matformer slice ‘X’ not found”
- Check slice name matches exactly (case-sensitive)
- Verify CSV file path is correct
“Layers X and Y are reserved and cannot be skipped”
- Some models have special layers that must not be skipped
- Try different layer combinations
Memory not reduced as expected
- Ensure you’re using the slice (check logs)
- Skipped layers still need to be loaded initially
- Consider combining with quantization
Debugging
Enable logging to see Matformer details:
RUST_LOG=mistralrs_core=info mistralrs ...
This shows:
- Configuration file loaded
- Selected slice details
- Layers being skipped
- Final layer count
Future Plans
- Support for more model architectures
- Dynamic slice switching during runtime
- Automatic slice selection based on available resources
- Fine-tuning tools for creating new Matformer models
Device mapping
In mistral.rs, device mapping is automatically managed to be as performant and easy as possible. Automatic device mapping is enabled by default in the CLI/server and Python SDK and does not make any changes when the model fits entirely on the GPU.
Note
If your system has more than one CUDA device, mistral.rs will automatically use tensor parallelism. If the model does not completely fit on the available GPUs, or you wish to use automatic device mapping, you can disable tensor parallelism by setting
MISTRALRS_NO_NCCL=1.
Automatic device mapping works by prioritizing loading the model into GPU memory; any remaining parts are loaded into CPU memory. Components that benefit greatly from GPU acceleration, such as the vision parts of vision models, are automatically prioritized to stay on the GPU.
To control the mapping across devices, you can set the following maximums describing the largest prompt the model should expect.
- maximum sequence length (default: 4096)
- maximum batch size (default: 1)
- (vision models) maximum image length (length refers to the edge length) (default: 1024)
- (vision models) maximum number of images (default: 1)
These parameters are not hard limits at runtime; they only control the mapping.
Note
The maximum sequence length is also used to ensure that a KV cache will fit, both with and without PagedAttention.
Examples
- Python
- Text models text_auto_device_map.py
- Vision models vision_auto_device_map.py
- Rust
- Text models text_auto_device_map/main.rs
- Vision models vision_auto_device_map/main.rs
- Server
- Text models:
mistralrs run --isq 4 -m meta-llama/Llama-3.3-70B-Instruct --max-seq-len 4096 --max-batch-size 2
- Vision models:
mistralrs run --isq 4 -m meta-llama/Llama-3.2-11B-Vision-Instruct --max-seq-len 4096 --max-batch-size 2 --max-num-images 2 --max-image-length 1024
If you want to manually device map the model (not recommended), please continue reading.
Note
Manual device mapping is deprecated in favor of automatic device mapping because manual configuration is error-prone.
Manual device mapping
There are 2 ways to do device mapping:
- Specify the number of layers to put on the GPU - this uses the GPU with ordinal 0.
- Specify the ordinals and number of layers - this allows for cross-GPU device mapping.
The format for the ordinals and number of layers is ORD:NUM;... where ORD is the unique ordinal and NUM is the number of layers for that GPU. This may be repeated as many times as necessary.
Note: We refer to GPU layers as “device layers” throughout mistral.rs.
Example of specifying ordinals
mistralrs run -n "0:16;1:16" -m gradientai/Llama-3-8B-Instruct-262k
Note: In the Python SDK, the “0:16;1:16” string is passed as the list
["0:16", "1:16"].
Example of specifying the number of GPU layers
mistralrs run -n 16 -m gradientai/Llama-3-8B-Instruct-262k
PagedAttention in mistral.rs
Mistral.rs supports PagedAttention (paper here) to accelerate both normal inference and batched inference on:
- CUDA (Unix-like platforms such as WSL, Linux)
- Metal
Our PagedAttention implementation has 2 inputs: GPU KV cache memory size, and block size. This enables fine-grained control over the available context length by configuring the available memory for the KV cache. When using a CUDA device, PagedAttention is activated by default but can be disabled with no_paged_attn for Python or no-paged-attn for the CLI tools.
KV Cache Quantization
PagedAttention now supports KV cache quantization to reduce memory usage and potentially improve performance. The KV cache can be quantized to FP8 (F8E4M3 format) instead of using the model’s native dtype, significantly reducing memory requirements while maintaining model quality.
Available cache types:
- auto (default): Uses the model's native dtype for the KV cache
- f8e4m3: Quantizes the KV cache to 8-bit floating point (E4M3 format)
When using FP8 quantization, the memory usage for KV cache is approximately halved compared to FP16, allowing for longer context lengths with the same GPU memory allocation.
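As a rough back-of-the-envelope sketch (the model shape below, 32 layers with 8 KV heads of dimension 128, is illustrative only):
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x accounts for the separate key and value caches.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 32_768
fp16 = kv_cache_bytes(seq_len, bytes_per_elem=2)  # model-native f16 cache
fp8 = kv_cache_bytes(seq_len, bytes_per_elem=1)   # f8e4m3 cache
print(f"FP16: {fp16 / 1e9:.1f} GB, FP8: {fp8 / 1e9:.1f} GB")  # ~4.3 GB vs ~2.1 GB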
Note: The default block size if not specified is 32.
Note: if OOM occurs (this can be caused by a variety of factors including adapter activation, re-ISQ, and others), it is likely because the PagedAttention KV cache has already been allocated. To counter this, either set the KV cache memory to a lower amount or usage percentage (recommended) or disable paged attention entirely for a dynamically allocated cache.
Note: Paged Attention is not enabled on Windows platforms, only Unix-based platforms.
Note: In the CLI and Python SDK, Paged Attention is disabled by default for Metal. It can be enabled with the
--paged-attn/paged_attn flags.
There are more features being added to this:
- GGML model support
- Adapter model support
- Speculative decoding
Prefix caching is now supported with PagedAttention. PagedAttention can leverage the prefix cacher to cache KV prefix states across iterations for faster multi-turn inference.
Block-Level Prefix Caching
Prefix caching is a technique to reuse computed KV cache blocks across requests that share common prefixes (like system prompts). This can significantly speed up inference when multiple requests use the same prefix.
How It Works
-
Block Hashing: Each block of tokens is assigned a unique hash based on its contents and the hash of its parent block:
hash(block) = hash(parent_hash, block_tokens)
This creates a hash chain that uniquely identifies any prefix sequence (see the sketch after this list).
-
Cache Lookup: When allocating blocks for a new request, the scheduler checks if any full blocks match existing cached blocks by comparing hashes.
-
Block Reuse: Matched blocks are reused directly - their pre-computed KV cache values are used without recomputation. Only the non-matching suffix tokens need to be processed.
-
LRU Eviction: When memory is needed, least recently used cached blocks are evicted first.
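The sketch below illustrates the hash-chain idea in Python. It is illustrative only, not the actual mistral.rs hashing scheme, and the block size is an example value:
from hashlib import sha256

BLOCK_SIZE = 32  # tokens per block (example value)

def block_hashes(token_ids):
    """Hash each full block, chaining in the parent block's hash."""
    hashes, parent = [], ""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = sha256((parent + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(parent)
    return hashes  # partial trailing blocks are not hashed or cached

# Two requests sharing the same system prompt produce identical leading hashes,
# so the scheduler can match and reuse those cached blocks.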
Benefits
- Multi-turn conversations: System prompts and conversation history are cached and reused
- Batched requests: Multiple requests with shared prefixes (e.g., same system prompt) benefit from caching
- Reduced TTFT: Time-to-first-token is reduced by skipping prefix computation
How It’s Enabled
Prefix caching is enabled by default when using PagedAttention and controlled by the same prefix_cache_n setting that controls the sequence-level prefix cacher:
- CLI: --prefix-cache-n <N> (default 16). Set to 0 to disable prefix caching.
- Python SDK: prefix_cache_n=<N> (default 16). Set to None or 0 to disable.
- Rust SDK: .with_prefix_cache_n(Some(N)) (default 16). Pass None to disable.
Important: The two prefix caching systems are mutually exclusive:
- PagedAttention uses block-level prefix caching (handled by PrefixCacher in BlockEngine)
- Non-PagedAttention uses sequence-level prefix caching (handled by PrefixCacheManagerV2)
The prefix_cache_n setting controls both systems, but only one is active depending on whether PagedAttention is enabled. You’ll see one of these log messages at startup indicating which system is active:
- Prefix caching enabled (block-level, PagedAttention).
- Prefix caching enabled (sequence-level, non-paged attention).
Implementation Details
The prefix cache operates at the block level (not token level) for efficiency:
-
Full blocks only: Only complete blocks (block_size tokens) are cached. Partial blocks at the end of a sequence are not cached.
-
Hash chain: The hash for each block depends on all preceding blocks, ensuring the entire prefix matches.
-
Copy-on-Write: Cached blocks use reference counting. When a cached block needs modification, it’s copied first (CoW).
-
Memory management: The cache uses LRU eviction when allocating new blocks. Evicted blocks are returned to the free pool.
Performance Considerations
- Block size affects cache granularity: larger blocks = fewer cache entries but coarser matching
- Cache hit rate improves with more repeated prefixes
- Memory overhead is minimal (just hash-to-block mappings)
Supported models:
- Normal models
- GGUF models
- Vision models
Note: Prefix caching is supported when using PagedAttention. Configure the number of sequences to cache on the device with:
- CLI: --prefix-cache-n <N> (default 16)
- Python SDK: prefix_cache_n=<N> (default 16)
- Rust SDK: .with_prefix_cache_n(Some(N)) (default 16)
FlashAttention V2/V3 + PagedAttention in mistral.rs
If mistral.rs is compiled with FlashAttention and PagedAttention is enabled, then FlashAttention will be used in tandem to accelerate the prefill phase.
Using the CLI
Add the --pa-memory-mb/--pa-memory-fraction and --pa-block-size parameters before the model kind selector. The GPU memory is given in MB (or as a fraction of total memory), and the block size is the number of tokens per block. These parameters may be passed on any supported model type.
To enable KV cache quantization, use the --pa-cache-type parameter with either auto (default) or f8e4m3.
mistralrs run --pa-memory-mb 8192 --pa-block-size 32 --isq 4 -m microsoft/Phi-3-mini-128k-instruct
mistralrs run --pa-memory-fraction 0.95 --pa-block-size 32 --format gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Example with FP8 KV cache quantization:
mistralrs run --paged-attn on --pa-memory-mb 4096 --pa-block-size 32 --pa-cache-type f8e4m3 -m microsoft/Phi-3-mini-128k-instruct
Using the Rust SDK
You can find this example here.
use anyhow::Result;
use mistralrs::{
IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| {
PagedAttentionMetaBuilder::default()
.with_block_size(32)
.with_gpu_memory(MemoryGpuConfig::ContextSize(1024))
.build()
})?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Example with FP8 KV cache quantization:
use anyhow::Result;
use mistralrs::{
IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, PagedCacheType,
TextMessageRole, TextMessages, TextModelBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
.with_isq(IsqType::Q8_0)
.with_logging()
.with_paged_attn(|| {
PagedAttentionMetaBuilder::default()
.with_block_size(32)
.with_gpu_memory(MemoryGpuConfig::ContextSize(1024))
.with_cache_type(PagedCacheType::F8E4M3)
.build()
})?
.build()
.await?;
// ... rest of the code remains the same
}
Using the Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
pa_gpu_mem = 4096,
pa_blk_size = 32,
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Example with FP8 KV cache quantization:
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture, PagedCacheType
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
pa_gpu_mem = 4096,
pa_blk_size = 32,
pa_cache_type = PagedCacheType.F8E4M3,
)
# ... rest of the code remains the same
Speculative Decoding
Speculative decoding is an inference acceleration technique that uses a smaller “draft” model to propose tokens, which are then validated in parallel by the larger “target” model. This can significantly speed up generation when the draft model frequently predicts tokens the target model would also choose.
Mistral.rs implements speculative decoding based on the paper: Fast Inference from Transformers via Speculative Decoding.
How It Works
- The draft model generates gamma candidate tokens autoregressively
- The target model evaluates all candidate tokens in a single forward pass
- Using rejection sampling, tokens are accepted or rejected:
  - Accept if the target model's probability >= the draft model's probability
  - Otherwise, accept with probability p_target(x) / p_draft(x)
  - If rejected, sample from the normalized difference distribution
This approach guarantees the same output distribution as running the target model alone, while often achieving significant speedups.
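A minimal sketch of the accept/reject rule for a single drafted token (illustrative Python, not the engine's implementation):
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """p_target and p_draft are the two models' probabilities for the drafted token."""
    if p_target >= p_draft:
        return True
    # Otherwise accept with probability p_target / p_draft.
    return random.random() < p_target / p_draft

# On rejection, a replacement token is sampled from the normalized
# max(0, p_target(x) - p_draft(x)) distribution and speculation resumes from there.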
Configuration
The key parameter is gamma - the number of draft tokens to generate per speculation step. Higher values can increase throughput when the draft model is accurate, but waste computation when predictions are frequently rejected.
Recommended values: Start with gamma = 12-32 and tune based on your models and workload.
Requirements
- Same tokenizer: Both target and draft models must share the same tokenizer vocabulary
- Same model category: Both must be the same type (e.g., both text models or both vision models)
- KV cache enabled: Both models must have KV caching enabled (default behavior)
Limitations
Note: PagedAttention is not currently supported with speculative decoding.
Note: Prefix caching is not supported with speculative decoding.
Note: Hybrid KV caches are not supported with speculative decoding.
Using TOML Configuration
The recommended way to configure speculative decoding is via TOML. Create a config file (e.g., speculative.toml):
[model]
model_id = "meta-llama/Llama-3.1-8B-Instruct"
[speculative]
gamma = 12
[speculative.draft_model]
model_id = "meta-llama/Llama-3.2-1B-Instruct"
Then run with:
mistralrs run --from-toml speculative.toml
The draft model can use any supported format (Plain, GGUF, etc.) and can have different quantization than the target model.
TOML with GGUF Draft Model
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
[speculative]
gamma = 16
[speculative.draft_model]
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
model_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
TOML with ISQ Quantization
[model]
model_id = "meta-llama/Llama-3.1-8B-Instruct"
[speculative]
gamma = 16
[speculative.draft_model]
model_id = "meta-llama/Llama-3.2-1B-Instruct"
isq = "Q8_0"
Using the Python SDK
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture
runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
arch=Architecture.Mistral,
),
which_draft=Which.GGUF(
tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
),
speculative_gamma=32,
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
Python SDK Parameters
| Parameter | Type | Description |
|---|---|---|
| which_draft | Which | Draft model specification (Plain, GGUF, etc.) |
| speculative_gamma | int | Number of draft tokens per step (default: 32) |
Using the Rust SDK
You can find this example at mistralrs/examples/speculative/main.rs.
use anyhow::Result;
use mistralrs::{
IsqType, RequestBuilder, SpeculativeConfig, TextMessageRole, TextMessages,
TextModelBuilder, TextSpeculativeBuilder,
};
#[tokio::main]
async fn main() -> Result<()> {
let target = TextModelBuilder::new("meta-llama/Llama-3.1-8B-Instruct")
.with_logging();
let draft = TextModelBuilder::new("meta-llama/Llama-3.2-1B-Instruct")
.with_logging()
.with_isq(IsqType::Q8_0);
let spec_cfg = SpeculativeConfig { gamma: 16 };
let model = TextSpeculativeBuilder::new(target, draft, spec_cfg)?
.build()
.await?;
let messages = TextMessages::new()
.add_message(
TextMessageRole::System,
"You are an AI agent with a specialty in programming.",
)
.add_message(
TextMessageRole::User,
"Hello! How are you? Please write generic binary search function in Rust.",
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Choosing Draft and Target Models
For best performance:
- Use the same model family - Draft models from the same family as the target (e.g., Llama 3.2-1B with Llama 3.1-8B) typically have higher acceptance rates
- Smaller is better for draft - The draft model should be significantly smaller than the target for meaningful speedup
- Quantize the draft model - Using ISQ or GGUF quantization on the draft model reduces memory and improves draft generation speed
- Tune gamma - Monitor acceptance rates and adjust gamma accordingly
Example Model Pairings
| Target Model | Draft Model | Notes |
|---|---|---|
| Llama 3.1-8B | Llama 3.2-1B | Same family, good acceptance |
| Llama 3.1-70B | Llama 3.1-8B | Large speedup potential |
| Mistral-7B | Mistral-7B (Q4_K_M GGUF) | Same model, quantized draft |
Performance Considerations
- Acceptance rate: Higher acceptance rates lead to better speedups. Monitor your logs for rejection statistics.
- Draft model overhead: If the draft model is too large relative to the target, the overhead may negate speedup benefits.
- Batch size: Speculative decoding is most beneficial for single-request scenarios. For high-throughput batch inference, standard decoding may be more efficient.
- Memory usage: Both models must fit in memory simultaneously. Consider quantizing one or both models.
Combining with Other Features
Speculative decoding can be combined with:
- ISQ quantization - Quantize target, draft, or both models
- X-LoRA adapters - Use adapters on the target model
- Device mapping - Distribute models across multiple GPUs
See examples/python/speculative_xlora.py for an example combining speculative decoding with X-LoRA.
FlashAttention in mistral.rs
Mistral.rs supports FlashAttention V2 and V3 on CUDA devices (V3 is only supported when CC >= 9.0).
Note: If compiled with FlashAttention and PagedAttention is enabled, then FlashAttention will be used in tandem to accelerate the prefill phase.
GPU Architecture Compatibility
| Architecture | Compute Capability | Example GPUs | Feature Flag |
|---|---|---|---|
| Ampere | 8.0, 8.6 | RTX 30*, A100, A40 | --features flash-attn |
| Ada Lovelace | 8.9 | RTX 40*, L40S | --features flash-attn |
| Hopper | 9.0 | H100, H800 | --features flash-attn-v3 |
| Blackwell | 10.0, 12.0 | RTX 50* | --features flash-attn |
Note: FlashAttention V2 and V3 are mutually exclusive.
Note: To use FlashAttention in the Python SDK, compile from source.
Multi-head Latent Attention (MLA) in mistral.rs
Multi-head Latent Attention (MLA) is an efficient attention mechanism that reduces KV cache memory usage by compressing key-value states into a low-rank latent space. This technique was introduced in DeepSeek V2 and is also used in DeepSeek V3 and GLM-4.7-Flash models.
How It Works
MLA compresses the key-value cache by:
- Projecting KV states into a compact latent representation (
kv_lora_rankdimensions) - Storing only the compressed latent vectors and rotary position embeddings in the KV cache
- Reconstructing full KV states on-the-fly during attention computation
This results in significant memory savings compared to standard multi-head attention, enabling longer context lengths with the same GPU memory.
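A rough sketch of the per-token, per-layer cache footprint (the shapes are illustrative, loosely following the DeepSeek V3 dimensions listed below):
# Standard multi-head attention caches full K and V states per head;
# MLA caches only the compressed latent plus the rotary (RoPE) component.
n_heads, head_dim = 128, 128          # illustrative full-attention shape
kv_lora_rank, kpe_head_dim = 512, 64  # illustrative MLA latent shape

standard_elems = 2 * n_heads * head_dim   # K and V
mla_elems = kv_lora_rank + kpe_head_dim   # compressed latent + RoPE part
print(f"standard: {standard_elems}, MLA: {mla_elems} "
      f"(~{standard_elems / mla_elems:.0f}x fewer cached elements)")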
Supported Models
MLA is automatically enabled for the following model architectures when using PagedAttention on CUDA:
| Model | Architecture | MLA Dimensions |
|---|---|---|
| DeepSeek V2 | deepseekv2 | kv_lora_rank varies |
| DeepSeek V3 | deepseekv3 | kv_lora_rank=512, kpe_head_dim=64 |
| GLM-4.7-Flash | glm4moelite | kv_lora_rank=512, kpe_head_dim=64 |
Requirements
MLA decode optimization requires:
- CUDA on Unix-like platforms (Linux, WSL)
- PagedAttention enabled
- Compatible model architecture (see table above)
When these conditions are met, MLA is automatically used during the decode phase for optimal performance.
Performance Benefits
MLA provides two key optimizations:
-
Reduced KV Cache Memory: The compressed latent representation uses significantly less memory than full key-value states, allowing for:
- Longer context lengths
- Larger batch sizes
- More efficient memory utilization
-
Optimized Decode Kernels: Custom FlashInfer-based MLA kernels accelerate single-token generation by:
- Operating directly on compressed latent states
- Avoiding repeated KV decompression
- Leveraging efficient memory access patterns
Disabling MLA
If you encounter issues or want to compare performance, you can disable MLA by setting the environment variable:
MISTRALRS_NO_MLA=1 mistralrs ...
When disabled, the model falls back to standard PagedAttention with full KV cache storage.
Technical Details
KV Cache Layout
When MLA is enabled, PagedAttention uses a specialized cache layout:
- Key cache: Stores compressed latent vectors (
kv_lora_rankdimensions) + rotary position embeddings (kpe_head_dimdimensions) - Value cache: Shares the same block structure for efficient memory management
Decode Path
During single-token generation (decode phase):
- Query is projected to latent space
- Attention is computed directly on compressed KV states using FlashInfer MLA kernels
- Output is projected back from latent space
Prefill Path
During prompt processing (prefill phase):
- Full KV states are computed for the current chunk
- Compressed latents are stored in the PagedAttention cache
- For prefix-cached sequences, latents are retrieved and decompressed as needed
See Also
- PagedAttention - Required for MLA optimization
- FlashAttention - Accelerates prefill phase
- DeepSeek V2 - Model documentation
- DeepSeek V3 - Model documentation
- GLM-4.7-Flash - Model documentation
Distributed inference in mistral.rs
Mistral.rs supports distributed inference with a few strategies:
- NCCL (recommended for CUDA)
- Ring backend (supported on all devices)
What backend is best?
- For a CUDA-only system: NCCL
- Anything else: Ring backend
The Ring backend is also heterogeneous! This means that you can use the Ring backend across any set of devices connected over TCP. For example, you can connect 2 Metal systems, or 2 Metal systems and 1 CPU system, with the Ring backend!
NCCL in mistral.rs
Mistral.rs supports distributed inference on CUDA with Tensor Parallelism via NCCL.
Note: Multi-node support is coming! Distributed inference on Apple hardware is also being investigated.
Tensor Parallelism (TP) is automatically used to accelerate distributed inference when more than one CUDA GPU is detected. The tensor-parallel size is always automatically set to the total number of GPUs.
TP splits the model into shards and benefits from fast single-node interconnects like NVLink. Both normal and vision models support tensor parallelism.
Important: The world size (total number of GPUs) must be a power of 2 (e.g., 1, 2, 4, 8, 16, 32, etc.). This is a requirement for optimal performance and correct operation of the distributed algorithms.
Note: In mistral.rs, if NCCL is enabled, then automatic device mapping will not be used.
Important: To build for NCCL, be sure to add the nccl feature flag (for example: --features nccl,cuda).
See the following environment variables:
| Name | Function | Usage |
|---|---|---|
| MISTRALRS_NO_NCCL=1 | Disable TP and NCCL | If the model does not fit on the available CUDA devices, disabling NCCL will re-enable automatic device mapping |
Single-Node Support
Set the number of ranks using MISTRALRS_MN_LOCAL_WORLD_SIZE, e.g.,
MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -p 8000 -m Qwen/Qwen3-30B-A3B-Instruct-2507
If no MISTRALRS_MN_LOCAL_WORLD_SIZE environment variable is given, mistral.rs will split the model across all available devices.
Multi-node support
# Head node:
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_HEAD_NUM_WORKERS=1 MISTRALRS_MN_HEAD_PORT=<PORT> mistralrs run -m ...
# For the worker nodes:
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=0 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=1 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=2 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
Multi-node support in mistral.rs divides the nodes into two groups: a “head” node, and multiple “worker” nodes. Head node choice is arbitrary. For example, if a system has 8 nodes, there will be 1 “head” node, and 7 “worker” nodes.
To enable multi-node, set the MISTRALRS_MN_GLOBAL_WORLD_SIZE=<number> environment variable to the total number of GPUs across all nodes, including both the “head” node and all “worker” nodes. Note: this number must be a power of 2.
It is recommended to use server mode with mistral.rs when in multi-node. Currently, you must send requests to every node!
The following environment variables must be set for each node:
Head node:
| Name | Function | Usage |
|---|---|---|
| MISTRALRS_MN_HEAD_NUM_WORKERS=<number> | The number of worker nodes which will be connected. | This should be the number of nodes in the system, minus 1 for the head node. |
| MISTRALRS_MN_HEAD_PORT=<PORT> | The port on which to communicate with the worker nodes. | Worker nodes will connect to this port via TCP sockets. |
Worker node:
| Name | Function | Usage |
|---|---|---|
| MISTRALRS_MN_WORKER_ID=<number> | The 0-indexed worker ID for this worker node. | If there are 4 nodes (1 head, 3 workers), then the worker IDs will be 0, 1, and 2. |
| MISTRALRS_MN_WORKER_SERVER_ADDR=<ADDR>:<PORT> | The IP address and port to connect to the server. | This is used to establish communication with the head node. |
Ring backend in mistral.rs
Mistral.rs provides a TCP-based ring backend for distributed tensor-parallel inference. This backend is enabled by compiling with the ring feature and implements collective operations over a ring topology using TCP sockets.
Prerequisites
- Build with the ring feature enabled, in addition to any others: cargo build --release --features ring
- Ensure the specified TCP ports are open and reachable between processes.
- The world_size must be a power of 2 (2, 4, 8, 16, etc.) for correct operation.
Configuration
Create one JSON configuration file per process with the following fields:
| Field | Type | Description |
|---|---|---|
| master_ip | string | Optional. IP address for the master node. |
| master_port | integer | Optional. Port for the master node. |
| port | integer | Local port to bind for incoming connections from the left neighbor. |
| right_port | integer | Port on which the right neighbor is listening (used to connect outgoing to the right). |
| right_ip | string | Optional. IP address of the right neighbor (defaults to 0.0.0.0). |
| rank | integer | Rank of this process in [0..world_size). |
| world_size | integer | Total number of processes in the ring. Must be a power of 2 (e.g., 2, 4, 8, 16, etc.). |
This address and port should form a ring topology for each of the nodes. For example, the last node should point to the first node as its right neighbor.
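For convenience, the per-rank configuration files can be generated with a small script. The sketch below assumes a single host and example ports; add right_ip and the optional master fields as needed:
import json

world_size = 4      # must be a power of 2
base_port = 12345   # example port range

for rank in range(world_size):
    cfg = {
        "port": base_port + rank,                           # where this rank listens
        "right_port": base_port + (rank + 1) % world_size,  # right neighbor's port
        "rank": rank,
        "world_size": world_size,
    }
    with open(f"ring_{rank}.json", "w") as f:
        json.dump(cfg, f, indent=2)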
Although all processes participate in collective communication, Rank 0 acts as the master node. For example, interactive mode or the server runs on Rank 0, while other ranks act as background workers.
Example ring topology:
+---------+ +---------+
| Rank 0 | -----> | Rank 1 |
| IP: A | | IP: B |
| Port: X | | Port: Y |
+----+----+ +----+----+
^ |
| v
+----+----+ +----+----+
| Rank 3 | <----- | Rank 2 |
| IP: D | | IP: C |
| Port: W | | Port: Z |
+---------+ +---------+
Each node connects to its right neighbor by IP and port, and the last node wraps around to the first.
Example for two processes:
-
{ "master_ip": "0.0.0.0", "master_port": 1234, "port": 12345, "right_port": 12346, "rank": 0, "world_size": 2 } -
{ "master_ip": "0.0.0.0", "master_port": 1234, "port": 12346, "right_port": 12345, "rank": 1, "world_size": 2 }
Multi-Machine Example
To run on different machines, update the right_ip field in each config to the actual IP address of the neighbor process. For example, if you have two machines with IPs 192.168.1.10 and 192.168.1.11:
-
ring_0.jsonon Machine A (192.168.1.10):{ "port": 12345, "right_port": 12346, "right_ip": "192.168.1.11", "rank": 0, "world_size": 2 } -
ring_1.jsonon Machine B (192.168.1.11):{ "port": 12346, "right_port": 12345, "right_ip": "192.168.1.10", "rank": 1, "world_size": 2 }
Make sure that the specified ports are open and that each machine can reach the other via TCP on those ports.
Usage
Set the RING_CONFIG environment variable to point to the JSON file for each process, then run your application built with the ring feature:
# Process 0 or computer 0
export RING_CONFIG=path/to/ring_0.json
cargo run --release --features ring -- ...
# Process 1 or computer 1
export RING_CONFIG=path/to/ring_1.json
cargo run --release --features ring -- ...
The ring backend will automatically handle collective communication for tensor-parallel inference.
Tool calling
Tool calling makes LLMs smarter.
LLMs use tool calling to interact with the outside world. Mistral.rs has OpenAI-compatible support for tool calling in all APIs: HTTP, Python, and Rust.
Note that some models, such as Mistral Small/Nemo models, require a chat template to be specified. For example:
mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
OpenAI docs: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models
We support tool calling for the following models in the OpenAI-compatible format, and we also parse their native tool-calling output:
- Llama 4
- Llama 3.1/3.2/3.3
- Mistral Small (including 3.1 + multimodal)
- Mistral Nemo
- Hermes 2 Pro
- Hermes 3
- DeepSeek V2/V3/R1
- Qwen 3
All models that support tool calling will respond according to the OpenAI tool calling API.
OpenAI compatible HTTP example
Please see our example here.
OpenAI docs: https://platform.openai.com/docs/api-reference/chat/create?lang=curl
Rust example
Please see our example here.
Python example
Please see our notebook here.
Tool callbacks
You can override tool execution using a tool callback. The callback receives the tool name and a dictionary of arguments and must return the tool output as a string.
Python
import json

from mistralrs import Architecture, Runner, Which

def tool_cb(name: str, args: dict) -> str:
    # `local_search` here stands in for your own search implementation.
    if name == "local_search":
        return json.dumps(local_search(args.get("query", "")))
    return ""

runner = Runner(
    which=Which.Plain(model_id="YourModel/ID", arch=Architecture.Llama),
    tool_callback=tool_cb,
)
See custom_search.py for a full
example. In Rust pass .with_tool_callback(...) to the builder as demonstrated
in custom_search/main.rs.
Search callbacks
Web search uses a DuckDuckGo-based callback by default. Provide your own search
function with search_callback in Python or .with_search_callback(...) in
Rust. Each callback should return a list of results with title, description,
url and content fields. See WEB_SEARCH.md for more details
and examples.
Web search tool in mistral.rs
mistral.rs is compatible with OpenAI’s web_search_options parameter! Once enabled, this allows web searching for models.
This works with all models that support tool calling. However, your mileage may vary depending on the specific model. The following models worked well during testing and are recommended:
- Hermes 3 3b/8b
- Mistral 3 24b
- Llama 4 Scout/Maverick
- Qwen 3 (⭐ Recommended!)
Web search is supported both in streaming and completion responses! This makes it easy to integrate and test out in interactive mode!
Besides tool calling and parsing of web content, we also use an embedding model to select the most relevant search results.
You can use the web search tool in all the APIs: Python, Rust, and server.
Selecting a search embedding model
Internally, we now use google/embeddinggemma-300m to embed documents for ranking. You can pick from the built-in reranker variants (currently just embedding_gemma) in every API:
- Rust: with_search(SearchEmbeddingModel::EmbeddingGemma300M) in the builder
- Python: search_embedding_model="embedding_gemma" in the Runner
- Server: the --search-embedding-model embedding_gemma flag
Specifying a custom search callback
By default, mistral.rs uses a DuckDuckGo-based search callback. To override this, you can provide your own search function:
- Rust: use .with_search_callback(...) on the model builder with an Arc<dyn Fn(&SearchFunctionParameters) -> anyhow::Result<Vec<SearchResult>> + Send + Sync>.
- Python: pass the search_callback keyword argument to Runner, which should be a function def search_callback(query: str) -> List[Dict[str, str]] returning a list of results with keys "title", "description", "url", and "content".
Example in Python:
def search_callback(query: str) -> list[dict[str, str]]:
# Implement your custom search logic here, returning a list of result dicts
return [
{
"title": "Example Result",
"description": "An example description",
"url": "https://example.com",
"content": "Full text content of the page",
},
# more results...
]
from mistralrs import Runner, Which, Architecture
runner = Runner(
which=Which.Plain(model_id="YourModel/ID", arch=Architecture.Mistral),
enable_search=True,
search_callback=search_callback,
)
HTTP server
Be sure to add --enable-search!
Here are some examples using various models. Note that this works for both streaming and completion requests, so interactive mode is featured here!
mistralrs run --enable-search --isq 4 -m Qwen/Qwen3-4B
mistralrs serve --enable-search -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
mistralrs run --enable-search --isq 4 -m NousResearch/Hermes-3-Llama-3.1-8B
from openai import OpenAI
client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")
messages = [
{
"role": "user",
"content": "Can you show me some code using mistral.rs for running Llama 3.2 Vision?",
}
]
completion = client.chat.completions.create(
model="default",
messages=messages,
tool_choice="auto",
max_tokens=1024,
web_search_options={},
)
# print(completion.usage)
print(completion.choices[0].message.content)
if completion.choices[0].message.tool_calls is not None:
# Should never happen.
tool_called = completion.choices[0].message.tool_calls[0].function
print(tool_called)
Python SDK
from mistralrs import (
Runner,
Which,
ChatCompletionRequest,
Architecture,
WebSearchOptions,
)
# Define a custom search callback if desired
def my_search_callback(query: str) -> list[dict[str, str]]:
# Fetch or compute search results here
return [
{
"title": "Mistral.rs GitHub",
"description": "Official mistral.rs repository",
"url": "https://github.com/EricLBuehler/mistral.rs",
"content": "mistral.rs is a Rust binding for Mistral models...",
},
]
runner = Runner(
which=Which.Plain(
model_id="NousResearch/Hermes-3-Llama-3.1-8B",
arch=Architecture.Llama,
),
enable_search=True,
search_callback=my_search_callback,
)
res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[
{
"role": "user",
"content": "Can you show me some code using mistral.rs for running Llama 3.2 Vision?",
}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
web_search_options=WebSearchOptions(
search_context_size=None, user_location=None
),
)
)
print(res.choices[0].message.content)
print(res.usage)
Rust SDK
use anyhow::Result;
use mistralrs::{
SearchEmbeddingModel, IsqType, RequestBuilder, TextMessageRole, TextMessages, TextModelBuilder,
WebSearchOptions,
};
#[tokio::main]
async fn main() -> Result<()> {
let model = TextModelBuilder::new("NousResearch/Hermes-3-Llama-3.1-8B")
.with_isq(IsqType::Q4K)
.with_logging()
.with_search(SearchEmbeddingModel::default())
.build()
.await?;
let messages = TextMessages::new().add_message(
TextMessageRole::User,
"What is the weather forecast for Boston?",
);
let messages =
RequestBuilder::from(messages).with_web_search_options(WebSearchOptions::default());
let response = model.send_chat_request(messages).await?;
println!("What is the weather forecast for Boston?\n\n");
println!("{}", response.choices[0].message.content.as_ref().unwrap());
dbg!(
response.usage.avg_prompt_tok_per_sec,
response.usage.avg_compl_tok_per_sec
);
Ok(())
}
Chat templates and tokenizer customization
JINJA chat templates (recommended method)
Some models do not come with support for tool calling or other features, and as such it might be necessary to specify your own chat template.
We provide some chat templates here, and it is easy to modify or create others to customize chat template behavior.
To use this, add the jinja-explicit parameter to the various APIs
mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
Chat template overrides
Mistral.rs attempts to automatically load a chat template from the tokenizer_config.json file. This enables high flexibility across instruction-tuned models and ensures accurate chat templating. However, if the chat_template field is missing, then a JINJA chat template should be provided. The JINJA chat template may use messages, add_generation_prompt, bos_token, eos_token, and unk_token as inputs.
We provide some chat templates here, and it is easy to modify or create others to customize chat template behavior.
For example, to use the chatml template, specify --chat-template before the model architecture:
mistralrs serve -p 1234 --log output.log --chat-template ./chat_templates/chatml.json -m meta-llama/Llama-3.2-3B-Instruct
Note: For GGUF models, the chat template may be loaded directly from the GGUF file by omitting any other chat template sources.
Tokenizer
Some models do not provide a tokenizer.json file although mistral.rs expects one. To solve this, please run this script. It will output the tokenizer.json file for your specific model. This may be used by passing the --tokenizer-json flag after the model architecture. For example:
$ python3 scripts/get_tokenizers_json.py
Enter model ID: microsoft/Orca-2-13b
$ mistralrs serve -p 1234 --log output.log -m microsoft/Orca-2-13b --tokenizer-json tokenizer.json
Putting it all together, to run, for example, an Orca model (which does not come with a tokenizer.json or chat template):
- Generate the
tokenizer.jsonby running the script atscripts/get_tokenizers_json.py. This will output some files includingtokenizer.jsonin the working directory. - Find and copy the correct chat template from
chat-templatesto the working directory (eg.,cp chat_templates/chatml.json .) - Run
mistralrs serve, specifying the tokenizer and chat template:mistralrs serve -p 1234 --log output.txt --chat-template chatml.json -m microsoft/Orca-2-13b -t tokenizer.json
Note: For GGUF models, the tokenizer may be loaded directly from the GGUF file by omitting the tokenizer model ID.
Sampling and penalty techniques in mistral.rs
mistral.rs supports a comprehensive set of sampling and penalty techniques to control text generation. These can be configured via the HTTP API, Python SDK, or Rust SDK.
Temperature
Controls the randomness of token selection. Lower values make output more deterministic, higher values increase creativity and randomness.
- Range: 0.0 to 2.0 (typically 0.0 to 1.0)
- Default: Model-dependent, usually around 0.7
- Effect: At 0.0, always selects the most likely token (greedy). At higher values, sampling becomes more diverse.
Top K
Limits token selection to the K most likely tokens.
- Range: 1 to vocabulary size
- Effect: Lower values restrict choices to only the most probable tokens, reducing randomness.
Top P (Nucleus Sampling)
Limits token selection to the smallest set of tokens whose cumulative probability exceeds P.
- Range: 0.0 to 1.0
- Effect: At 0.1, only tokens comprising the top 10% probability mass are considered. More adaptive than Top K as it adjusts based on the probability distribution.
Min P
Filters out tokens with probability less than min_p * max_probability.
- Range: 0.0 to 1.0
- Effect: Removes low-probability tokens relative to the most likely token. Useful for preventing unlikely tokens from being selected.
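A tiny worked example of the filter (illustrative numbers only):
min_p = 0.05
probs = {"the": 0.50, "a": 0.30, "banana": 0.15, "qwerty": 0.03, "xylophone": 0.02}

threshold = min_p * max(probs.values())  # 0.05 * 0.50 = 0.025
kept = {tok: p for tok, p in probs.items() if p >= threshold}
print(kept)  # "xylophone" (0.02) falls below 0.025 and is removed before sampling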
Stop Sequences
Strings that, when generated, cause generation to stop immediately.
- Type: Array of strings
- Effect: Generation terminates as soon as any stop sequence is produced. Useful for controlling output boundaries.
Repetition Penalty
Applies a multiplicative penalty to tokens that have already appeared in the context.
- Range: Typically 1.0 to 2.0
- Effect: Values > 1.0 make repeated tokens less likely. This is distinct from frequency and presence penalties.
Frequency Penalty
Penalizes tokens based on how many times they’ve appeared in the generated text so far.
- Range: -2.0 to 2.0
- Effect: Positive values reduce repetition proportionally to token frequency. Negative values encourage repetition.
Presence Penalty
Penalizes tokens that have appeared at least once in the generated text.
- Range: -2.0 to 2.0
- Effect: Positive values discourage any repetition (binary penalty). Negative values encourage reusing tokens.
DRY (Don’t Repeat Yourself) Penalty
An advanced anti-repetition technique that detects and penalizes repeated sequences of tokens, not just individual tokens. See the original implementation for details.
DRY Parameters
dry_multiplier: Controls the strength of the penalty. Higher values more strongly discourage repetition.dry_base: Base value for the exponential penalty calculation.dry_allowed_length: Minimum sequence length before the penalty applies. Sequences shorter than this are not penalized.dry_sequence_breakers: Array of tokens (like newlines, punctuation) that reset the sequence tracking. When these tokens appear, the DRY penalty starts fresh.
Example DRY Configuration (HTTP API)
{
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_sequence_breakers": ["\n", ".", "!", "?", ";"]
}
API Usage
All sampling parameters can be set in API requests:
HTTP API
{
"model": "default",
"messages": [...],
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"min_p": 0.05,
"repetition_penalty": 1.1,
"frequency_penalty": 0.5,
"presence_penalty": 0.5,
"stop": ["END", "\n\n"],
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_sequence_breakers": ["\n"]
}
Python SDK
response = runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[...],
temperature=0.7,
top_p=0.9,
top_k=40,
min_p=0.05,
repetition_penalty=1.1,
frequency_penalty=0.5,
presence_penalty=0.5,
stop_seqs=["END", "\n\n"],
dry_multiplier=0.8,
dry_base=1.75,
dry_allowed_length=2,
dry_sequence_breakers=["\n"],
)
)
Please suggest more sampling techniques by raising an issue!
Structured model loading with .toml files
Mistral.rs supports loading models from a .toml file, and the fields are the same as for the CLI. Please find some example toml selectors here.
There are a few cases which add functionality that cannot be found in the CLI.
Speculative decoding
What to specify
Under [speculative]
- Specify the gamma parameter
Under [speculative.draft_model]
- Choose a draft model, just like under [model] (the only requirement is that it shares the target model's tokenizer)
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"
[speculative]
gamma = 32
[speculative.draft_model]
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
quantized_model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
quantized_filename = "mistral-7b-instruct-v0.1.Q2_K.gguf"
mistralrs from-config -f toml-selectors/speculative-gguf.toml
AnyMoE
What to specify
Under [anymoe], required unless specified
- Specify the dataset
- Find and specify the prefix/mlp values
- Go to
https://huggingface.co/<MODEL ID>/tree/main?show_file_info=model.safetensors.index.json - Look for the mlp layers: For example
model.layers.27.mlp.down_proj.weightmeans that the prefix ismodel.layersand the mlp ismlp.
- Go to
- Specify the expert or LoRA adapter model IDs
- (Optional) Specify layers to apply AnyMoE to.
Under [anymoe.config]
- Hidden size, typically found at
https://huggingface.co/<BASE MODEL ID>/blob/main/config.json
(For LoRA experts) Under [anymoe.config.expert_type.lora_adapter]
- Rank
- Alpha
- Target modules
mistralrs from-config -f toml-selectors/anymoe.toml
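The prefix/mlp lookup can also be done programmatically. The sketch below assumes huggingface_hub is installed and uses an example model ID:
import json
from huggingface_hub import hf_hub_download

index_path = hf_hub_download(
    "mistralai/Mistral-7B-Instruct-v0.1", "model.safetensors.index.json"
)
with open(index_path) as f:
    weight_names = json.load(f)["weight_map"].keys()

# e.g. "model.layers.27.mlp.down_proj.weight" -> prefix "model.layers", mlp "mlp"
print(sorted(name for name in weight_names if ".mlp." in name)[:3])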
With fine-tuned experts
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"
[anymoe]
dataset_json = "test.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"
With LoRA adapter experts
[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"
[anymoe]
dataset_json = "test.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[anymoe.config]
hidden_size = 4096
[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]
Multi-Model Support
The mistralrs CLI supports loading and serving multiple models simultaneously, allowing you to switch between different models in the same server instance.
- Each model runs in its own engine thread
- Models can have different configurations (quantization, device layers, etc.)
- Memory usage scales with the number of loaded models
- All models share the same server configuration (port, logging, etc.)
- Interactive mode uses the default model or the first model if no default is set
- You can unload all models (including the last one) - they will auto-reload when accessed
Usage
Single-Model Mode (Default)
# Traditional usage - loads one model
mistralrs serve -p 1234 -m meta-llama/Llama-3.2-3B-Instruct
Multi-Model Mode
# Load multiple models from configuration file
mistralrs from-config --file config.toml
Configuration File Format
Create a JSON file with model configurations as object keys:
{
"llama3-3b": {
"alias": "llama3-3b",
"Plain": {
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
},
"qwen3-4b": {
"alias": "qwen3-4b",
"Plain": {
"model_id": "Qwen/Qwen3-4B"
},
"in_situ_quant": "Q4K"
}
}
Configuration Structure
- Object keys (e.g., "llama3-3b", "qwen3-4b"): Organizational labels (for human readability)
- API identifiers: By default the pipeline name (usually the model_id inside the model spec). You can override this with alias.
- Model specification: The model type and configuration (same format as CLI subcommands)
- Optional fields:
  - alias: Custom model ID (nickname) used in API requests
  - chat_template: Custom chat template
  - jinja_explicit: JINJA template file
  - num_device_layers: Device layer configuration
  - in_situ_quant: In-situ quantization setting
How API identifiers work:
- ✅ Object keys are organizational only (for config readability)
- ✅ If alias is set, it becomes the API model ID
- ✅ Otherwise, the pipeline name (usually the model_id field) is used
- ✅ The canonical pipeline name remains accepted as an alias for compatibility
API Usage
Selecting Models in Requests
Use the model field in your requests to specify which model to use:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3-3b",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Default Model Behavior
- Explicit model: Use the alias if configured (e.g., "llama3-3b"), otherwise the full pipeline name (e.g., "meta-llama/Llama-3.2-3B-Instruct")
- Default model: Use "default" to explicitly request the default model
- Auto-fallback: If the model field is omitted entirely, the default model will be used
# Use default model explicitly
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The default model is either:
- The model specified with --default-model-id when starting the server
- The first model loaded (if no default is explicitly set)
List Available Models
curl http://localhost:1234/v1/models
Returns:
{
"object": "list",
"data": [
{
"id": "default",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "llama3-3b",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "qwen3-4b",
"object": "model",
"created": 1234567890,
"owned_by": "local"
}
]
}
Note: The "default" model is always listed first and represents the server’s default model. If aliases are configured, they will appear in the list while the canonical pipeline names remain accepted.
CLI Arguments
Use the multi-model subcommand with these options:
- --config <PATH> (required): Path to the JSON configuration file
- --default-model-id <ID> (optional): Default model ID for requests that don’t specify a model (alias or pipeline name)
New syntax:
mistralrs from-config --file <CONFIG>
Examples
Example 1: Text Models
{
"llama3-3b": {
"Plain": {
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
},
"qwen3-4b": {
"Plain": {
"model_id": "Qwen/Qwen3-4B"
},
"in_situ_quant": "Q4K"
}
}
Example 2: Mixed Model Types
{
"text-model": {
"Plain": {
"model_id": "meta-llama/Llama-3.2-3B-Instruct"
}
},
"vision-model": {
"VisionPlain": {
"model_id": "google/gemma-3-4b-it"
}
}
}
Example 3: GGUF Models
{
"llama-gguf": {
"GGUF": {
"tok_model_id": "meta-llama/Llama-3.2-3B-Instruct",
"quantized_model_id": "bartowski/Llama-3.2-3B-Instruct-GGUF",
"quantized_filename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
}
}
}
Model Unloading and Reloading
You can dynamically unload models to free memory and reload them on demand. This is useful for managing GPU memory when working with multiple large models.
Unload a Model
Unload a model from memory while preserving its configuration for later reload:
curl -X POST http://localhost:1234/v1/models/unload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "unloaded"
}
Reload a Model
Manually reload a previously unloaded model:
curl -X POST http://localhost:1234/v1/models/reload \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Check Model Status
Get the current status of a specific model:
curl -X POST http://localhost:1234/v1/models/status \
-H "Content-Type: application/json" \
-d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
Response:
{
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"status": "loaded"
}
Possible status values:
- loaded: Model is loaded and ready
- unloaded: Model is unloaded but can be reloaded
- reloading: Model is currently being reloaded
- not_found: Model ID not recognized
- no_loader_config: Model cannot be reloaded (missing loader configuration)
- internal_error: An internal error occurred
Auto-Reload
When a request is sent to an unloaded model, it will automatically reload before processing the request. This enables a “lazy loading” pattern where models are only loaded when needed.
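For programmatic use, the same lazy-loading flow can be driven over the HTTP API. The sketch below is illustrative only; it assumes a server running on localhost:1234 and uses the reqwest (with the json feature), serde_json, tokio, and anyhow crates, which are not part of mistral.rs.
use serde_json::json;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = reqwest::Client::new();
    // Free memory while the model is idle.
    client
        .post("http://localhost:1234/v1/models/unload")
        .json(&json!({ "model_id": "meta-llama/Llama-3.2-3B-Instruct" }))
        .send()
        .await?
        .error_for_status()?;
    // Sending a request to the unloaded model triggers an automatic reload
    // before the completion is processed.
    let resp: serde_json::Value = client
        .post("http://localhost:1234/v1/chat/completions")
        .json(&json!({
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "messages": [{ "role": "user", "content": "Hello!" }]
        }))
        .send()
        .await?
        .json()
        .await?;
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}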
List Models with Status
The /v1/models endpoint now includes status information:
curl http://localhost:1234/v1/models
Response:
{
"object": "list",
"data": [
{
"id": "default",
"object": "model",
"created": 1234567890,
"owned_by": "local"
},
{
"id": "meta-llama/Llama-3.2-3B-Instruct",
"object": "model",
"created": 1234567890,
"owned_by": "local",
"status": "loaded"
},
{
"id": "Qwen/Qwen3-4B",
"object": "model",
"created": 1234567890,
"owned_by": "local",
"status": "unloaded"
}
]
}
Rust SDK Usage
The mistralrs crate provides MultiModelBuilder for loading multiple models and Model methods for multi-model management.
Loading Multiple Models
By default, model IDs are the pipeline names (usually the HuggingFace model path, e.g., "google/gemma-3-4b-it"). You can provide custom aliases with add_model_with_alias for shorter IDs.
use mistralrs::{IsqType, MultiModelBuilder, TextModelBuilder, VisionModelBuilder, TextMessages, TextMessageRole};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Build a multi-model instance with a vision model and a text model
// Use aliases for shorter model IDs in requests
let model = MultiModelBuilder::new()
.add_model_with_alias(
"gemma-vision",
VisionModelBuilder::new("google/gemma-3-4b-it") // Vision model
.with_isq(IsqType::Q4K)
.with_logging(),
)
.add_model_with_alias(
"qwen-text",
TextModelBuilder::new("Qwen/Qwen3-4B") // Text model
.with_isq(IsqType::Q4K),
)
.with_default_model("gemma-vision")
.build()
.await?;
// Send request to default model
let messages = TextMessages::new()
.add_message(TextMessageRole::User, "Hello!");
let response = model.send_chat_request(messages).await?;
// Send request to specific model using its alias
let messages = TextMessages::new()
.add_message(TextMessageRole::User, "Hello from Qwen!");
let response = model.send_chat_request_with_model(messages, Some("qwen-text")).await?;
Ok(())
}
Model Management Methods
#![allow(unused)]
fn main() {
// List all models (returns aliases if configured, otherwise pipeline names)
let models = model.list_models()?;
// Get/set default model
let default = model.get_default_model_id()?;
model.set_default_model_id("qwen-text")?;
// List models with status
let status = model.list_models_with_status()?;
// Returns Vec<(String, ModelStatus)> where ModelStatus is Loaded, Unloaded, or Reloading
// Check if a model is loaded
let is_loaded = model.is_model_loaded("gemma-vision")?;
// Unload a model to free memory
model.unload_model("gemma-vision")?;
// Reload when needed
model.reload_model("gemma-vision").await?;
}
Available _with_model Methods
All request methods have _with_model variants that accept an optional model ID:
- send_chat_request_with_model(request, model_id: Option<&str>)
- stream_chat_request_with_model(request, model_id: Option<&str>)
- generate_image_with_model(..., model_id: Option<&str>)
- generate_speech_with_model(prompt, model_id: Option<&str>)
- generate_embeddings_with_model(request, model_id: Option<&str>)
- tokenize_with_model(..., model_id: Option<&str>)
- detokenize_with_model(..., model_id: Option<&str>)
- config_with_model(model_id: Option<&str>)
- max_sequence_length_with_model(model_id: Option<&str>)
- re_isq_model_with_model(isq_type, model_id: Option<&str>)
When model_id is None, the default model is used. If aliases are configured, you can pass either the alias or the canonical pipeline name.
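As a minimal sketch of the routing behavior (assuming a model built with MultiModelBuilder and the "qwen-text" alias from the example above):
use mistralrs::{Model, TextMessages, TextMessageRole};
// Illustrative only: `model` is assumed to come from the MultiModelBuilder example above.
async fn route_requests(model: &Model) -> anyhow::Result<()> {
    // None -> the request goes to the default model
    let to_default = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello, default model!");
    let _default_reply = model.send_chat_request_with_model(to_default, None).await?;
    // Some(alias or canonical pipeline name) -> the request goes to that model
    let to_qwen = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello, Qwen!");
    let _qwen_reply = model
        .send_chat_request_with_model(to_qwen, Some("qwen-text"))
        .await?;
    Ok(())
}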
Python SDK Usage
The Python Runner class supports multi-model operations directly.
Basic Usage
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture, Architecture
# Create a runner with a vision model (Gemma 3 4B)
runner = Runner(
which=Which.VisionPlain(
model_id="google/gemma-3-4b-it",
arch=VisionArchitecture.Gemma3,
),
in_situ_quant="Q4K",
)
# Or create a runner with a text model (Qwen3 4B)
# runner = Runner(
# which=Which.Plain(
# model_id="Qwen/Qwen3-4B",
# arch=Architecture.Qwen3,
# ),
# in_situ_quant="Q4K",
# )
# List models
models = runner.list_models()
print(f"Available models: {models}")
# Get/set default model
default = runner.get_default_model_id()
runner.set_default_model_id("google/gemma-3-4b-it")
# Send request with specific model_id
request = ChatCompletionRequest(
messages=[{"role": "user", "content": "Hello!"}]
)
response = runner.send_chat_completion_request(request, model_id=models[0])
If aliases are configured (for example via the server config or Rust MultiModelBuilder), list_models() will return those aliases and you can pass them in model_id. The canonical pipeline names remain accepted.
Model Management
# List models with their status
status = runner.list_models_with_status()
# Returns list of (model_id, status) tuples
# Check if a model is loaded
is_loaded = runner.is_model_loaded("google/gemma-3-4b-it")
# Unload a model to free memory
runner.unload_model("google/gemma-3-4b-it")
# Reload when needed
runner.reload_model("google/gemma-3-4b-it")
Request Methods with model_id
All request methods accept an optional model_id parameter:
# Chat completion
response = runner.send_chat_completion_request(request, model_id="model-id")
# Completion
response = runner.send_completion_request(request, model_id="model-id")
# Embeddings
embeddings = runner.send_embedding_request(request, model_id="model-id")
# Image generation
image = runner.generate_image(prompt, response_format, model_id="model-id")
# Speech generation
audio = runner.generate_audio(prompt, model_id="model-id")
# Tokenization
tokens = runner.tokenize_text(text, add_special_tokens=True, model_id="model-id")
text = runner.detokenize_text(tokens, skip_special_tokens=True, model_id="model-id")
When model_id is None or omitted, the default model is used.
Migration Guide
From MultiModel (Rust)
The MultiModel struct has been removed. Use Model directly with MultiModelBuilder:
#![allow(unused)]
fn main() {
// Old (deprecated)
let multi = MultiModel::new(...);
multi.send_chat_request_to_model(request, "model-id").await?;
// New - model IDs are pipeline names by default (aliases optional)
let model = MultiModelBuilder::new()
.add_model(VisionModelBuilder::new("google/gemma-3-4b-it"))
.add_model(TextModelBuilder::new("Qwen/Qwen3-4B"))
.build()
.await?;
model.send_chat_request_with_model(request, Some("Qwen/Qwen3-4B")).await?;
}
From MultiModelRunner (Python)
The MultiModelRunner class has been removed. Use Runner directly:
# Old (deprecated)
multi_runner = MultiModelRunner(runner)
multi_runner.send_chat_completion_request_to_model(request, "model-id")
# New - model IDs are the registered IDs (aliases if configured)
runner = Runner(which=Which.Plain(model_id="google/gemma-3-4b-it", ...))
runner.send_chat_completion_request(request, model_id="google/gemma-3-4b-it")
MCP (Model Context Protocol) Client
mistral.rs includes a built-in MCP client that allows models to connect to external tools and services through the Model Context Protocol. This enables automatic tool discovery and usage from any MCP-compatible server.
Quick Start
Examples below show HTTP (Hugging Face), Process (filesystem), and WebSocket transports. Replace hf_xxx with your actual Hugging Face token for HTTP examples.
Rust SDK
use mistralrs::{
TextModelBuilder, McpClientConfig, McpServerConfig, McpServerSource,
TextMessages, TextMessageRole,
};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Process example (filesystem server - recommended for getting started)
let mcp_config = McpClientConfig {
servers: vec![McpServerConfig {
name: "Filesystem Tools".to_string(),
source: McpServerSource::Process {
command: "npx".to_string(),
args: vec!["@modelcontextprotocol/server-filesystem".to_string(), ".".to_string()],
work_dir: None,
env: None,
},
..Default::default()
}],
auto_register_tools: true,
..Default::default()
};
// Alternative HTTP example (Hugging Face MCP server)
let _mcp_config_http = McpClientConfig {
servers: vec![McpServerConfig {
id: "hf_server".to_string(),
name: "Hugging Face MCP".to_string(),
source: McpServerSource::Http {
url: "https://hf.co/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
},
enabled: false, // Disabled by default
tool_prefix: Some("hf".to_string()),
resources: None,
bearer_token: Some("hf_xxx".to_string()), // Your HF token
}],
auto_register_tools: true,
tool_timeout_secs: Some(30),
max_concurrent_calls: Some(5),
};
// Alternative WebSocket example
let _mcp_config_websocket = McpClientConfig {
servers: vec![McpServerConfig {
name: "WebSocket Example".to_string(),
source: McpServerSource::WebSocket {
url: "wss://api.example.com/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
},
enabled: false, // Disabled by default
..Default::default()
}],
auto_register_tools: true,
..Default::default()
};
// Build model with MCP support
let model = TextModelBuilder::new("Qwen/Qwen3-4B")
.with_mcp_client(mcp_config)
.build()
.await?;
// Use the model - tools are automatically available
let messages = TextMessages::new()
.add_message(
TextMessageRole::User,
"List the files in the current directory and create a test.txt file"
);
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
Ok(())
}
Python SDK
import mistralrs
# Process example (filesystem server - recommended for getting started)
filesystem_server = mistralrs.McpServerConfigPy(
name="Filesystem Tools",
source=mistralrs.McpServerSourcePy.Process(
command="npx",
args=["@modelcontextprotocol/server-filesystem", "."],
work_dir=None,
env=None
)
)
# Alternative HTTP example (Hugging Face MCP server)
hf_server = mistralrs.McpServerConfigPy(
id="hf_server",
name="Hugging Face MCP",
source=mistralrs.McpServerSourcePy.Http(
url="https://hf.co/mcp",
timeout_secs=30,
headers=None
),
enabled=False, # Disabled by default
tool_prefix="hf",
resources=None,
bearer_token="hf_xxx" # Your HF token
)
# Alternative WebSocket example
websocket_server = mistralrs.McpServerConfigPy(
name="WebSocket Example",
source=mistralrs.McpServerSourcePy.WebSocket(
url="wss://api.example.com/mcp",
timeout_secs=30,
headers=None
),
enabled=False # Disabled by default
)
# Create MCP client config using filesystem server (others are disabled)
mcp_config = mistralrs.McpClientConfigPy(
servers=[filesystem_server], # hf_server, websocket_server can be added when enabled
auto_register_tools=True,
tool_timeout_secs=30,
max_concurrent_calls=5
)
# Build model with MCP support
runner = mistralrs.Runner(
which=mistralrs.Which.Plain(
model_id="Qwen/Qwen3-4B",
arch=mistralrs.Architecture.Qwen3,
),
mcp_client_config=mcp_config
)
# Use the model - tools are automatically available
res = runner.send_chat_completion_request(
mistralrs.ChatCompletionRequest(
model="default",
messages=[
{"role": "user", "content": "List the files in the current directory and create a test.txt file"}
],
max_tokens=500,
temperature=0.1,
)
)
print(res.choices[0].message.content)
HTTP API
- Create mcp-config.json:
Process Example (Recommended for getting started):
{
"servers": [{
"name": "Filesystem Tools",
"source": {
"type": "Process",
"command": "npx",
"args": ["@modelcontextprotocol/server-filesystem", "."]
}
}],
"auto_register_tools": true
}
Note: To install the filesystem server, run:
npx @modelcontextprotocol/server-filesystem . -y
HTTP Example (Hugging Face MCP Server):
{
"servers": [
{
"name": "Hugging Face MCP",
"source": {
"type": "Http",
"url": "https://hf.co/mcp",
"timeout_secs": 30
},
"bearer_token": "hf_xxx",
"tool_prefix": "hf",
"enabled": false
},
{
"name": "Filesystem Tools",
"source": {
"type": "Process",
"command": "npx",
"args": ["@modelcontextprotocol/server-filesystem", "."]
}
}
],
"auto_register_tools": true,
"tool_timeout_secs": 30,
"max_concurrent_calls": 5
}
WebSocket Example:
{
"servers": [
{
"name": "WebSocket Example",
"source": {
"type": "WebSocket",
"url": "wss://api.example.com/mcp",
"timeout_secs": 30
},
"enabled": false
},
{
"name": "Filesystem Tools",
"source": {
"type": "Process",
"command": "npx",
"args": ["@modelcontextprotocol/server-filesystem", "."]
}
}
],
"auto_register_tools": true
}
- Start server with MCP:
mistralrs serve \
-p 1234 \
--mcp-config mcp-config.json \
-m Qwen/Qwen3-4B
- Use the API:
curl -X POST http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"messages": [
{"role": "user", "content": "List the files in the current directory and create a test.txt file"}
],
"max_tokens": 500,
"temperature": 0.1
}'
Key Features
- Automatic Tool Discovery: Tools are discovered from MCP servers at startup
- Multi-Server Support: Connect to multiple MCP servers simultaneously
- Transport Flexibility: HTTP, WebSocket, and Process transports supported
- Authentication: Bearer token support for secure connections
- Tool Prefixing: Avoid naming conflicts between servers
- Concurrency Control: Limit parallel tool executions
- Timeout Management: Control individual tool execution timeouts
Next Steps
- Configuration Reference - Detailed configuration options
- Transport Types - HTTP, WebSocket, and Process transports
- Advanced Usage - Multi-server setups, custom headers, and more
- MCP Server Development - Building your own MCP server
Common MCP Servers
- Filesystem: @modelcontextprotocol/server-filesystem - Local file operations (Process)
- Hugging Face: https://hf.co/mcp - Access HF models, datasets, and spaces (HTTP)
- Postgres: @modelcontextprotocol/server-postgres - Database operations (Process)
Additional servers (install separately):
- Brave Search - Web search capabilities
- GitHub - GitHub API access
Replace placeholder tokens and URLs with actual values for your use case.
Troubleshooting
Common Issues
“MCP server failed to start” or “npx command not found”
- Install Node.js and npm: curl -fsSL https://deb.nodesource.com/setup_lts.x | sudo -E bash - && sudo apt-get install -y nodejs
- Install the filesystem server: npx @modelcontextprotocol/server-filesystem . -y
“No tools available” or “tools_available: false”
- Check server logs for MCP connection errors
- Verify the MCP config file path is correct
- Ensure the MCP server process is running:
ps aux | grep mcp
“Tool call failed” or timeout errors
- Increase tool_timeout_secs in your config (default: 30)
- Check the max_concurrent_calls setting (start with 1-5)
- Verify file permissions for filesystem operations
Authentication errors with HTTP servers
- Double-check bearer_token values (e.g., HF tokens start with hf_)
- Verify API endpoints are accessible:
curl -H "Authorization: Bearer YOUR_TOKEN" https://hf.co/mcp
Need help?
- MCP Server Registry - Find more servers
- Discord Community - Get support
MCP protocol support
mistralrs serve can speak MCP (the Model Context Protocol) in addition to the regular OpenAI-compatible REST API.
At a high level, MCP is an opinionated, tool-based JSON-RPC 2.0 protocol that lets clients interact with models through structured tool calls instead of specialised HTTP routes.
The implementation in Mistral.rs is powered by rust-mcp-sdk and automatically registers tools based on the modalities supported by the loaded model (text, vision, …).
Exposed tools:
| Tool | Minimum input -> output modalities | Description |
|---|---|---|
chat | Text → Text | Wraps the OpenAI /v1/chat/completions endpoint |
Running
Start the normal HTTP server and add the --mcp-port flag to expose an MCP endpoint in parallel on a separate port:
mistralrs serve \
-p 1234 \
--mcp-port 4321 \
-m mistralai/Mistral-7B-Instruct-v0.3
Check if it’s working
The following curl command lists the tools advertised by the server and therefore serves as a quick smoke-test:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
Example clients
Python
The reference Python SDK can be installed via:
pip install --upgrade mcp
Here is a minimal end-to-end example that initialises a session, lists the available tools and finally sends a chat request:
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client
SERVER_URL = "http://localhost:4321/mcp"
async def main() -> None:
# The helper creates an SSE (Server-Sent-Events) transport under the hood
async with streamablehttp_client(SERVER_URL) as (read, write, _):
async with ClientSession(read, write) as session:
# --- INITIALIZE ---
init_result = await session.initialize()
print("Server info:", init_result.serverInfo)
# --- LIST TOOLS ---
tools = await session.list_tools()
print("Available tools:", [t.name for t in tools.tools])
# --- CALL TOOL ---
resp = await session.call_tool(
"chat",
arguments={
"messages": [
{"role": "user", "content": "Hello MCP 👋"},
{"role": "assistant", "content": "Hi there!"}
],
"maxTokens": 50,
"temperature": 0.7,
},
)
# resp.content is a list[CallToolResultContentItem]; extract text parts
text = "\n".join(c.text for c in resp.content if c.type == "text")
print("Model replied:", text)
if __name__ == "__main__":
asyncio.run(main())
Rust
use anyhow::Result;
use rust_mcp_sdk::{
mcp_client::client_runtime,
schema::{
CallToolRequestParams, ClientCapabilities, CreateMessageRequest,
Implementation, InitializeRequestParams, Message, LATEST_PROTOCOL_VERSION,
},
ClientSseTransport, ClientSseTransportOptions,
};
struct Handler;
#[async_trait::async_trait]
impl rust_mcp_sdk::mcp_client::ClientHandler for Handler {}
#[tokio::main]
async fn main() -> Result<()> {
let transport = ClientSseTransport::new(
"http://localhost:4321/mcp",
ClientSseTransportOptions::default(),
)?;
let details = InitializeRequestParams {
capabilities: ClientCapabilities::default(),
client_info: Implementation { name: "mcp-client".into(), version: "0.1".into() },
protocol_version: LATEST_PROTOCOL_VERSION.into(),
};
let client = client_runtime::create_client(details, transport, Handler);
client.clone().start().await?;
let req = CreateMessageRequest {
model: "mistralai/Mistral-7B-Instruct-v0.3".into(),
messages: vec![Message::user("Explain Rust ownership.")],
..Default::default()
};
let result = client
.call_tool(CallToolRequestParams::new("chat", req.into()))
.await?;
println!("{}", result.content[0].as_text_content()?.text);
client.shut_down().await?;
Ok(())
}
HTTP
Call a tool:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "chat",
"arguments": {
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello, what’s the time?" }
],
"maxTokens": 50,
"temperature": 0.7
}
}
}'
Initialize:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {}
}'
List tools:
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
Limitations & roadmap
The MCP support that ships with the current Mistral.rs release focuses on the happy path. A few niceties have not yet been implemented, and PRs are more than welcome:
- Streaming token responses (similar to the stream=true flag in the OpenAI API).
- An authentication layer – if you are exposing the MCP port publicly, run it behind a reverse proxy that handles auth (e.g. nginx + OIDC).
- Additional tools for other modalities such as vision or audio once the underlying crates stabilise.
If you would like to work on any of the above please open an issue first so the work can be coordinated.
MCP Configuration Reference
This page provides a complete reference for configuring the MCP client in mistral.rs.
Quick Start - Minimal Configuration
For simple use cases, you can now use a minimal configuration that leverages smart defaults:
{
"servers": [{
"name": "Hugging Face MCP Server",
"source": {
"type": "Http",
"url": "https://hf.co/mcp"
},
"bearer_token": "hf_xxx"
}]
}
This automatically provides:
- UUID-based server ID: Unique identifier generated automatically
- Enabled by default: Server is active without explicit enabled: true
- UUID-based tool prefix: Prevents naming conflicts automatically
- No timeouts: Tools and connections don’t timeout by default
- Sequential execution: Only 1 concurrent tool call to prevent overwhelming servers
- Auto-registration: Tools are automatically discovered and registered
Configuration Structure
McpClientConfig
The top-level configuration for the MCP client:
{
"servers": [...], // Array of MCP server configurations
"auto_register_tools": true, // Automatically register discovered tools (default: true)
"tool_timeout_secs": null, // Timeout for individual tool calls, null = no timeout (default: null)
"max_concurrent_calls": 1 // Maximum concurrent tool executions (default: 1)
}
McpServerConfig
Configuration for each MCP server:
{
"id": "unique_id", // Unique identifier (default: UUID if not specified)
"name": "Display Name", // Human-readable name
"source": {...}, // Transport configuration (see below)
"enabled": true, // Enable/disable this server (default: true)
"tool_prefix": "mcp_abc123", // Prefix for tool names (default: UUID-based if not specified)
"resources": ["pattern"], // Optional resource patterns
"bearer_token": "token" // Optional authentication token
}
Transport Source Configuration
HTTP Transport
{
"type": "Http",
"url": "https://api.example.com/mcp",
"timeout_secs": null, // Optional, null = no timeout (default)
"headers": { // Optional custom headers
"X-API-Version": "v1",
"User-Agent": "mistral-rs/0.6.0"
}
}
WebSocket Transport
{
"type": "WebSocket",
"url": "wss://realtime.example.com/mcp",
"timeout_secs": null, // Optional, null = no timeout (default)
"headers": { // Optional WebSocket headers
"Origin": "https://mistral.rs",
"Sec-WebSocket-Protocol": "mcp"
}
}
Process Transport
{
"type": "Process",
"command": "mcp-server-filesystem",
"args": ["--root", "/tmp"], // Command arguments
"work_dir": "/home/user", // Optional working directory
"env": { // Optional environment variables
"MCP_LOG_LEVEL": "info"
}
}
Field Reference
McpClientConfig Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
servers | Array | Yes | - | List of MCP server configurations |
auto_register_tools | Boolean | No | true | Automatically discover and register tools at startup |
tool_timeout_secs | Integer | No | null | Timeout in seconds for individual tool calls (null = no timeout) |
max_concurrent_calls | Integer | No | 1 | Maximum number of concurrent tool executions |
McpServerConfig Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
id | String | No | UUID | Unique identifier for the server (UUID generated if not provided) |
name | String | Yes | - | Human-readable server name |
source | Object | Yes | - | Transport configuration |
enabled | Boolean | No | true | Whether to connect to this server |
tool_prefix | String | No | UUID-based | Prefix to add to all tool names (UUID-based if not provided) |
resources | Array | No | None | Resource URI patterns to subscribe to |
bearer_token | String | No | None | Bearer token for authentication |
Transport Source Fields
HTTP Source
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
type | String | Yes | - | Must be “Http” |
url | String | Yes | - | HTTP/HTTPS URL of the MCP server |
timeout_secs | Integer | No | null | Request timeout in seconds (null = no timeout) |
headers | Object | No | None | Additional HTTP headers |
WebSocket Source
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
type | String | Yes | - | Must be “WebSocket” |
url | String | Yes | - | WS/WSS URL of the MCP server |
timeout_secs | Integer | No | null | Connection timeout in seconds (null = no timeout) |
headers | Object | No | None | WebSocket handshake headers |
Process Source
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
type | String | Yes | - | Must be “Process” |
command | String | Yes | - | Executable command to run |
args | Array | No | [] | Command line arguments |
work_dir | String | No | Current dir | Working directory |
env | Object | No | None | Environment variables |
Authentication
Bearer Token
The bearer_token field is automatically added as an Authorization: Bearer <token> header for HTTP and WebSocket connections.
{
"bearer_token": "hf_AbCdEfGhIjKlMnOpQrStUvWxYz"
}
Custom Headers
For other authentication schemes, use the headers field:
{
"source": {
"type": "Http",
"url": "https://api.example.com/mcp",
"headers": {
"X-API-Key": "your-api-key",
"X-Client-ID": "your-client-id"
}
}
}
Tool Naming
Without Prefix
Tools are registered with their original names:
- MCP tool: search → Registered as: search
With Prefix
When tool_prefix is set, all tools from that server get prefixed:
- MCP tool: search with prefix web → Registered as: web_search
This prevents conflicts when multiple servers provide tools with the same name.
Resource Patterns
The resources field accepts glob-like patterns:
{
"resources": [
"file://**/*.txt", // All .txt files
"file://data/**", // Everything under data/
"db://users/*", // All user records
"api://v1/metrics" // Specific endpoint
]
}
Environment Variables
Using Environment Variables in Configuration
While JSON doesn’t support environment variables directly, you can use them when building configurations programmatically:
#![allow(unused)]
fn main() {
McpServerConfig {
bearer_token: std::env::var("HF_TOKEN").ok(),
source: McpServerSource::Http {
url: std::env::var("MCP_SERVER_URL")
.unwrap_or_else(|_| "https://hf.co/mcp".to_string()),
// ...
},
// ...
}
}
import os
McpServerConfigPy(
bearer_token=os.getenv("HF_TOKEN"),
source=McpServerSourcePy.Http(
url=os.getenv("MCP_SERVER_URL", "https://hf.co/mcp")
)
)
MCP-Related Environment Variables
| Variable | Description |
|---|---|
MCP_CONFIG_PATH | Path to MCP configuration file |
MCP_LOG_LEVEL | Logging level for MCP operations |
MCP_POOL_SIZE | Connection pool size for HTTP/WebSocket |
Validation Rules
- Unique Server IDs: All server id values must be unique
- Valid URLs: HTTP URLs must start with http:// or https://
- Valid WebSocket URLs: Must start with ws:// or wss://
- Executable Commands: Process commands must be executable
- Tool Name Conflicts: Use tool_prefix to avoid conflicts
Example Configurations
Single Server (Hugging Face) - Minimal
{
"servers": [{
"name": "Hugging Face MCP Server",
"source": {
"type": "Http",
"url": "https://hf.co/mcp"
},
"bearer_token": "hf_xxx"
}]
}
Single Server (Hugging Face) - Full Configuration
{
"servers": [{
"id": "hf",
"name": "Hugging Face MCP",
"source": {
"type": "Http",
"url": "https://hf.co/mcp",
"timeout_secs": 30
},
"enabled": true,
"tool_prefix": "hf",
"bearer_token": "hf_xxx"
}],
"auto_register_tools": true,
"tool_timeout_secs": 30,
"max_concurrent_calls": 5
}
Multi-Server Setup
{
"servers": [
{
"id": "hf",
"name": "Hugging Face",
"source": {"type": "Http", "url": "https://hf.co/mcp"},
"tool_prefix": "hf",
"bearer_token": "hf_xxx"
},
{
"id": "github",
"name": "GitHub API",
"source": {"type": "Http", "url": "https://api.github.com/mcp"},
"tool_prefix": "gh",
"bearer_token": "ghp_xxx"
},
{
"id": "local_fs",
"name": "Filesystem",
"source": {
"type": "Process",
"command": "mcp-server-filesystem",
"args": ["--root", "/data", "--readonly"]
},
"tool_prefix": "fs"
}
],
"auto_register_tools": true,
"tool_timeout_secs": 30,
"max_concurrent_calls": 10
}
MCP Transport Types
mistral.rs supports three transport types for connecting to MCP servers, each optimized for different use cases.
HTTP Transport
Best for public APIs, RESTful services, and servers behind load balancers.
Configuration
{
"source": {
"type": "Http",
"url": "https://api.example.com/mcp",
"timeout_secs": 30,
"headers": {
"X-API-Version": "v1",
"User-Agent": "mistral-rs/0.6.0"
}
},
"bearer_token": "your-api-token"
}
Features
- Server-Sent Events (SSE) support for streaming responses
- Custom headers for API versioning or client identification
- Bearer token authentication (added as Authorization: Bearer <token>)
- Configurable timeouts
- Standard HTTP semantics
Example: Hugging Face MCP
#![allow(unused)]
fn main() {
McpServerSource::Http {
url: "https://hf.co/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
}
}
WebSocket Transport
Best for real-time applications, bidirectional communication, and low-latency requirements.
Configuration
{
"source": {
"type": "WebSocket",
"url": "wss://realtime.example.com/mcp",
"timeout_secs": 60,
"headers": {
"Origin": "https://mistral.rs",
"Sec-WebSocket-Protocol": "mcp"
}
},
"bearer_token": "your-websocket-token"
}
Features
- Persistent connections reduce handshake overhead
- Server-initiated notifications
- Lower latency for frequent tool calls
- Automatic reconnection handling
- WebSocket-specific headers support
Example: Real-time Data Feed
#![allow(unused)]
fn main() {
McpServerSource::WebSocket {
url: "wss://data.example.com/mcp".to_string(),
timeout_secs: Some(60),
headers: Some(headers),
}
}
Process Transport
Best for local tools, development servers, and sandboxed environments.
Configuration
{
"source": {
"type": "Process",
"command": "mcp-server-filesystem",
"args": ["--root", "/tmp", "--readonly"],
"work_dir": "/home/user/workspace",
"env": {
"MCP_LOG_LEVEL": "info",
"MCP_TIMEOUT": "30"
}
}
}
Features
- No network overhead
- Process isolation for security
- Direct stdin/stdout communication
- Environment variable configuration
- Working directory control
- No authentication needed (process inherits permissions)
Example: Filesystem Server
#![allow(unused)]
fn main() {
McpServerSource::Process {
command: "mcp-server-filesystem".to_string(),
args: vec!["--root".to_string(), "/tmp".to_string()],
work_dir: None,
env: None,
}
}
Transport Selection Guide
| Use Case | Recommended Transport | Why |
|---|---|---|
| Public APIs | HTTP | Standard auth, caching, load balancing |
| Local tools | Process | No network, process isolation |
| Real-time data | WebSocket | Low latency, server push |
| Corporate proxies | HTTP | Proxy support, standard ports |
| Development | Process | Easy debugging, no network setup |
| Interactive apps | WebSocket | Bidirectional, persistent connection |
Security Considerations
HTTP
- Always use HTTPS in production
- Bearer tokens transmitted with each request
- Consider token rotation strategies
WebSocket
- Use WSS (WebSocket Secure) in production
- Bearer token sent during handshake
- Connection persists with authenticated state
Process
- Inherits user permissions
- Sandboxing via work_dir and env
- No network exposure
Performance Tips
- HTTP: Enable keep-alive, use connection pooling
- WebSocket: Reuse connections, handle reconnection gracefully
- Process: Minimize startup time, use long-running processes
Error Handling
All transports implement automatic retry with exponential backoff:
- Initial retry: 1 second
- Max retry: 60 seconds
- Max attempts: 5
Custom retry behavior can be configured per server.
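As a rough illustration of that schedule (1 s initial delay, doubling up to 60 s, at most 5 attempts) – and not the library's internal implementation – a generic async retry wrapper could look like the following sketch (assumes the tokio crate):
use std::time::Duration;
// Illustrative retry helper mirroring the schedule described above.
// Not part of mistral.rs; wrap your own fallible async operation with it.
async fn retry_with_backoff<T, E, F, Fut>(mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_secs(1);   // initial retry: 1 second
    let max_delay = Duration::from_secs(60);  // max retry: 60 seconds
    let max_attempts = 5;                     // max attempts: 5
    for attempt in 1..=max_attempts {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt == max_attempts => return Err(err),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
    unreachable!("the loop always returns within max_attempts")
}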
Advanced MCP Usage
This guide covers advanced MCP client configurations and usage patterns.
Multi-Server Configuration
Connect to multiple MCP servers simultaneously to access different tool sets:
#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
servers: vec![
// Hugging Face for ML tools
McpServerConfig {
id: "hf_server".to_string(),
name: "Hugging Face MCP".to_string(),
source: McpServerSource::Http {
url: "https://hf.co/mcp".to_string(),
timeout_secs: Some(30),
headers: None,
},
enabled: true,
tool_prefix: Some("hf".to_string()),
resources: None,
bearer_token: Some("hf_xxx".to_string()),
},
// Local filesystem access
McpServerConfig {
id: "fs_server".to_string(),
name: "Filesystem MCP".to_string(),
source: McpServerSource::Process {
command: "mcp-server-filesystem".to_string(),
args: vec!["--root".to_string(), "/data".to_string()],
work_dir: None,
env: None,
},
enabled: true,
tool_prefix: Some("fs".to_string()),
resources: Some(vec!["file://**".to_string()]),
bearer_token: None,
},
// GitHub API access
McpServerConfig {
id: "github_server".to_string(),
name: "GitHub MCP".to_string(),
source: McpServerSource::Http {
url: "https://api.github.com/mcp".to_string(),
timeout_secs: Some(45),
headers: Some(HashMap::from([
("Accept".to_string(), "application/vnd.github.v3+json".to_string()),
])),
},
enabled: true,
tool_prefix: Some("gh".to_string()),
resources: None,
bearer_token: Some("ghp_xxx".to_string()),
},
],
auto_register_tools: true,
tool_timeout_secs: Some(30),
max_concurrent_calls: Some(10),
};
}
Tool Prefixing Strategy
When using multiple servers, tool prefixes prevent naming conflicts:
{
"servers": [
{
"id": "server1",
"tool_prefix": "s1",
// Tool "search" becomes "s1_search"
},
{
"id": "server2",
"tool_prefix": "s2",
// Tool "search" becomes "s2_search"
}
]
}
Custom Headers and Authentication
API Key in Headers
#![allow(unused)]
fn main() {
let mut headers = HashMap::new();
headers.insert("X-API-Key".to_string(), "your-api-key".to_string());
headers.insert("X-Client-Version".to_string(), "1.0.0".to_string());
McpServerSource::Http {
url: "https://api.example.com/mcp".to_string(),
timeout_secs: Some(30),
headers: Some(headers),
}
}
OAuth2 Bearer Token
#![allow(unused)]
fn main() {
McpServerConfig {
// ...
bearer_token: Some("your-oauth2-token".to_string()),
// Automatically added as: Authorization: Bearer your-oauth2-token
}
}
Resource Subscriptions
Subscribe to specific resource patterns from MCP servers:
#![allow(unused)]
fn main() {
McpServerConfig {
id: "data_server".to_string(),
// ...
resources: Some(vec![
"file://data/**/*.json".to_string(), // All JSON files in data/
"db://users/*".to_string(), // All user records
"api://v1/metrics".to_string(), // Specific API endpoint
]),
// ...
}
}
Concurrency and Rate Limiting
Global Concurrency Control
#![allow(unused)]
fn main() {
McpClientConfig {
// ...
max_concurrent_calls: Some(5), // Max 5 tools executing simultaneously
}
}
Per-Tool Timeouts
#![allow(unused)]
fn main() {
McpClientConfig {
// ...
tool_timeout_secs: Some(30), // Each tool call times out after 30s
}
}
Custom Rate Limiting
# Python example with custom rate limiting
import time
from collections import deque
class RateLimitedMcpRunner:
def __init__(self, runner, max_calls_per_minute=60):
self.runner = runner
self.max_calls = max_calls_per_minute
self.call_times = deque()
def send_chat_completion_request(self, request):
# Remove calls older than 1 minute
now = time.time()
while self.call_times and self.call_times[0] < now - 60:
self.call_times.popleft()
# Check rate limit
if len(self.call_times) >= self.max_calls:
sleep_time = 60 - (now - self.call_times[0])
time.sleep(sleep_time)
# Make the call
self.call_times.append(now)
return self.runner.send_chat_completion_request(request)
Environment-Specific Configuration
Development vs Production
#![allow(unused)]
fn main() {
let mcp_config = if cfg!(debug_assertions) {
McpClientConfig {
servers: vec![/* development servers */],
tool_timeout_secs: Some(60), // Longer timeouts for debugging
max_concurrent_calls: Some(1), // Sequential execution for debugging
// ...
}
} else {
McpClientConfig {
servers: vec![/* production servers */],
tool_timeout_secs: Some(10), // Strict timeouts
max_concurrent_calls: Some(20), // Higher concurrency
// ...
}
};
}
Environment Variables
#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
servers: vec![
McpServerConfig {
// ...
bearer_token: std::env::var("HF_TOKEN").ok(),
source: McpServerSource::Http {
url: std::env::var("MCP_SERVER_URL")
.unwrap_or_else(|_| "https://hf.co/mcp".to_string()),
// ...
},
// ...
},
],
// ...
};
}
Error Handling and Fallbacks
Graceful Degradation
#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
servers: vec![
// Primary server
McpServerConfig {
id: "primary".to_string(),
enabled: true,
// ...
},
// Fallback server
McpServerConfig {
id: "fallback".to_string(),
enabled: check_primary_health().is_err(),
// ...
},
],
// ...
};
}
Tool-Specific Error Handling
# Handle specific tool errors
try:
response = runner.send_chat_completion_request(request)
except Exception as e:
if "tool_timeout" in str(e):
print("Tool execution timed out, trying with longer timeout...")
# Retry with extended timeout
elif "tool_not_found" in str(e):
print("Tool not available, falling back to built-in response...")
# Fallback logic
Monitoring and Debugging
Enable Debug Logging
#![allow(unused)]
fn main() {
std::env::set_var("RUST_LOG", "mistralrs_mcp=debug");
env_logger::init();
}
Tool Call Inspection
#![allow(unused)]
fn main() {
let response = model.send_chat_request(messages).await?;
// Check if tools were called
if let Some(tool_calls) = &response.choices[0].message.tool_calls {
for call in tool_calls {
println!("Tool: {}", call.function.name);
println!("Args: {}", call.function.arguments);
println!("ID: {}", call.id);
}
}
}
Performance Optimization
Connection Pooling
HTTP and WebSocket transports automatically use connection pooling. Configure pool size:
#![allow(unused)]
fn main() {
// Set via environment variable
std::env::set_var("MCP_POOL_SIZE", "10");
}
Caching Tool Responses
from functools import lru_cache
import json
@lru_cache(maxsize=100)
def cached_tool_call(tool_name, args_json):
args = json.loads(args_json)
# Tool execution logic
return result
# Use with MCP tools that have deterministic outputs
Security Best Practices
- Token Rotation: Implement automatic token refresh for long-running applications
- Least Privilege: Only enable required tools and resources
- Audit Logging: Log all tool calls for security monitoring
- Network Isolation: Use Process transport for sensitive local operations
- Input Validation: MCP servers should validate all tool inputs
Configuration Reference
This document covers environment variables and server configuration for mistral.rs.
Runtime Environment Variables
| Variable | Description |
|---|---|
MISTRALRS_DEBUG=1 | Enable debug mode: outputs tensor info files for GGUF/GGML models, increases logging verbosity |
MISTRALRS_NO_MMAP=1 | Disable memory-mapped file loading, forcing all tensor data into memory |
MISTRALRS_NO_MLA=1 | Disable MLA (Multi-head Latent Attention) optimization for DeepSeek V2/V3 and GLM-4.7-Flash |
MISTRALRS_ISQ_SINGLETHREAD=1 | Force ISQ (In-Situ Quantization) to run single-threaded |
MCP_CONFIG_PATH | Fallback path for MCP client configuration (used if --mcp-config not provided) |
KEEP_ALIVE_INTERVAL | SSE keep-alive interval in milliseconds (default: 10000) |
HF_HUB_CACHE | Override Hugging Face Hub cache directory |
Build-Time Environment Variables
| Variable | Description |
|---|---|
MISTRALRS_METAL_PRECOMPILE=0 | Skip Metal kernel precompilation (useful for CI) |
NVCC_CCBIN | Set CUDA compiler path |
CUDA_NVCC_FLAGS=-fPIE | Required on some Linux distributions |
CUDA_COMPUTE_CAP | Override CUDA compute capability (e.g., “80” for RTX 3090) |
Server Defaults
When running the HTTP server with mistralrs serve, these defaults apply:
| Setting | Default Value |
|---|---|
| Server IP | 0.0.0.0 (all interfaces) |
| Max request body | 50 MB |
| Max running sequences | 16 |
| Prefix cache count | 16 |
| SSE keep-alive | 10 seconds |
| PagedAttention (CUDA) | Enabled |
| PagedAttention (Metal) | Disabled |
| PA GPU memory usage | 90% of free memory |
| PA block size | 32 tokens |
Multi-Node Distributed Configuration
For multi-node setups, configure the head node and workers using environment variables.
Head Node
| Variable | Description |
|---|---|
MISTRALRS_MN_GLOBAL_WORLD_SIZE | Total number of devices across all nodes |
MISTRALRS_MN_HEAD_NUM_WORKERS | Number of worker nodes |
MISTRALRS_MN_HEAD_PORT | Port for head node communication |
Worker Nodes
| Variable | Description |
|---|---|
MISTRALRS_MN_WORKER_SERVER_ADDR | Address of head server to connect to |
MISTRALRS_MN_WORKER_ID | This worker’s ID |
MISTRALRS_MN_LOCAL_WORLD_SIZE | Number of GPUs on this node |
MISTRALRS_NO_NCCL=1 | Disable NCCL (use alternative backend) |
See Also
- CLI Reference - Command-line options
- CLI TOML Configuration - File-based configuration
- Distributed Inference - Multi-node setup guide
- PagedAttention - Memory management options
Engine Internals
This document describes internal engine behaviors in mistral.rs.
Overview
The mistral.rs engine manages model inference through a background thread pool. Each loaded model runs in its own engine thread, which handles request queuing, batching, and execution.
Warmup Run
When a text or vision model is loaded in a multi-threaded runtime, mistral.rs automatically performs a warmup (“dummy”) run:
- Sends a short completion request (“hello” with max 1 token) to initialize CUDA kernels and caches
- Logs “Beginning dummy run.” when starting and “Dummy run completed in Xs.” when finished
- Helps ensure more consistent performance for the first real user request
- Only runs for text and vision models (not diffusion/speech)
This warmup ensures that CUDA kernel compilation and memory allocation happens during model loading rather than during the first user request.
Automatic Engine Recovery
If the inference engine thread dies unexpectedly (e.g., due to a panic), mistral.rs can automatically recover:
- Detects dead engine threads when sending requests
- Automatically reboots the engine using saved configuration
- Logs “Engine {model_id} is dead, rebooting” followed by “Successfully rebooted engine {model_id}”
- Preserves all original configuration including KV cache settings, prefix cache, and tool callbacks
This ensures high availability without manual intervention.
Thread Model
Each model loaded in mistral.rs runs in its own dedicated engine thread:
- Main Thread: Handles HTTP requests, CLI interaction, and dispatches work to engine threads
- Engine Threads: Each loaded model has a dedicated thread for inference
- Background Workers: Tokenization and other preprocessing can run in parallel
For multi-model setups, each model gets its own engine thread, allowing true parallel inference across different models.
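Because requests to different models do not share an engine thread, they can also be awaited concurrently from the Rust SDK. The sketch below is illustrative only; it reuses the "gemma-vision" and "qwen-text" aliases from the Multi-Model Support examples and assumes tokio and anyhow.
use mistralrs::{Model, TextMessages, TextMessageRole};
// Illustrative only: `model` is assumed to be a multi-model instance built with
// MultiModelBuilder as shown in the Multi-Model Support section.
async fn parallel_requests(model: &Model) -> anyhow::Result<()> {
    let a = TextMessages::new()
        .add_message(TextMessageRole::User, "Describe Rust in one sentence.");
    let b = TextMessages::new()
        .add_message(TextMessageRole::User, "Describe Python in one sentence.");
    // Each request is executed by its model's own engine thread, so the two
    // futures make progress in parallel.
    let (resp_a, resp_b) = tokio::try_join!(
        model.send_chat_request_with_model(a, Some("gemma-vision")),
        model.send_chat_request_with_model(b, Some("qwen-text")),
    )?;
    println!("{}", resp_a.choices[0].message.content.as_ref().unwrap());
    println!("{}", resp_b.choices[0].message.content.as_ref().unwrap());
    Ok(())
}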
See Also
- Multi-Model Support - Load and manage multiple models
- Configuration - Environment variables affecting engine behavior
- PagedAttention - Memory management for high throughput