
Introduction

mistral.rs

I want to… | Go to…
Install mistral.rs | Installation Guide
Understand cargo features | Cargo Features
Run a model | CLI Reference
Use the HTTP API | HTTP Server
Fix an error | Troubleshooting
Configure environment | Configuration
Check model support | Supported Models

Getting Started

SDKs & APIs

Models

By Category

Model-Specific Guides


Quantization & Optimization

Adapters & Model Customization

Performance & Hardware

Features

MCP (Model Context Protocol)

Reference


Contributing

See the main README for contribution guidelines.

Installation Guide

The install script automatically detects your hardware (CUDA, Metal, MKL) and builds with optimal features.

Linux/macOS:

curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

Windows (PowerShell):

irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

Prerequisites

  1. Install required packages:

    • OpenSSL: sudo apt install libssl-dev (Ubuntu)
    • pkg-config (Linux only): sudo apt install pkg-config
  2. Install Rust from https://rustup.rs/

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    source $HOME/.cargo/env
    
  3. (Optional) Set up HuggingFace authentication:

    mistralrs login
    

    Or use huggingface-cli login as documented here.
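
    If you prefer not to cache a token, you can also pass one at request time via an environment variable (token value and model are placeholders):

    export HF_TOKEN=hf_xxxxxxxxxxxxx
    mistralrs run --token-source env:HF_TOKEN -m meta-llama/Llama-3.2-3B-Instruct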

Supported Accelerators

Accelerator | Feature Flag | Additional Flags
NVIDIA GPUs (CUDA) | cuda | flash-attn, flash-attn-v3, cudnn
Apple Silicon GPU (Metal) | metal | -
CPU (Intel) | mkl | -
CPU (Apple Accelerate) | accelerate | -
Generic CPU (ARM/AVX) | none | ARM NEON / AVX enabled by default

Note for Linux users: The metal feature is macOS-only. Use --features "cuda flash-attn cudnn" for NVIDIA GPUs or --features mkl for Intel CPUs instead of --all-features.

Feature Detection

Determine which features to enable based on your hardware:

Hardware | Features
NVIDIA GPU (Ampere+, compute capability >= 8.0) | cuda cudnn flash-attn
NVIDIA GPU (Hopper, compute capability 9.0) | cuda cudnn flash-attn flash-attn-v3
NVIDIA GPU (older) | cuda cudnn
Apple Silicon (macOS) | metal accelerate
Intel CPU with MKL | mkl
CPU only | (no features needed)

Install from crates.io

cargo install mistralrs-cli --features "<your-features>"

Example:

cargo install mistralrs-cli --features "cuda flash-attn cudnn"

Build from Source

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo install --path mistralrs-cli --features "<your-features>"

Example:

cargo build --release --features "cuda flash-attn cudnn"

Docker

Docker images are available for quick deployment:

docker pull ghcr.io/ericlbuehler/mistral.rs:latest
docker run --gpus all -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
  serve -m Qwen/Qwen3-4B

Docker images on GitHub Container Registry

Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/

Python SDK

Install the Python package:

pip install mistralrs-cuda    # For NVIDIA GPUs
pip install mistralrs-metal   # For Apple Silicon
pip install mistralrs-mkl     # For Intel CPUs
pip install mistralrs         # CPU-only

Verify Installation

After installation, verify everything works:

# Check CLI is installed
mistralrs --help

# Run system diagnostics
mistralrs doctor

# Test with a small model
mistralrs run -m Qwen/Qwen3-0.6B

Getting Models

From Hugging Face Hub (Default)

Models download automatically from Hugging Face Hub:

mistralrs run -m meta-llama/Llama-3.2-3B-Instruct

For gated models, authenticate first:

mistralrs login
# Or: mistralrs run --token-source env:HF_TOKEN -m <model>

From Local Files

Pass a path to a downloaded model:

mistralrs run -m /path/to/model

Running GGUF Models

mistralrs run --format gguf -m author/model-repo -f model-quant.gguf

Specify tokenizer if needed:

mistralrs run --format gguf -m author/model-repo -f file.gguf -t author/official-tokenizer

Cargo Features Reference

This document provides a complete reference for all cargo features available in mistral.rs.

Quick Reference

Feature | Description | Platform | Requires
cuda | NVIDIA GPU acceleration | Linux, Windows | CUDA toolkit
cudnn | NVIDIA cuDNN backend | Linux, Windows | cuda, cuDNN
flash-attn | FlashAttention V2 | Linux, Windows | cuda, CC >= 8.0
flash-attn-v3 | FlashAttention V3 | Linux, Windows | cuda, CC >= 9.0
metal | Apple GPU acceleration | macOS | -
accelerate | Apple CPU acceleration | macOS | -
mkl | Intel MKL acceleration | Linux, Windows | Intel MKL
nccl | Multi-GPU (NVIDIA NCCL) | Linux | cuda, NCCL
ring | Multi-GPU/node (TCP ring) | All | -

GPU Acceleration Features

cuda

Enables NVIDIA GPU acceleration via CUDA. This is the primary feature for running on NVIDIA GPUs.

Requirements:

  • NVIDIA GPU
  • CUDA toolkit installed
  • Linux or Windows (WSL supported)

Usage:

cargo build --release --features cuda
cargo install mistralrs-cli --features cuda

What it enables:

  • GPU tensor operations via CUDA
  • PagedAttention on CUDA devices
  • Quantized inference on GPU

cudnn

Enables NVIDIA cuDNN for optimized neural network primitives. Provides faster convolutions and other operations.

Requirements:

  • cuda feature
  • cuDNN library installed

Usage:

cargo build --release --features "cuda cudnn"

flash-attn

Enables FlashAttention V2 for faster attention computation. Significantly reduces memory usage and improves throughput.

Requirements:

  • cuda feature (automatically enabled)
  • GPU with compute capability >= 8.0 (Ampere or newer)

Compatible GPUs:

Architecture | Compute Capability | Example GPUs
Ampere | 8.0, 8.6 | RTX 30 series, A100, A40
Ada Lovelace | 8.9 | RTX 40 series, L40S
Blackwell | 10.0, 12.0 | RTX 50 series

Usage:

cargo build --release --features "cuda flash-attn cudnn"

Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.


flash-attn-v3

Enables FlashAttention V3 for Hopper architecture GPUs. Provides additional performance improvements over V2 on supported hardware.

Requirements:

  • cuda feature (automatically enabled)
  • GPU with compute capability >= 9.0 (Hopper)

Compatible GPUs:

Architecture | Compute Capability | Example GPUs
Hopper | 9.0 | H100, H800

Usage:

cargo build --release --features "cuda flash-attn-v3 cudnn"

Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.


metal

Enables Apple Metal GPU acceleration for macOS devices.

Requirements:

  • macOS with Apple Silicon or AMD GPU
  • macOS only (not available on Linux)

Usage:

cargo build --release --features metal

What it enables:

  • GPU tensor operations via Metal
  • PagedAttention on Metal devices (opt-in via --paged-attn)
  • Quantized inference on Apple GPUs

Note: PagedAttention is disabled by default on Metal. Enable with --paged-attn flag.


CPU Acceleration Features

accelerate

Enables Apple’s Accelerate framework for optimized CPU operations on macOS.

Requirements:

  • macOS

Usage:

cargo build --release --features accelerate
# Or combined with Metal:
cargo build --release --features "metal accelerate"

mkl

Enables Intel Math Kernel Library (MKL) for optimized CPU operations.

Requirements:

  • Intel MKL installed
  • Intel CPU recommended (works on AMD but Intel-optimized)

Usage:

cargo build --release --features mkl

Distributed Inference Features

nccl

Enables multi-GPU distributed inference using NVIDIA NCCL (NVIDIA Collective Communications Library). Implements tensor parallelism for splitting large models across multiple GPUs.

Requirements:

  • cuda feature (automatically enabled)
  • Multiple NVIDIA GPUs
  • NCCL library
  • World size must be a power of 2 (1, 2, 4, 8, etc.)

Usage:

cargo build --release --features "cuda nccl"

# Run with specific GPU count
MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct

Environment Variables:

Variable | Description
MISTRALRS_MN_LOCAL_WORLD_SIZE | Number of GPUs to use (defaults to all)
MISTRALRS_NO_NCCL=1 | Disable NCCL and use device mapping instead
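
For example, to fall back to automatic device mapping instead of NCCL (the model ID is illustrative):

MISTRALRS_NO_NCCL=1 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct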

Multi-node setup requires additional environment variables. See NCCL documentation for details.

Note: When NCCL is enabled, automatic device mapping is disabled.


ring

Enables distributed tensor-parallel inference using a TCP-based ring topology. Works across multiple machines without requiring NCCL.

Requirements:

  • World size must be a power of 2 (2, 4, 8, etc.)
  • TCP ports must be open between nodes

Usage:

cargo build --release --features ring

# Configure via JSON file
export RING_CONFIG=path/to/ring_config.json
mistralrs serve -m model-id

Configuration:

Create a JSON configuration file for each process:

{
  "master_ip": "0.0.0.0",
  "master_port": 1234,
  "port": 12345,
  "right_port": 12346,
  "rank": 0,
  "world_size": 2
}

Field | Description
master_ip | IP address of the master node
master_port | Port of the master node
port | Local port for incoming connections
right_port | Port of the right neighbor in the ring
right_ip | IP of the right neighbor (optional, defaults to localhost)
rank | Process rank (0 to world_size-1)
world_size | Total number of processes (must be a power of 2)

See Ring documentation for detailed setup instructions.
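
A minimal sketch of a two-process, single-machine setup (the file names are illustrative; each JSON file follows the schema above, with rank 0 and rank 1 respectively):

# Terminal 1 (rank 0)
RING_CONFIG=ring_rank0.json mistralrs serve -m Qwen/Qwen3-4B

# Terminal 2 (rank 1)
RING_CONFIG=ring_rank1.json mistralrs serve -m Qwen/Qwen3-4B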


Feature Combinations

Hardware | Recommended Features
NVIDIA Ampere+ (RTX 30/40, A100) | cuda cudnn flash-attn
NVIDIA Hopper (H100) | cuda cudnn flash-attn-v3
NVIDIA older GPUs | cuda cudnn
Apple Silicon | metal accelerate
Intel CPU | mkl
Generic CPU | (no features needed)
Multi-GPU NVIDIA | cuda cudnn flash-attn nccl
Multi-node/cross-platform | ring (plus GPU features)

Installation Examples

# NVIDIA GPU with all optimizations
cargo install mistralrs-cli --features "cuda cudnn flash-attn"

# Apple Silicon
cargo install mistralrs-cli --features "metal accelerate"

# Intel CPU
cargo install mistralrs-cli --features "mkl"

# Multi-GPU NVIDIA setup
cargo install mistralrs-cli --features "cuda cudnn flash-attn nccl"

# Build from source with CUDA
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo build --release --features "cuda cudnn flash-attn"

Internal Features

These features are primarily for library development and are not typically used directly:

Feature | Description
pyo3_macros | Python bindings support (used by mistralrs-pyo3)
utoipa | OpenAPI documentation generation

Python Package Features

The Python SDK is distributed as separate packages with features pre-configured:

Package | Equivalent Features
mistralrs-cuda | cuda cudnn flash-attn
mistralrs-metal | metal accelerate
mistralrs-mkl | mkl
mistralrs | CPU only

pip install mistralrs-cuda    # NVIDIA GPUs
pip install mistralrs-metal   # Apple Silicon
pip install mistralrs-mkl     # Intel CPUs
pip install mistralrs         # Generic CPU

Troubleshooting

Diagnosing Issues

Use mistralrs doctor to diagnose your system configuration and verify features are working correctly:

mistralrs doctor

This command checks:

  • Detected hardware (GPUs, CPU features)
  • Installed libraries (CUDA, cuDNN, etc.)
  • Feature compatibility
  • Common configuration issues

Feature not working

  1. Run mistralrs doctor to check system configuration

  2. Verify the feature is enabled in your build:

    cargo build --release --features "your-features" -v
    
  3. Check hardware compatibility (especially for flash-attn)

  4. Ensure required libraries are installed (CUDA, cuDNN, MKL, etc.)

Conflicting features

  • flash-attn and flash-attn-v3 are mutually exclusive
  • metal is macOS-only; don’t use with cuda
  • nccl requires cuda

Build errors

  • CUDA not found: Ensure CUDA toolkit is installed and nvcc is in PATH
  • MKL not found: Install Intel oneAPI or standalone MKL
  • Metal errors on Linux: Remove metal feature (macOS only)

See Troubleshooting for more solutions.

mistralrs CLI Reference

This is the comprehensive CLI reference for mistralrs. The CLI provides commands for interactive mode, HTTP server, builtin UI, quantization, and system diagnostics.

Commands

run - Interactive Mode

Start a model in interactive mode for conversational use.

mistralrs run [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]

Note: MODEL_TYPE is optional and defaults to auto if not specified. This allows a shorter syntax.

Examples:

# Run a text model interactively (shorthand - auto type is implied)
mistralrs run -m Qwen/Qwen3-4B

# Explicit auto type (equivalent to above)
mistralrs run auto -m Qwen/Qwen3-4B

# Run with thinking mode enabled
mistralrs run -m Qwen/Qwen3-4B --enable-thinking

# Run a vision model
mistralrs run -m google/gemma-3-4b-it

Options:

Option | Description
--enable-thinking | Enable thinking mode for models that support it

The run command also accepts all runtime options.


serve - HTTP Server

Start an HTTP server with OpenAI-compatible API endpoints.

mistralrs serve [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]

Note: MODEL_TYPE is optional and defaults to auto if not specified.

Examples:

# Start server on default port 1234 (shorthand)
mistralrs serve -m Qwen/Qwen3-4B

# Explicit auto type (equivalent to above)
mistralrs serve auto -m Qwen/Qwen3-4B

# Start server with web UI
mistralrs serve -m Qwen/Qwen3-4B --ui

# Start server on custom port
mistralrs serve -m Qwen/Qwen3-4B -p 3000

# Start server with MCP support
mistralrs serve -m Qwen/Qwen3-4B --mcp-port 8081

Server Options:

Option | Default | Description
-p, --port <PORT> | 1234 | HTTP server port
--host <HOST> | 0.0.0.0 | Bind address
--ui | disabled | Serve built-in web UI at /ui
--mcp-port <PORT> | none | MCP protocol server port
--mcp-config <PATH> | none | MCP client configuration file

The serve command also accepts all runtime options.


quantize - UQFF Generation

Generate a UQFF (Unified Quantized File Format) file from a model.

mistralrs quantize <MODEL_TYPE> -m <MODEL_ID> --isq <LEVEL> -o <OUTPUT>

Examples:

# Quantize a text model to 4-bit
mistralrs quantize -m Qwen/Qwen3-4B --isq 4 -o qwen3-4b-q4.uqff

# Quantize with Q4_K format
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-q4k.uqff

# Quantize a vision model
mistralrs quantize -m google/gemma-3-4b-it --isq 4 -o gemma3-4b-q4.uqff

# Quantize with imatrix for better quality
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k --imatrix imatrix.dat -o qwen3-4b-q4k.uqff

Quantize Options:

Option | Required | Description
-m, --model-id <ID> | Yes | Model ID or local path
--isq <LEVEL> | Yes | Quantization level (see ISQ Quantization)
-o, --output <PATH> | Yes | Output UQFF file path
--isq-organization <TYPE> | No | ISQ organization strategy: default or moqe
--imatrix <PATH> | No | imatrix file for enhanced quantization
--calibration-file <PATH> | No | Calibration file for imatrix generation

tune - Recommendations

Get quantization and device mapping recommendations for a model. The tune command analyzes your hardware and shows all quantization options with their estimated memory usage, context room, and quality trade-offs.

mistralrs tune [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]

Note: MODEL_TYPE is optional and defaults to auto if not specified, which supports all model types. See details.

Examples:

# Get balanced recommendations (shorthand)
mistralrs tune -m Qwen/Qwen3-4B

# Get quality-focused recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile quality

# Get fast inference recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile fast

# Output as JSON
mistralrs tune -m Qwen/Qwen3-4B --json

# Generate a TOML config file with recommendations
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml

Example Output (CUDA):

Tuning Analysis
===============

Model: Qwen/Qwen3-4B
Profile: Balanced
Backend: cuda
Total VRAM: 24.0 GB

Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant       │ Est. Size │ VRAM % │ Context Room │ Quality       │ Status           │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB   │ 35%    │ 48k          │ Baseline      │ ✅ Fits          │
│ Q8_0        │ 4.50 GB   │ 19%    │ 96k          │ Near-lossless │ 🚀 Recommended   │
│ Q6K         │ 3.70 GB   │ 15%    │ 128k (max)   │ Good          │ ✅ Fits          │
│ Q5K         │ 3.20 GB   │ 13%    │ 128k (max)   │ Good          │ ✅ Fits          │
│ Q4K         │ 2.60 GB   │ 11%    │ 128k (max)   │ Acceptable    │ ✅ Fits          │
│ Q3K         │ 2.00 GB   │ 8%     │ 128k (max)   │ Degraded      │ ✅ Fits          │
│ Q2K         │ 1.50 GB   │ 6%     │ 128k (max)   │ Degraded      │ ✅ Fits          │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘

Recommended Command
-------------------
  mistralrs serve -m Qwen/Qwen3-4B --isq q8_0

[INFO] PagedAttention is available (mode: auto)

Example Output (Metal):

On macOS with Metal, the command recommends Apple Format Quantization (AFQ) types:

Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant       │ Est. Size │ VRAM % │ Context Room │ Quality       │ Status           │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB   │ 53%    │ 24k          │ Baseline      │ ✅ Fits          │
│ AFQ8        │ 4.50 GB   │ 28%    │ 56k          │ Near-lossless │ 🚀 Recommended   │
│ AFQ6        │ 3.70 GB   │ 23%    │ 64k          │ Good          │ ✅ Fits          │
│ AFQ4        │ 2.60 GB   │ 16%    │ 128k (max)   │ Acceptable    │ ✅ Fits          │
│ AFQ3        │ 2.00 GB   │ 13%    │ 128k (max)   │ Degraded      │ ✅ Fits          │
│ AFQ2        │ 1.50 GB   │ 9%     │ 128k (max)   │ Degraded      │ ✅ Fits          │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘

Status Legend:

  • 🚀 Recommended: Best option for your profile and hardware
  • ✅ Fits: Model fits entirely in GPU memory
  • ⚠️ Hybrid: Model requires CPU offloading (slower due to PCIe bottleneck)
  • Too Large: Model doesn’t fit even with CPU offload

Tune Options:

Option | Default | Description
--profile <PROFILE> | balanced | Tuning profile: quality, balanced, or fast
--json | disabled | Output JSON instead of human-readable text
--emit-config <PATH> | none | Emit a TOML config file with recommended settings

doctor - System Diagnostics

Run comprehensive system diagnostics and environment checks. The doctor command helps identify configuration issues and validates your system is ready for inference.

mistralrs doctor [OPTIONS]

Examples:

# Run diagnostics
mistralrs doctor

# Output as JSON
mistralrs doctor --json

Checks Performed:

  • CPU Extensions: AVX, AVX2, AVX-512, FMA support (x86 only; ARM shows NEON)
  • Binary/Hardware Match: Validates CUDA/Metal features match detected hardware
  • GPU Compute Capability: Reports compute version and Flash Attention v2/v3 compatibility
  • Flash Attention Features: Warns if hardware supports FA but binary doesn’t have it enabled
  • Hugging Face Connectivity: Tests connection and token validity using a gated model
  • HF Cache: Verifies cache directory is writable
  • Disk Space: Checks available storage

Options:

Option | Description
--json | Output JSON instead of human-readable text

login - HuggingFace Authentication

Authenticate with HuggingFace Hub by saving your token to the local cache.

mistralrs login [OPTIONS]

Examples:

# Interactive login (prompts for token)
mistralrs login

# Provide token directly
mistralrs login --token hf_xxxxxxxxxxxxx

The token is saved to the standard HuggingFace cache location:

  • Linux/macOS: ~/.cache/huggingface/token
  • Windows: C:\Users\<user>\.cache\huggingface\token

If the HF_HOME environment variable is set, the token is saved to $HF_HOME/token.

Options:

Option | Description
--token <TOKEN> | Provide token directly (non-interactive)

cache - Model Management

Manage the HuggingFace model cache. List cached models or delete specific models.

mistralrs cache <SUBCOMMAND>

Subcommands:

cache list

List all cached models with their sizes and last used times.

mistralrs cache list

Example output:

HuggingFace Model Cache
-----------------------

┌──────────────────────────┬──────────┬─────────────┐
│ Model                    │ Size     │ Last Used   │
├──────────────────────────┼──────────┼─────────────┤
│ Qwen/Qwen3-4B            │ 8.5 GB   │ today       │
│ google/gemma-3-4b-it     │ 6.2 GB   │ 2 days ago  │
│ meta-llama/Llama-3.2-3B  │ 5.8 GB   │ 1 week ago  │
└──────────────────────────┴──────────┴─────────────┘

Total: 3 models, 20.5 GB
Cache directory: /home/user/.cache/huggingface/hub

cache delete

Delete a specific model from the cache.

mistralrs cache delete -m <MODEL_ID>

Examples:

# Delete a specific model
mistralrs cache delete -m Qwen/Qwen3-4B

# Delete a model with organization
mistralrs cache delete -m meta-llama/Llama-3.2-3B

bench - Performance Benchmarking

Run performance benchmarks to measure prefill and decode speeds.

mistralrs bench [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]

Note: MODEL_TYPE is optional and defaults to auto if not specified.

Examples:

# Run default benchmark (512 prompt tokens, 128 generated tokens, 3 iterations)
mistralrs bench -m Qwen/Qwen3-4B

# Custom prompt and generation lengths
mistralrs bench -m Qwen/Qwen3-4B --prompt-len 1024 --gen-len 256

# More iterations for better statistics
mistralrs bench -m Qwen/Qwen3-4B --iterations 10

# With ISQ quantization
mistralrs bench -m Qwen/Qwen3-4B --isq q4k

Example output:

Benchmark Results
=================

Model: Qwen/Qwen3-4B
Iterations: 3

┌────────────────────────┬─────────────────┬─────────────────┐
│ Test                   │ T/s             │ Latency         │
├────────────────────────┼─────────────────┼─────────────────┤
│ Prefill (512 tokens)   │ 2847.3 ± 45.2   │ 179.82 ms (TTFT)│
│ Decode (128 tokens)    │ 87.4 ± 2.1      │ 11.44 ms/T      │
└────────────────────────┴─────────────────┴─────────────────┘
  • T/s: Tokens per second (throughput)
  • Latency: For prefill, shows TTFT (Time To First Token) in milliseconds. For decode, shows ms per token.

Options:

Option | Default | Description
--prompt-len <N> | 512 | Number of tokens in prompt (prefill test)
--gen-len <N> | 128 | Number of tokens to generate (decode test)
--iterations <N> | 3 | Number of benchmark iterations
--warmup <N> | 1 | Number of warmup runs (discarded)

The bench command also accepts all model loading options (ISQ, device mapping, etc.).


from-config - TOML Configuration

Run the CLI from a TOML configuration file. This is the recommended way to run multiple models simultaneously, including models of different types (e.g., text + vision + embedding).

See CLI_CONFIG.md for full TOML configuration format details.

mistralrs from-config --file <PATH>

Example:

mistralrs from-config --file config.toml

Multi-model example (config.toml):

command = "serve"

[server]
port = 1234
ui = true

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

[[models]]
kind = "vision"
model_id = "google/gemma-3-4b-it"

[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"

completions - Shell Completions

Generate shell completions for your shell.

mistralrs completions <SHELL>

Examples:

# Generate bash completions
mistralrs completions bash > ~/.local/share/bash-completion/completions/mistralrs

# Generate zsh completions
mistralrs completions zsh > ~/.zfunc/_mistralrs

# Generate fish completions
mistralrs completions fish > ~/.config/fish/completions/mistralrs.fish

Supported Shells: bash, zsh, fish, elvish, powershell


Model Types

auto

Auto-detect model type. This is the recommended option for most models and is used by default when you omit the explicit model type.

mistralrs run -m Qwen/Qwen3-4B
mistralrs serve -m Qwen/Qwen3-4B

The auto type supports text, vision, and other model types through automatic detection.

text

Explicit text generation model configuration.

mistralrs run text -m Qwen/Qwen3-4B
mistralrs serve text -m Qwen/Qwen3-4B

vision

Vision-language models that can process images and text.

mistralrs run vision -m google/gemma-3-4b-it
mistralrs serve vision -m google/gemma-3-4b-it

Vision Options:

Option | Description
--max-edge <SIZE> | Maximum edge length for image resizing (aspect ratio preserved)
--max-num-images <N> | Maximum number of images per request
--max-image-length <SIZE> | Maximum image dimension for device mapping

diffusion

Image generation models using diffusion.

mistralrs run diffusion -m black-forest-labs/FLUX.1-schnell
mistralrs serve diffusion -m black-forest-labs/FLUX.1-schnell

speech

Speech synthesis models.

mistralrs run speech -m nari-labs/Dia-1.6B
mistralrs serve speech -m nari-labs/Dia-1.6B

embedding

Text embedding models. These do not support interactive mode but can be used with the HTTP server.

mistralrs serve embedding -m google/embeddinggemma-300m
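
Once the server is running, embeddings can be requested over HTTP. A minimal sketch, assuming the OpenAI-compatible /v1/embeddings route and the default port:

curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "input": ["hello world"]}'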

Features

ISQ Quantization

In-situ quantization (ISQ) reduces model memory usage by quantizing weights at load time. See details about ISQ here.

Usage:

# Simple bit-width quantization
mistralrs run -m Qwen/Qwen3-4B --isq 4
mistralrs run -m Qwen/Qwen3-4B --isq 8

# GGML-style quantization
mistralrs run -m Qwen/Qwen3-4B --isq q4_0
mistralrs run -m Qwen/Qwen3-4B --isq q4_1
mistralrs run -m Qwen/Qwen3-4B --isq q4k
mistralrs run -m Qwen/Qwen3-4B --isq q5k
mistralrs run -m Qwen/Qwen3-4B --isq q6k

ISQ Organization:

# Use MOQE organization for potentially better quality
mistralrs run -m Qwen/Qwen3-4B --isq q4k --isq-organization moqe

UQFF Files

UQFF (Unified Quantized File Format) provides pre-quantized model files for faster loading.

Generate a UQFF file:

mistralrs quantize auto -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-q4k.uqff

Load from UQFF:

mistralrs run -m Qwen/Qwen3-4B --from-uqff qwen3-4b-q4k.uqff

Multiple UQFF files (semicolon-separated):

mistralrs run -m Qwen/Qwen3-4B --from-uqff "part1.uqff;part2.uqff"

PagedAttention

PagedAttention enables efficient memory management for the KV cache. It is automatically enabled on CUDA and disabled on Metal/CPU by default.

Control PagedAttention:

# Auto mode (default): enabled on CUDA, disabled on Metal/CPU
mistralrs serve -m Qwen/Qwen3-4B --paged-attn auto

# Force enable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn on

# Force disable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn off

Memory allocation options (mutually exclusive):

# Allocate for specific context length (recommended)
mistralrs serve -m Qwen/Qwen3-4B --pa-context-len 8192

# Allocate specific GPU memory in MB
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-mb 4096

# Allocate fraction of GPU memory (0.0-1.0)
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-fraction 0.8

Additional options:

Option | Description
--pa-block-size <SIZE> | Tokens per block (default: 32 on CUDA)
--pa-cache-type <TYPE> | KV cache quantization type (default: auto)

Device Mapping

Control how model layers are distributed across devices.

Automatic mapping:

# Use defaults (automatic)
mistralrs run -m Qwen/Qwen3-4B

Manual layer assignment:

# Assign 10 layers to GPU 0, 20 layers to GPU 1
mistralrs run -m Qwen/Qwen3-4B -n "0:10;1:20"

# Equivalent long form
mistralrs run -m Qwen/Qwen3-4B --device-layers "0:10;1:20"

CPU-only execution:

mistralrs run -m Qwen/Qwen3-4B --cpu

Topology file:

mistralrs run -m Qwen/Qwen3-4B --topology topology.yaml

Custom HuggingFace cache:

mistralrs run -m Qwen/Qwen3-4B --hf-cache /path/to/cache

Device mapping options:

Option | Default | Description
-n, --device-layers <MAPPING> | auto | Device layer mapping (format: ORD:NUM;...)
--topology <PATH> | none | Topology YAML file for device mapping
--hf-cache <PATH> | none | Custom HuggingFace cache directory
--cpu | disabled | Force CPU-only execution
--max-seq-len <LEN> | 4096 | Max sequence length for automatic device mapping
--max-batch-size <SIZE> | 1 | Max batch size for automatic device mapping
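
For example, to size the automatic mapper for longer prompts and larger batches (the values are illustrative):

mistralrs run -m Qwen/Qwen3-4B --max-seq-len 8192 --max-batch-size 4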

LoRA and X-LoRA

Apply LoRA or X-LoRA adapters to models.

LoRA:

# Single LoRA adapter
mistralrs run -m Qwen/Qwen3-4B --lora my-lora-adapter

# Multiple LoRA adapters (semicolon-separated)
mistralrs run -m Qwen/Qwen3-4B --lora "adapter1;adapter2"

X-LoRA:

# X-LoRA adapter with ordering file
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json

# With target non-granular index
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json --tgt-non-granular-index 2

Chat Templates

Override the model’s default chat template.

Use a template file:

# JSON template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.json

# Jinja template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.jinja

Explicit Jinja override:

mistralrs run -m Qwen/Qwen3-4B --jinja-explicit custom.jinja

Web Search

Enable web search capabilities (requires an embedding model).

# Enable search with default embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search

# Specify embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search --search-embedding-model embedding-gemma

Thinking Mode

Enable thinking/reasoning mode for models that support it (like DeepSeek, Qwen3).

mistralrs run -m Qwen/Qwen3-4B --enable-thinking

In interactive mode, thinking content is displayed in gray text before the final response.


Global Options

These options apply to all commands.

Option | Default | Description
--seed <SEED> | none | Random seed for reproducibility
-l, --log <PATH> | none | Log all requests and responses to file
--token-source <SOURCE> | cache | HuggingFace authentication token source
-V, --version | N/A | Print version information and exit
-h, --help | N/A | Print help message (use with any subcommand)

Token source formats:

  • cache - Use cached HuggingFace token (default)
  • literal:<token> - Use literal token value
  • env:<var> - Read token from environment variable
  • path:<file> - Read token from file
  • none - No authentication

Examples:

# Set random seed
mistralrs run -m Qwen/Qwen3-4B --seed 42

# Enable logging
mistralrs run -m Qwen/Qwen3-4B --log requests.log

# Use token from environment variable
mistralrs run -m meta-llama/Llama-3.2-3B-Instruct --token-source env:HF_TOKEN

Runtime Options

These options are available for both run and serve commands.

Option | Default | Description
--max-seqs <N> | 32 | Maximum concurrent sequences
--no-kv-cache | disabled | Disable KV cache entirely
--prefix-cache-n <N> | 16 | Number of prefix caches to hold (0 to disable)
-c, --chat-template <PATH> | none | Custom chat template file (.json or .jinja)
-j, --jinja-explicit <PATH> | none | Explicit JINJA template override
--enable-search | disabled | Enable web search
--search-embedding-model <MODEL> | none | Embedding model for search
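
For example, a server tuned for heavier concurrency (the values are illustrative):

mistralrs serve -m Qwen/Qwen3-4B --max-seqs 64 --prefix-cache-n 32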

Model Source Options

These options are common across model types.

Option | Description
-m, --model-id <ID> | HuggingFace model ID or local path (required)
-t, --tokenizer <PATH> | Path to local tokenizer.json file
-a, --arch <ARCH> | Model architecture (auto-detected if not specified)
--dtype <TYPE> | Model data type (default: auto)

Format Options

For loading quantized models.

Option | Description
--format <FORMAT> | Model format: plain, gguf, or ggml (auto-detected)
-f, --quantized-file <FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated)
--tok-model-id <ID> | Model ID for tokenizer when using quantized format
--gqa <VALUE> | GQA value for GGML models (default: 1)

Examples:

# Load a GGUF model
mistralrs run -m Qwen/Qwen3-4B --format gguf -f model.gguf

# Multiple GGUF files
mistralrs run -m Qwen/Qwen3-4B --format gguf -f "model-part1.gguf;model-part2.gguf"

Interactive Commands

When running in interactive mode (mistralrs run), the following commands are available:

Command | Description
\help | Display help message
\exit | Quit interactive mode
\system <message> | Add a system message without running the model
\clear | Clear the chat history
\temperature <float> | Set sampling temperature (0.0 to 2.0)
\topk <int> | Set top-k sampling value (>0)
\topp <float> | Set top-p sampling value (0.0 to 1.0)

Examples:

> \system Always respond as a pirate.
> \temperature 0.7
> \topk 50
> Hello!
Ahoy there, matey! What brings ye to these waters?
> \clear
> \exit

Vision Model Interactive Mode:

For vision models, you can include images in your prompts by specifying file paths or URLs:

> Describe this image: /path/to/image.jpg
> Compare these images: image1.png image2.png
> Describe the image and transcribe the audio: photo.jpg recording.mp3

Note: The CLI automatically detects paths to supported image and audio files within your prompt. You do not need special syntax; simply paste the absolute or relative path to the file.

Supported image formats: PNG, JPEG, BMP, GIF, WebP
Supported audio formats: WAV, MP3, FLAC, OGG

mistralrs-cli TOML Config

mistralrs-cli can run entirely from a single TOML configuration file. This config supports multiple models and mirrors the CLI options.

Usage

mistralrs from-config --file path/to/config.toml

Quick Example

command = "serve"

[server]
port = 1234
ui = true

[runtime]
max_seqs = 32

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

[models.quantization]
in_situ_quant = "q4k"

Complete Reference

Top-Level Options

Option | Commands | Description
command | all | Required. Either "serve" or "run"
enable_thinking | run | Enable thinking mode (default: false)
default_model_id | serve | Default model ID for API requests (must match a model_id in [[models]])

[global] Section

Global options that apply to the entire run.

Option | Default | Description
seed | none | Random seed for reproducibility
log | none | Log all requests/responses to this file path
token_source | "cache" | HuggingFace auth: "cache", "none", "literal:<token>", "env:<var>", "path:<file>"

[server] Section (serve only)

HTTP server configuration.

Option | Default | Description
port | 1234 | HTTP server port
host | "0.0.0.0" | Bind address
ui | false | Serve built-in web UI at /ui
mcp_port | none | MCP protocol server port (enables MCP if set)
mcp_config | none | MCP client configuration file path

[runtime] Section

Runtime inference options.

Option | Default | Description
max_seqs | 32 | Maximum concurrent sequences
no_kv_cache | false | Disable KV cache entirely
prefix_cache_n | 16 | Number of prefix caches to hold (0 to disable)
chat_template | none | Custom chat template file (.json or .jinja)
jinja_explicit | none | Explicit JINJA template override
enable_search | false | Enable web search
search_embedding_model | none | Embedding model for search (e.g., "embedding-gemma")

[paged_attn] Section

PagedAttention configuration.

Option | Default | Description
mode | "auto" | "auto" (CUDA on, Metal off), "on", or "off"
context_len | none | Allocate KV cache for this context length
memory_mb | none | GPU memory to allocate in MB (conflicts with context_len)
memory_fraction | none | GPU memory utilization 0.0-1.0 (conflicts with the above)
block_size | 32 | Tokens per block
cache_type | "auto" | KV cache type

Note: If none of context_len, memory_mb, or memory_fraction is specified, PagedAttention defaults to 90% of available VRAM. These three options are mutually exclusive.
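
As a small sketch, reserving KV cache for an 8k context instead of the 90% default (the value is illustrative):

[paged_attn]
mode = "on"
context_len = 8192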

[[models]] Section

Define one or more models. Each [[models]] entry creates a new model.

Top-Level Model Options

Option | Required | Description
kind | yes | Model type: "auto", "text", "vision", "diffusion", "speech", "embedding"
model_id | yes | HuggingFace model ID or local path
tokenizer | no | Path to local tokenizer.json
arch | no | Model architecture (auto-detected if not specified)
dtype | no | Data type: "auto" (default), "f16", "bf16", "f32"
chat_template | no | Per-model chat template override
jinja_explicit | no | Per-model JINJA template override

[models.format] - Model Format

Option | Default | Description
format | auto | "plain" (safetensors), "gguf", or "ggml"
quantized_file | none | Quantized filename(s) for GGUF/GGML (semicolon-separated)
tok_model_id | none | Model ID for tokenizer when using quantized format
gqa | 1 | GQA value for GGML models

[models.adapter] - LoRA/X-LoRA

Option | Description
lora | LoRA adapter ID(s), semicolon-separated
xlora | X-LoRA adapter ID (conflicts with lora)
xlora_order | X-LoRA ordering JSON file (requires xlora)
tgt_non_granular_index | Target non-granular index for X-LoRA

[models.quantization] - ISQ/UQFF

Option | Description
in_situ_quant | ISQ level: "4", "8", "q4_0", "q4k", "q6k", etc.
from_uqff | UQFF file(s) to load (semicolon-separated)
isq_organization | ISQ strategy: "default" or "moqe"
imatrix | imatrix file for enhanced quantization
calibration_file | Calibration file for imatrix generation

[models.device] - Device Mapping

Option | Default | Description
cpu | false | Force CPU-only (must be consistent across all models)
device_layers | auto | Layer mapping, e.g., ["0:10", "1:20"] (format: ORD:NUM)
topology | none | Topology YAML file
hf_cache | none | Custom HuggingFace cache directory
max_seq_len | 4096 | Max sequence length for auto device mapping
max_batch_size | 1 | Max batch size for auto device mapping

[models.vision] - Vision Options

Option | Description
max_edge | Maximum edge length for image resizing
max_num_images | Maximum images per request
max_image_length | Maximum image dimension for device mapping

Full Examples

Multi-Model Server with UI

command = "serve"

[global]
seed = 42

[server]
host = "0.0.0.0"
port = 1234
ui = true

[runtime]
max_seqs = 32
enable_search = true
search_embedding_model = "embedding-gemma"

[paged_attn]
mode = "auto"

[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
dtype = "auto"

[models.quantization]
in_situ_quant = "q4k"

[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"

[models.vision]
max_num_images = 4

[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"

Interactive Mode with Thinking

command = "run"
enable_thinking = true

[runtime]
max_seqs = 16

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

GGUF Model

command = "serve"

[server]
port = 1234

[[models]]
kind = "text"
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"

[models.format]
format = "gguf"
quantized_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"

Device Layer Mapping

command = "serve"

[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.1-70B-Instruct"

[models.device]
device_layers = ["0:40", "1:40"]

[models.quantization]
in_situ_quant = "q4k"

Notes

  • cpu must be consistent across all models if specified
  • default_model_id (serve only) must match a model_id in [[models]]
  • search_embedding_model requires enable_search = true
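
For instance, a minimal two-model serve config that pins the default model (the model IDs are illustrative):

command = "serve"
default_model_id = "Qwen/Qwen3-4B"

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

[[models]]
kind = "vision"
model_id = "google/gemma-3-4b-it"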

Troubleshooting

Common issues and solutions for mistral.rs.

Debug Mode

Enable debug mode for more information:

MISTRALRS_DEBUG=1 mistralrs run -m <model>

Debug mode causes:

  • If loading a GGUF or GGML model, outputs a file containing the names, shapes, and types of each tensor:
    • mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt
  • Increased logging verbosity

System Diagnostics

Run the built-in diagnostics tool:

mistralrs doctor

This checks your system configuration and reports any issues.

Common Issues

CUDA Issues

Setting the CUDA compiler path:

  • Set the NVCC_CCBIN environment variable during build

Error: recompile with -fPIE:

  • Some Linux distributions require compiling with -fPIE
  • Set during build: CUDA_NVCC_FLAGS=-fPIE cargo build --release --features cuda

Error: CUDA_ERROR_NOT_FOUND or symbol not found:

  • For non-quantized models, specify the data type to load and run in
  • Use one of f32, f16, bf16 or auto (auto chooses based on device)
  • Example: mistralrs run -m <model> -d auto

Minimum CUDA compute capability:

  • The minimum supported CUDA compute cap is 5.3
  • Set a specific compute cap with: CUDA_COMPUTE_CAP=80 cargo build --release --features cuda

Metal Issues (macOS)

Metal not found (error: unable to find utility “metal”):

  1. Install Xcode:

    xcode-select --install
    
  2. Set the active developer directory:

    sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
    

error: cannot execute tool ‘metal’ due to missing Metal toolchain

  1. Install Metal Toolchain:
    xcodebuild -downloadComponent MetalToolchain
    

Disabling Metal kernel precompilation:

  • By default, Metal kernels are precompiled during build time for better performance
  • To skip precompilation (useful for CI or when Metal is not needed):
    MISTRALRS_METAL_PRECOMPILE=0 cargo build --release --features metal
    

Memory Issues

Disabling mmap loading:

  • Set MISTRALRS_NO_MMAP=1 to disable memory-mapped file loading
  • Forces all tensor data into memory
  • Useful if you’re seeing mmap-related errors

Out of memory errors:

  • Try using quantization: --isq q4k or --isq q8_0
  • Use device mapping to offload layers: -n "0:16;cpu:16"
  • Reduce context length with PagedAttention: --pa-context-len 4096

Model Loading Issues

Model type not auto-detected:

  • If auto-detection fails, please raise an issue
  • You can manually specify the architecture if needed

Chat template issues:

  • Templates are usually auto-detected
  • Override with: -c /path/to/template.jinja
  • See Chat Templates for details

Getting Help

If you’re still stuck, please raise an issue on the GitHub repository.

When reporting issues, please include:

  1. Output of mistralrs doctor
  2. Full error message
  3. Command you ran
  4. Hardware (GPU model, OS)

mistralrs Python SDK

Documentation for the mistralrs Python package.

Installation: See PYTHON_INSTALLATION.md for installation instructions.

Table of contents

  • Full API reference: here
  • Model configuration (Which enum): here
  • Multi-model support: here
  • MCP Client Configuration: here
  • Example: here
  • Embeddings example: here

Which

Each *_model_id may be a HF hub repo or a local path. For quantized GGUF models, a list is accepted if multiple files must be specified.
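
A minimal sketch of the list form via Which.GGUF (the repo and file names are illustrative):

from mistralrs import Runner, Which

runner = Runner(
    which=Which.GGUF(
        quantized_model_id="author/model-repo-GGUF",
        quantized_filename=[
            "model-part1-of-2.gguf",
            "model-part2-of-2.gguf",
        ],
        tok_model_id="author/official-tokenizer",
    )
)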

Architecture for plain models

If you do not specify the architecture, an attempt will be made to use the model’s config. If this fails, please raise an issue.

  • Mistral
  • Gemma
  • Mixtral
  • Llama
  • Phi2
  • Phi3
  • Qwen2
  • Gemma2
  • GLM4
  • Starcoder2
  • Phi3_5MoE
  • DeepseekV2
  • DeepseekV3
  • Qwen3
  • Qwen3Moe
  • SmolLm3
  • GraniteMoeHybrid
  • GptOss

ISQ Organization

  • Default
  • MoQE: if applicable, only quantize MoE experts. https://arxiv.org/abs/2310.02410

Architecture for vision models

  • Phi3V
  • Idefics2
  • LLaVaNext
  • LLaVa
  • VLlama
  • Qwen2VL
  • Idefics3
  • MiniCpmO
  • Phi4MM
  • Qwen2_5VL
  • Gemma3
  • Mistral3
  • Llama4
  • Gemma3n
  • Qwen3VL

Architecture for diffusion models

  • Flux
  • FluxOffloaded

Architecture for speech models

  • Dia

Architecture for embedding models

  • EmbeddingGemma
  • Qwen3Embedding

Note: from_uqff specifies a UQFF path to load from. If provided, this takes precedence over applying ISQ. Specify multiple files using a semicolon delimiter (;).

Note: enable_thinking enables thinking for models that support the configuration.

Note: truncate_sequence=True trims prompts that would otherwise exceed the model’s maximum context length. Leave it False to receive a validation error instead.

class Which(Enum):
    @dataclass
    class Plain:
        model_id: str
        arch: Architecture | None = None
        tokenizer_json: str | None = None
        topology: str | None = None
        organization: str | None = None
        from_uqff: str | list[str] | None = None
        write_uqff: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)
        calibration_file: str | None = None
        imatrix: str | None = None
        hf_cache_path: str | None = None

    @dataclass
    class XLora:
        xlora_model_id: str
        order: str
        arch: Architecture | None = None
        model_id: str | None = None
        tokenizer_json: str | None = None
        tgt_non_granular_index: int | None = None
        topology: str | None = None
        from_uqff: str | list[str] | None = None
        write_uqff: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)
        hf_cache_path: str | None = None

    @dataclass
    class Lora:
        adapter_model_id: str
        arch: Architecture | None = None
        model_id: str | None = None
        tokenizer_json: str | None = None
        topology: str | None = None
        from_uqff: str | list[str] | None = None
        write_uqff: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)
        hf_cache_path: str | None = None

    @dataclass
    class GGUF:
        quantized_model_id: str
        quantized_filename: str | list[str]
        tok_model_id: str | None = None
        topology: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)

    @dataclass
    class XLoraGGUF:
        quantized_model_id: str
        quantized_filename: str | list[str]
        xlora_model_id: str
        order: str
        tok_model_id: str | None = None
        tgt_non_granular_index: int | None = None
        topology: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)

    @dataclass
    class LoraGGUF:
        quantized_model_id: str
        quantized_filename: str | list[str]
        adapters_model_id: str
        order: str
        tok_model_id: str | None = None
        topology: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)

    @dataclass
    class GGML:
        quantized_model_id: str
        quantized_filename: str
        tok_model_id: str | None = None
        tokenizer_json: str | None = None
        gqa: int | None = None
        topology: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)

    @dataclass
    class XLoraGGML:
        quantized_model_id: str
        quantized_filename: str
        xlora_model_id: str
        order: str
        tok_model_id: str | None = None
        tgt_non_granular_index: int | None = None
        tokenizer_json: str | None = None
        gqa: int | None = None
        topology: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)

    @dataclass
    class LoraGGML:
        quantized_model_id: str
        quantized_filename: str
        adapters_model_id: str
        order: str
        tok_model_id: str | None = None
        tokenizer_json: str | None = None
        topology: str | None = None
        dtype: ModelDType = ModelDType.Auto
        auto_map_params: TextAutoMapParams | None = (None,)

    @dataclass
    class Embedding:
        model_id: str
        arch: EmbeddingArchitecture | None = None
        tokenizer_json: str | None = None
        topology: str | None = None
        from_uqff: str | list[str] | None = None
        write_uqff: str | None = None
        dtype: ModelDType = ModelDType.Auto
        hf_cache_path: str | None = None

    @dataclass
    class VisionPlain:
        model_id: str
        arch: VisionArchitecture
        tokenizer_json: str | None = None
        topology: str | None = None
        from_uqff: str | list[str] | None = None
        write_uqff: str | None = None
        dtype: ModelDType = ModelDType.Auto
        max_edge: int | None = None
        auto_map_params: VisionAutoMapParams | None = (None,)
        calibration_file: str | None = None
        imatrix: str | None = None
        hf_cache_path: str | None = None

    @dataclass
    class DiffusionPlain:
        model_id: str
        arch: DiffusionArchitecture
        dtype: ModelDType = ModelDType.Auto

    @dataclass
    class Speech:
        model_id: str
        arch: DiffusionArchitecture
        dac_model_id: str | None = None
        dtype: ModelDType = ModelDType.Auto

Multi-model Support

The mistralrs Python SDK supports running multiple models using the Runner class with the model_id parameter. All request methods accept an optional model_id to target a specific model. When model_id is None or omitted, the default model is used. If aliases are configured (for example via the server config or Rust MultiModelBuilder), list_models() will return those aliases and you can pass them in requests; canonical pipeline names remain accepted.

Basic Usage with model_id

import mistralrs

# Create a Runner with a vision model (Gemma 3 4B)
runner = mistralrs.Runner(
    which=mistralrs.Which.VisionPlain(
        model_id="google/gemma-3-4b-it",
        arch=mistralrs.VisionArchitecture.Gemma3,
    ),
    in_situ_quant="Q4K",
)

# List available models (model IDs are registered IDs, aliases if configured)
models = runner.list_models()
print(f"Available models: {models}")  # ["google/gemma-3-4b-it"]

# Send request to specific model using model_id parameter
response = runner.send_chat_completion_request(
    mistralrs.ChatCompletionRequest(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=100
    ),
    model_id="google/gemma-3-4b-it"  # Target specific model
)

# Send request without model_id (uses default model)
response = runner.send_chat_completion_request(
    mistralrs.ChatCompletionRequest(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=100
    )
)

Multi-model Management

# List available models
models = runner.list_models()
print(f"Available models: {models}")

# Get/set default model
default_model = runner.get_default_model_id()
print(f"Default model: {default_model}")

# Change default model (model must be loaded)
runner.set_default_model_id("google/gemma-3-4b-it")

# List models with their status
models_with_status = runner.list_models_with_status()
for model_id, status in models_with_status:
    print(f"{model_id}: {status}")  # status is "loaded", "unloaded", or "reloading"

Model Unloading and Reloading

You can unload models to free memory and reload them on demand:

model_id = "google/gemma-3-4b-it"

# Check if model is loaded
is_loaded = runner.is_model_loaded(model_id)
print(f"Model loaded: {is_loaded}")

# List models with their status
models_with_status = runner.list_models_with_status()
for mid, status in models_with_status:
    print(f"{mid}: {status}")

# Unload a model to free memory (preserves configuration for reload)
runner.unload_model(model_id)

# Check status after unload
is_loaded = runner.is_model_loaded(model_id)
print(f"Model loaded after unload: {is_loaded}")  # False

# Manually reload a model
runner.reload_model(model_id)

# Auto-reload: sending a request to an unloaded model will reload it automatically
response = runner.send_chat_completion_request(
    mistralrs.ChatCompletionRequest(
        messages=[{"role": "user", "content": "Hello!"}]
    ),
    model_id=model_id  # Will auto-reload if unloaded
)

Request Methods with model_id

All request methods accept an optional model_id parameter:

# Chat completion
response = runner.send_chat_completion_request(request, model_id="model-id")

# Completion
response = runner.send_completion_request(request, model_id="model-id")

# Embeddings
embeddings = runner.send_embedding_request(request, model_id="model-id")

# Image generation
image = runner.generate_image(prompt, response_format, model_id="model-id")

# Audio generation
audio = runner.generate_audio(prompt, model_id="model-id")

# Tokenization
tokens = runner.tokenize_text(text, add_special_tokens=True, model_id="model-id")
text = runner.detokenize_text(tokens, skip_special_tokens=True, model_id="model-id")

When model_id is None or omitted, the default model is used.

Server Configuration

For server-based multi-model deployment, see the multi-model documentation.

MCP Client

The mistralrs Python SDK now supports Model Context Protocol (MCP) clients, enabling AI assistants to connect to and interact with external tools and resources through standardized server interfaces.

MCP Server Configuration

Configure MCP servers using McpServerConfigPy:

# HTTP-based MCP server with Bearer token authentication
http_server = mistralrs.McpServerConfigPy(
    id="web_search",
    name="Web Search MCP",
    source=mistralrs.McpServerSourcePy.Http(
        url="https://api.example.com/mcp",
        timeout_secs=30,
        headers={"X-API-Version": "v1"}  # Optional additional headers
    ),
    enabled=True,
    tool_prefix="web",  # Prefixes tool names to avoid conflicts
    resources=None,
    bearer_token="your-api-token"  # Automatically added as Authorization header
)

# Process-based MCP server for local tools
process_server = mistralrs.McpServerConfigPy(
    id="filesystem",
    name="Filesystem MCP",
    source=mistralrs.McpServerSourcePy.Process(
        command="mcp-server-filesystem",
        args=["--root", "/tmp"],
        work_dir=None,
        env={"MCP_LOG_LEVEL": "debug"}  # Optional environment variables
    ),
    enabled=True,
    tool_prefix="fs",
    resources=["file://**"],  # Resource patterns this client is interested in
    bearer_token=None  # Process servers typically don't need authentication
)

# WebSocket-based MCP server for real-time communication
websocket_server = mistralrs.McpServerConfigPy(
    id="realtime_data",
    name="Real-time Data MCP",
    source=mistralrs.McpServerSourcePy.WebSocket(
        url="wss://realtime.example.com/mcp",
        timeout_secs=60,
        headers=None
    ),
    enabled=True,
    tool_prefix="rt",
    resources=None,
    bearer_token="websocket-token"  # WebSocket Bearer token support
)

MCP Client Configuration

Configure the MCP client using McpClientConfigPy:

mcp_config = mistralrs.McpClientConfigPy(
    servers=[http_server, process_server, websocket_server],
    auto_register_tools=True,  # Automatically discover and register tools
    tool_timeout_secs=30,      # Timeout for individual tool calls
    max_concurrent_calls=5     # Maximum concurrent tool calls across all servers
)

Integration with Runner

Pass the MCP client configuration to the Runner:

runner = mistralrs.Runner(
    which=mistralrs.Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    ),
    mcp_client_config=mcp_config  # MCP tools automatically registered
)

When auto_register_tools=True, the MCP client will:

  1. Connect to all enabled MCP servers
  2. Discover available tools from each server
  3. Register them for automatic tool calling with appropriate prefixes
  4. Make them available during model conversations
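
For example, once the Runner above is constructed, an ordinary chat request can transparently call the registered MCP tools (the prompt is illustrative):

res = runner.send_chat_completion_request(
    mistralrs.ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Search the web for the latest Rust release notes."}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)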

MCP Transport Types

  • HTTP Transport: Best for public APIs, RESTful services, servers behind load balancers. Supports SSE (Server-Sent Events) and standard HTTP semantics.

  • Process Transport: Best for local tools, development servers, sandboxed environments. Provides process isolation with no network overhead.

  • WebSocket Transport: Best for interactive applications, real-time data, low-latency requirements. Supports persistent connections and server-initiated notifications.

Authentication

  • Bearer Tokens: Automatically added as Authorization: Bearer <token> header for HTTP and WebSocket connections
  • Custom Headers: Additional headers can be specified for API keys, versioning, etc.
  • Process Servers: Typically don’t require authentication as they run locally

Example

from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role":"user", "content":"Tell me a story about the Rust type system."}],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Embeddings example

from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which

runner = Runner(
    which=Which.Embedding(
        model_id="google/embeddinggemma-300m",
        arch=EmbeddingArchitecture.EmbeddingGemma,
    )
)

embeddings = runner.send_embedding_request(
    EmbeddingRequest(
        input=[
            "task: query | text: superconductors",
            "task: query | text: graphene",
        ],
        truncate_sequence=True,
    )
)

print(len(embeddings), len(embeddings[0]))

# Swap the model_id and arch below to load Qwen/Qwen3-Embedding-0.6B instead:
# Runner(
#     which=Which.Embedding(
#         model_id="Qwen/Qwen3-Embedding-0.6B",
#         arch=EmbeddingArchitecture.Qwen3Embedding,
#     )
# )

Python SDK Installation

Pre-built wheels are available for common platforms. Choose the package that matches your hardware:

Hardware | Install Command
Recommended (auto-optimized) | pip install mistralrs
NVIDIA GPUs (CUDA) | pip install mistralrs-cuda
Apple Silicon (Metal) | pip install mistralrs-metal
Apple Accelerate | pip install mistralrs-accelerate
Intel CPUs (MKL) | pip install mistralrs-mkl

Platform-Specific Optimizations

The mistralrs base package includes platform-specific optimizations:

  • macOS Apple Silicon: Metal GPU support built-in
  • Linux/Windows x86_64: Intel MKL optimizations built-in
  • Linux aarch64: CPU-only (use mistralrs-cuda for GPU support)

All packages install the mistralrs Python module. The package suffix controls which accelerator features are enabled.

Supported Platforms

Package              | Linux x86_64 | Linux aarch64 | Windows x86_64 | macOS aarch64
mistralrs            | MKL          | CPU           | MKL            | Metal
mistralrs-cuda       | CUDA         | CUDA          | CUDA           | -
mistralrs-metal      | -            | -             | -              | Metal
mistralrs-accelerate | -            | -             | -              | Accelerate
mistralrs-mkl        | MKL          | -             | MKL            | -

Python version: 3.10+ (wheels use abi3 for forward compatibility)

Windows Requirements

It is recommended to use WSL2 on Windows machines.

On Windows, additional runtime dependencies may be required:

# Example: Install with CUDA support
pip install mistralrs-cuda -v

Build from Source

Building from source gives you access to the latest features and allows customization of build options.

Prerequisites

  1. Install system packages:

    Ubuntu/Debian:

    sudo apt install libssl-dev pkg-config
    

    macOS:

    brew install openssl pkg-config
    
  2. Install Rust from https://rustup.rs/:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    source $HOME/.cargo/env
    
  3. (Optional) Set up HuggingFace authentication for gated models:

    mkdir -p ~/.cache/huggingface
    echo "YOUR_HF_TOKEN" > ~/.cache/huggingface/token
    

    Or use huggingface-cli login.

Build Steps

  1. Clone the repository:

    git clone https://github.com/EricLBuehler/mistral.rs.git
    cd mistral.rs/mistralrs-pyo3
    
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # Linux/macOS
    # or: .venv\Scripts\activate  # Windows
    
  3. Install maturin (Rust + Python build tool):

    pip install "maturin[patchelf]"
    
  4. Build and install:

    maturin develop -r --features <your-features>
    

Feature Flags

Feature       | Description
cuda          | NVIDIA GPU support
flash-attn    | Flash Attention (CUDA, Ampere+)
flash-attn-v3 | Flash Attention v3 (CUDA, Hopper)
cudnn         | cuDNN optimizations
metal         | Apple Silicon GPU (macOS only)
accelerate    | Apple Accelerate framework
mkl           | Intel MKL

Example with CUDA and Flash Attention:

maturin develop -r --features "cuda flash-attn cudnn"

Verify Installation

import mistralrs
print(mistralrs.__version__)

Quick test:

from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-0.6B"),
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=50,
    )
)
print(response.choices[0].message.content)

Next Steps

HTTP server

Mistral.rs provides a lightweight OpenAI-compatible HTTP server based on axum. The request and response formats are supersets of the OpenAI API.

The API consists of the following endpoints. They can be viewed in your browser interactively by going to http://localhost:<port>/docs.

ℹ️ Besides the HTTP endpoints described below, mistralrs serve can also expose the same functionality via the MCP protocol. Enable it with --mcp-port <port> and see MCP/server.md for details.

Additional object keys

To support additional features, we have extended the completion and chat completion request objects. Both accept the same additional keys (an example request using several of them follows this list):

  • top_k: int | null. If non-null, it is only relevant if positive.
  • grammar: {"type" : "regex" | "lark" | "json_schema" | "llguidance", "value": string} or null. Grammar to use. This is mutually exclusive with the OpenAI-compatible response_format.
  • min_p: float | null. If non-null, it is only relevant if 0 <= min_p <= 1.
  • enable_thinking: bool, defaults to false. Enable thinking for models that support it.
  • truncate_sequence: bool | null. When true, requests that exceed the model context length will be truncated instead of rejected; otherwise the server returns a validation error. Embedding requests truncate tokens at the end of the prompt, while chat/completion requests truncate tokens at the start of the prompt.
  • repetition_penalty: float | null. Penalty for repeating tokens. This is distinct from frequency_penalty and presence_penalty - it applies a direct multiplicative penalty to repeated token logits.
  • web_search_options: object | null. Enable web search integration (see WEB_SEARCH.md). Contains optional fields: search_context_size (“low”, “medium”, “high”), user_location (object with location info), search_description (override search tool description), extract_description (override extraction tool description).
  • reasoning_effort: string | null. For Harmony-format models (like GPT-OSS), controls the depth of reasoning: "low", "medium", or "high".
  • dry_multiplier: float | null. DRY (Don’t Repeat Yourself) sampling multiplier. Controls the strength of the anti-repetition penalty.
  • dry_base: float | null. DRY sampling base value.
  • dry_allowed_length: int | null. DRY sampling allowed length before penalty applies.
  • dry_sequence_breakers: array of strings | null. Tokens that reset the DRY penalty sequence.
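
As an illustration, these extension keys can be sent from the Python openai client via extra_body (a minimal sketch; the key values and port are illustrative):

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List three Rust crates for building HTTP servers."}],
    extra_body={
        # Extension keys documented above; values are illustrative.
        "top_k": 40,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
        "enable_thinking": False,
    },
)
print(completion.choices[0].message.content)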

Response Extensions

The response objects include additional fields beyond the standard OpenAI API:

Harmony Mode Responses

For models using Harmony format (like GPT-OSS), responses may include additional reasoning content:

  • reasoning_content: string | null. Chain-of-thought reasoning from Harmony-format models. This field contains the model’s internal analysis and commentary that led to the final response. It is separate from the main content field.

When streaming, reasoning_content appears in the delta object alongside content.

Example response:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me analyze this step by step..."
    }
  }]
}
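
When streaming, an abridged chunk carrying both fields might look like the following (illustrative; field order and omitted keys will vary):

data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": " The answer", "reasoning_content": "Let me verify the arithmetic"}}]}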

Model Parameter Validation

Mistral.rs validates that the model parameter in API requests matches the model that was actually loaded by the server. This ensures requests are processed by the correct model and prevents confusion.

Behavior:

  • If the model parameter matches the loaded model name, the request proceeds normally
  • If the model parameter doesn’t match, the request fails with an error message indicating the mismatch
  • The special model name "default" can be used to bypass this validation entirely

Examples:

  • ✅ Request with "model": "meta-llama/Llama-3.2-3B-Instruct" when meta-llama/Llama-3.2-3B-Instruct is loaded → succeeds
  • ❌ Request with "model": "gpt-4" when mistral-7b-instruct is loaded → fails
  • ✅ Request with "model": "default" regardless of loaded model → always succeeds

Usage: Use "default" in the model field when you need to satisfy API clients that require a model parameter but don’t need to specify a particular model. This is demonstrated in all the examples below.

POST: /v1/chat/completions

Process an OpenAI-compatible request, returning an OpenAI-compatible response when finished. Please find the official OpenAI API documentation here. To control the interval at which keep-alive messages are sent, set the KEEP_ALIVE_INTERVAL environment variable to the desired time in milliseconds.

To send a request with the Python openai library:

import openai

client = openai.OpenAI(
    base_url="http://localhost:1234/v1",  # "http://<Your api-server IP>:port"
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are Mistral.rs, an AI assistant."},
        {"role": "user", "content": "Write a story about Rust error handling."},
    ],
)

print(completion.choices[0].message)

Or with curl:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "messages": [
      {
        "role": "system",
        "content": "You are Mistral.rs, an AI assistant."
      },
      {
        "role": "user",
        "content": "Write a story about Rust error handling."
      }
    ]
  }'

A streaming request can also be created by setting "stream": true in the request JSON. Please see this guide.
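
For example, with the Python openai client, streaming looks like this (a minimal sketch; port and prompt are illustrative):

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about Rust."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g. the final one) may carry no choices or an empty delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()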

ℹ️ Requests whose prompt exceeds the model’s maximum context length now fail unless you opt in to truncation. Set "truncate_sequence": true to drop the oldest prompt tokens while reserving room (equal to max_tokens when provided, otherwise one token) for generation. Specifically, tokens from the front of the prompt are dropped.

GET: /v1/models

Returns the running models.

Example with curl:

curl http://localhost:<port>/v1/models

GET: / or /health

Returns the server health.

Example with curl:

curl http://localhost:<port>/health

GET: /docs

Returns OpenAPI API docs via SwaggerUI.

Example with curl:

curl http://localhost:<port>/docs

POST: /v1/completions

Process an OpenAI compatible completions request, returning an OpenAI compatible response when finished. Please find the official OpenAI API documentation here.

Completions-specific parameters

In addition to the common parameters listed above, the completions endpoint supports the following (see the example after this list):

  • best_of: int | null. Generate best_of completions server-side and return the best one (the one with the highest log probability per token). When used with n, best_of must be greater than n.
  • echo: bool, default false. Echo back the prompt in addition to the completion.
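
For example, with the Python openai client (a minimal sketch; the prompt and values are illustrative):

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

completion = client.completions.create(
    model="default",
    prompt="fn main() {",
    max_tokens=32,
    n=1,
    best_of=3,   # generate 3 candidates server-side, return the best one
    echo=True,   # include the prompt in the returned text
)
print(completion.choices[0].text)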

To send a request with the Python openai library:

import openai

client = openai.OpenAI(
    base_url="http://localhost:1234/v1", # "http://<Your api-server IP>:port"
    api_key = "EMPTY"
)

completion = client.completions.create(
    model="default",
    prompt="What is Rust?",
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)

print(completion.choices[0].text)

Or with curl:

curl http://localhost:1234/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"prompt": "What is Rust?"
}'

ℹ️ The truncate_sequence flag behaves the same way for the completions endpoint: keep it false (default) to receive a validation error, or set it to true to trim the prompt automatically.

POST: /v1/embeddings

Serve an embedding model (for example, EmbeddingGemma) to enable this endpoint:

mistralrs serve -m google/embeddinggemma-300m

In multi-model mode, include an Embedding entry in your selector config to expose it alongside chat models.

Create vector embeddings via the OpenAI-compatible endpoint. Supported request fields:

  • input: a single string, an array of strings, an array of token IDs ([123, 456]), or a batch of token arrays ([[...], [...]]).
  • encoding_format: "float" (default) returns arrays of f32; "base64" returns Base64 strings.
  • dimensions: currently unsupported; providing it yields a validation error.
  • truncate_sequence: bool, default false. Set to true to clip over-length prompts instead of receiving a validation error.

ℹ️ Requests whose prompt exceeds the model’s maximum context length now fail unless you opt in to truncation. Embedding requests truncate tokens from the end of the prompt.

Example (Python openai client):

import openai

client = openai.OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="EMPTY",
)

result = client.embeddings.create(
    model="default",
    input=[
        "Embeddings capture semantic relationships between texts.",
        "What is graphene?",
    ],
    extra_body={"truncate_sequence": True},
)

for item in result.data:
    print(item.index, len(item.embedding))

Example with curl:

curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "input": ["graphene conductivity", "superconductor basics"],
    "encoding_format": "base64",
    "truncate_sequence": false
  }'

Responses follow the OpenAI schema: object: "list", data[*].embedding containing either float arrays or Base64 strings depending on encoding_format, and a usage block (prompt_tokens, total_tokens). At present those counters report 0 because token accounting for embeddings is not yet implemented.
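
When encoding_format is "base64", each embedding arrives as a Base64 string. Assuming the payload encodes little-endian f32 values (matching the "float" format), it can be decoded as follows (a minimal sketch):

import array
import base64

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="default",
    input=["graphene conductivity"],
    encoding_format="base64",
)

# Decode the Base64 payload back into f32 values.
raw = base64.b64decode(result.data[0].embedding)
vector = array.array("f", raw)
print(len(vector))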

POST: /v1/images/generations

Generate images using diffusion models (like FLUX). First, serve a diffusion model:

mistralrs serve -m black-forest-labs/FLUX.1-schnell

Supported request fields:

  • model: Model identifier (use "default" to bypass validation)
  • prompt: Text description of the image to generate
  • n: Number of images to generate (default: 1)
  • response_format: "url" or "b64_json" (default: "url")
  • height: Image height in pixels (default: 720)
  • width: Image width in pixels (default: 1280)

Example with Python:

import openai
import base64

client = openai.OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="EMPTY",
)

response = client.images.generate(
    model="default",
    prompt="A majestic snow-covered mountain at sunset",
    n=1,
    response_format="b64_json",
    size="1280x720",  # width x height
)

# Save the generated image
image_data = base64.b64decode(response.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_data)

Example with curl:

curl http://localhost:1234/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "prompt": "A majestic snow-covered mountain at sunset",
    "n": 1,
    "response_format": "b64_json",
    "height": 720,
    "width": 1280
  }'

POST: /v1/audio/speech

Generate speech from text using speech models (like Dia). First, serve a speech model:

mistralrs serve -m nari-labs/Dia-1.6B

Supported request fields:

  • model: Model identifier (use "default" to bypass validation)
  • input: Text to convert to speech. For Dia models, use speaker tags like [S1] and [S2] to control multiple voices
  • response_format: "wav" or "pcm" (only these formats are supported)

Note: The voice and instructions fields from the OpenAI API are currently ignored.

Example with Python:

import requests

response = requests.post(
    "http://localhost:1234/v1/audio/speech",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer EMPTY",
    },
    json={
        "model": "default",
        "input": "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!",
        "response_format": "wav",
    },
)

# Save the audio file
with open("output.wav", "wb") as f:
    f.write(response.content)

Example with curl:

curl http://localhost:1234/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "input": "[S1] Dia is an open weights text to dialogue model. [S2] Try it now!",
    "response_format": "wav"
  }' \
  --output output.wav

The response is raw audio data with the appropriate Content-Type header (audio/wav for WAV format, audio/pcm for PCM format).

POST: /v1/responses

Create a response using the OpenAI-compatible Responses API. Please find the official OpenAI API documentation here.

To send a request with the Python openai library:

import openai

client = openai.OpenAI(
    base_url="http://localhost:1234/v1",
    api_key = "EMPTY"
)

# First turn
resp1 = client.responses.create(
    model="default",
    input="Apples are delicious!"
)
print(resp1.output_text)

# Follow-up - no need to resend the first message
resp2 = client.responses.create(
    model="default",
    previous_response_id=resp1.id,
    input="Can you eat them?"
)
print(resp2.output_text)

Or with curl:

curl http://localhost:1234/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"input": "Tell me about Rust programming"
}'

# Follow-up using previous_response_id
curl http://localhost:1234/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "default",
"previous_response_id": "resp_12345-uuid-here",
"input": "What makes it memory safe?"
}'

The API also supports multimodal inputs (images, audio) and streaming responses by setting "stream": true in the request JSON.

ℹ️ The Responses API forwards truncate_sequence to underlying chat completions. Enable it if you want over-length conversations to be truncated rather than rejected.

GET: /v1/responses/{response_id}

Retrieve a previously created response by its ID.

Example with curl:

curl http://localhost:1234/v1/responses/resp_12345-uuid-here \
-H "Authorization: Bearer EMPTY"

DELETE: /v1/responses/{response_id}

Delete a stored response and its associated conversation history.

Example with curl:

curl -X DELETE http://localhost:1234/v1/responses/resp_12345-uuid-here \
-H "Authorization: Bearer EMPTY"

POST: /re_isq

Reapply ISQ to the model if possible. Pass a JSON object with the key ggml_type mapped to a string naming the quantization level.

Example with curl:

curl http://localhost:<port>/re_isq -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{"ggml_type":"4"}'

Model Management Endpoints

These endpoints allow dynamic management of loaded models, enabling you to free memory by unloading models and reload them on demand.

POST: /v1/models/unload

Unload a model from memory while preserving its configuration for later reload. The model can be reloaded manually or will auto-reload when a request is sent to it.

Request body:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct"
}

Response:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "unloaded"
}

Example with curl:

curl -X POST http://localhost:1234/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'

POST: /v1/models/reload

Manually reload a previously unloaded model. This is also triggered automatically when a request is sent to an unloaded model.

Request body:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct"
}

Response:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "loaded"
}

Example with curl:

curl -X POST http://localhost:1234/v1/models/reload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'

POST: /v1/models/status

Get the current status of a specific model.

Request body:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct"
}

Response:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "loaded"
}

Example with curl:

curl -X POST http://localhost:1234/v1/models/status \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'

Status Values

The status field in responses can be one of:

Status           | Description
loaded           | Model is loaded and ready to serve requests
unloaded         | Model is unloaded but can be reloaded
reloading        | Model is currently being reloaded
not_found        | Model ID not recognized
no_loader_config | Model cannot be reloaded (missing loader configuration)
internal_error   | An internal error occurred (check error field for details)

When an error occurs, the response may include an error field with additional details:

{
  "model_id": "unknown-model",
  "status": "not_found",
  "error": null
}

Auto-Reload Behavior

When a request (e.g., chat completion) is sent to an unloaded model, the model will automatically reload before processing the request. This enables a “lazy loading” pattern where models are only loaded when needed, helping manage GPU memory efficiently.
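
For example, a client can unload a model to free memory and rely on auto-reload for the next request (a sketch using the requests library; the model ID and port are illustrative):

import requests

BASE = "http://localhost:1234"
MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative model ID

# Free memory while the model is idle.
requests.post(f"{BASE}/v1/models/unload", json={"model_id": MODEL})
print(requests.post(f"{BASE}/v1/models/status", json={"model_id": MODEL}).json())

# Sending a request triggers an automatic reload before processing.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": MODEL, "messages": [{"role": "user", "content": "Hello!"}]},
)
print(resp.json()["choices"][0]["message"]["content"])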

Models List with Status

The /v1/models endpoint includes a status field for each model:

curl http://localhost:1234/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "default",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local",
      "status": "loaded"
    }
  ]
}

OpenResponses API

mistral.rs supports the OpenResponses API specification.

Endpoints

  • POST /v1/responses - Create a response
  • GET /v1/responses/{id} - Retrieve a response
  • DELETE /v1/responses/{id} - Delete a response
  • POST /v1/responses/{id}/cancel - Cancel a background response

Unsupported Parameters

The following parameters are accepted for API compatibility but will return errors if set to non-default values:

Parameter           | Behavior
parallel_tool_calls | Only true or omitted is supported; false returns an error
max_tool_calls      | Not supported; setting any value returns an error

mistral.rs Extensions

These additional parameters are available beyond the spec:

  • stop - Stop sequences
  • repetition_penalty - Token repetition penalty
  • top_k - Top-k sampling
  • grammar - Constrained generation grammar
  • min_p - Min-p sampling
  • dry_multiplier, dry_base, dry_allowed_length, dry_sequence_breakers - DRY sampling
  • web_search_options - Web search integration

See HTTP.md for usage examples.
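
For example, the extensions can be passed from the Python openai client via extra_body (a minimal sketch; the values and port are illustrative):

import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

resp = client.responses.create(
    model="default",
    input="Summarize the Rust borrow checker in one sentence.",
    extra_body={
        # mistral.rs extension parameters; values are illustrative.
        "top_k": 40,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
    },
)
print(resp.output_text)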

Supported Models

Complete reference for model support in mistral.rs.

Model Categories

Text Models

  • Granite 4.0
  • SmolLM 3
  • DeepSeek V3
  • GPT-OSS
  • DeepSeek V2
  • Qwen 3 MoE
  • Phi 3.5 MoE
  • Qwen 3
  • GLM 4
  • GLM-4.7-Flash
  • GLM-4.7 (MoE)
  • Gemma 2
  • Qwen 2
  • Starcoder 2
  • Phi 3
  • Mixtral
  • Phi 2
  • Gemma
  • Llama
  • Mistral

Vision Models

  • Qwen 3-VL
  • Gemma 3n
  • Llama 4
  • Gemma 3
  • Mistral 3
  • Phi 4 multimodal
  • Qwen 2.5-VL
  • MiniCPM-O
  • Llama 3.2 Vision
  • Qwen 2-VL
  • Idefics 3
  • Idefics 2
  • LLaVA Next
  • LLaVA
  • Phi 3V

Speech Models

  • Dia

Image Generation Models

  • FLUX

Embedding Models

  • Embedding Gemma
  • Qwen 3 Embedding

Request a new model

Supported GGUF Architectures

Plain:

  • llama
  • phi2
  • phi3
  • starcoder2
  • qwen2
  • qwen3

With adapters:

  • llama
  • phi3

Quantization Support

Model | GGUF | GGML | ISQ
Mistral
Gemma
Llama
Mixtral
Phi 2
Phi 3
Phi 3.5 MoE
Qwen 2.5
Phi 3 Vision
Idefics 2
Gemma 2
GLM4
GLM-4.7-Flash (MoE)
GLM-4.7 (MoE)
Starcoder 2
LLaVa Next
LLaVa
Llama 3.2 Vision
Qwen2-VL
Idefics 3
Deepseek V2
Deepseek V3
MiniCPM-O 2.6
Qwen2.5-VL
Gemma 3
Mistral 3
Llama 4
Qwen 3
SmolLM3
Dia 1.6b
Gemma 3n
Qwen 3 VL
Granite 4.0
GPT-OSS

Device Mapping Support

Model category | Supported
Plain
GGUF
GGML
Vision Plain

X-LoRA and LoRA Support

Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML
Mistral
Gemma
Llama
Mixtral
Phi 2
Phi 3
Phi 3.5 MoE
Qwen 2.5
Phi 3 Vision
Idefics 2
Gemma 2
GLM4
GLM-4.7-Flash (MoE)
GLM-4.7 (MoE)
Starcoder 2
LLaVa Next
LLaVa
Qwen2-VL
Idefics 3
Deepseek V2
Deepseek V3
MiniCPM-O 2.6
Qwen2.5-VL
Gemma 3
Mistral 3
Llama 4
Qwen 3
SmolLM3
Gemma 3n
Qwen 3 VL
Granite 4.0
GPT-OSS

AnyMoE Support

Model | AnyMoE
Mistral 7B
Gemma
Llama
Mixtral
Phi 2
Phi 3
Phi 3.5 MoE
Qwen 2.5
Phi 3 Vision
Idefics 2
Gemma 2
GLM-4.7-Flash (MoE)
GLM-4.7 (MoE)
Starcoder 2
LLaVa Next
LLaVa
Llama 3.2 Vision
Qwen2-VL
Idefics 3
Deepseek V2
Deepseek V3
MiniCPM-O 2.6
Qwen2.5-VL
Gemma 3
Mistral 3
Llama 4
Qwen 3
SmolLM3
Gemma 3n
Qwen 3 VL
Granite 4.0
GPT-OSS

Using Derivative Models

Model type is auto-detected. Use flags for quantized models and adapters:

Model Type     | Required Arguments
Plain          | -m <model-id>
GGUF Quantized | -m <model-id> --format gguf -f <file>
ISQ Quantized  | -m <model-id> --isq <level>
UQFF Quantized | -m <model-id> --from-uqff <file>
LoRA           | -m <model-id> --lora <adapter>
X-LoRA         | -m <model-id> --xlora <adapter> --xlora-order <file>

Example: Zephyr GGUF model

mistralrs serve -p 1234 --log output.txt --format gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf

Chat Templates and Tokenizer

Mistral.rs will attempt to automatically load a chat template and tokenizer. This works across a wide range of models and ensures accurate chat templating. However, this behavior can be customized.

Vision model support in mistral.rs

Mistral.rs supports various modalities of models, including vision models. Vision models take images and text as input and have the capability to reason over both.

Please see docs for the following model types:

Note for the Python and HTTP APIs: We follow the OpenAI specification for structuring the image messages and allow both base64 encoded images as well as a URL/path to the image. There are many examples of this, see this Python example.

Image generation model support in mistral.rs

Mistral.rs supports various modalities of models, including image generation models. Image generation models take text as input and generate images.

Please see docs for the following model types:

Embeddings Overview

Mistral.rs can load embedding models alongside chat, vision, diffusion, and speech workloads. Embedding models produce dense vector representations that you can use for similarity search, clustering, reranking, and other semantic tasks.

Supported models

Model           | Notes                                     | Documentation
EmbeddingGemma  | Google’s multilingual embedding model.    | EMBEDDINGGEMMA.md
Qwen3 Embedding | Qwen’s general-purpose embedding encoder. | QWEN3_EMBEDDING.md

Have another embedding model you would like supported? Open an issue with the model ID and configuration.

Usage overview

  1. Choose a model from the table above.
  2. Load it through one of our APIs:
    • CLI/HTTP
    • Python
    • Rust

Detailed examples for each model live in their dedicated documentation pages.

DeepSeek V2: deepseek-ai/DeepSeek-V2-Lite

DeepSeek V2 is a mixture-of-experts (MoE) model featuring Multi-head Latent Attention (MLA).

  • Context length of 32k tokens (Lite model), 128k tokens (full model)
  • 64 routed experts (Lite model), 160 routed experts (full model)

mistralrs run --isq 4 -m deepseek-ai/DeepSeek-V2-Lite

Note

This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:

mistralrs run --isq 4 -m deepseek-ai/DeepSeek-V2-Lite --isq-organization moqe

HTTP API

mistralrs serve --isq 4 -p 1234 -m deepseek-ai/DeepSeek-V2-Lite
import openai

client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="deepseek-ai/DeepSeek-V2-Lite",
        arch=Architecture.DeepseekV2,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find this example here.

use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("deepseek-ai/DeepSeek-V2-Lite")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

DeepSeek V3: deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1

DeepSeek V3 is a mixture-of-experts (MoE) model.

mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1

Note

The non-distill versions of the DeepSeek R1 models share the DeepSeek V3 architecture.

Note

This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:

mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1 --isq-organization moqe

Running the distill models

The various distillation models can be run out of the box.

mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
mistralrs run --isq 4 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

HTTP API

mistralrs serve --isq 4 -p 1234 -m deepseek-ai/DeepSeek-R1
import openai

client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="deepseek-ai/DeepSeek-R1",
        arch=Architecture.DeepseekV3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find this example here.

use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("deepseek-ai/DeepSeek-R1")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Gemma 2 Model

See the Gemma 2 model Collection

The Gemma 2 models are a family of text-to-text decoder-only LLMs. As such, the methods to use them are the same as with all other text-to-text LLMs supported by mistral.rs.

HTTP API

import openai

client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="google/gemma-2-9b-it",
        arch=Architecture.Gemma2,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Gemma 3 Model: google/gemma-3-4b-it

The Gemma 3 model is a family of multimodal (text+vision) models with 128k context length. The collection can be found here, with model sizes ranging from 4B to 27B.

We support the Gemma 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Prompt:

What is this?

Output:

The image shows Mount Washington in New Hampshire, USA. It's a prominent peak in the White Mountains, known for its extreme weather conditions and being the highest peak in the Northeastern United States. The image captures it covered in snow with a dramatic sky above. The structures at the summit are communication towers.



The winding path visible on the mountain slopes appears to be part of the Mount Washington Auto Road, a historic road that allows vehicles to drive to the summit.

  1. Start the server
mistralrs serve vision -p 1234 -m google/gemma-3-12b-it
  2. Send a request
from openai import OpenAI
import httpx
import textwrap
import json


client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")


completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is this?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

This is a minimal example of running the Gemma 3 model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new("google/gemma-3-12b-it")
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-12b-it",
        arch=VisionArchitecture.Gemma3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Gemma 3n Model: google/gemma-3n-E4B-it

Gemma 3n models are designed for efficient execution on low-resource devices. They accept multimodal input (text, image, video, and audio) and generate text output. These models support over 140 spoken languages.

The Gemma 3n Model has support in the Rust, Python, and HTTP APIs. Additionally, the Gemma 3n Model supports ISQ for increased performance.

  • Full multimodal support: mistral.rs supports text, audio, and vision inputs to Gemma 3n!

  • 🪆 mistral.rs supports dynamically resizing the Gemma 3n model with the MatFormer architecture!

    Gemma 3n implements the MatFormer architecture, which allows one model to be resized dynamically to tune performance on resource-constrained systems.

    Mistral.rs supports this feature!

    You can access it using the matformer_config_path (example config) and matformer_slice_name arguments throughout the APIs.

  • Prequantized UQFF models:

Using MatFormer with Gemma 3n

MatFormer allows you to dynamically adjust the model size based on your resource constraints. The Gemma 3n model comes with several pre-configured slices that offer different performance/resource trade-offs.

You can read more about MatFormer in mistral.rs here.

Available Slices

The default configuration file (matformer_configs/gemma3n.csv) includes:

  • Main model (3.98B params, 35 layers) - Full model with best performance
  • Config for official E2B Model (1.91B params, 30 layers) - Balanced performance/efficiency
  • Various intermediate configurations from E1.96B to E3.79B with different layer and FFN configurations

Command Line Example

# Run with the E2.49B slice for balanced performance/efficiency
mistralrs run vision -m google/gemma-3n-E4B-it \
  --matformer-config-path matformer_configs/gemma3n.csv \
  --matformer-slice-name "Config for E2.49B (block-level)"

Python SDK Example

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

# Use the E2.49B slice for balanced performance/efficiency
runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=VisionArchitecture.Gemma3n,
        matformer_config_path="matformer_configs/gemma3n.csv",
        matformer_slice_name="Config for E2.49B (block-level)",
    ),
)

# The model will use 35 layers with mixed FFN dimensions (4096 for early layers, 8192 for middle)
# This results in ~37% parameter reduction while maintaining better performance than E2B
res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="ignore",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What do you see in this image?",
                    },
                ],
            }
        ],
        max_tokens=100,
    )
)
print(res.choices[0].message.content)

Rust SDK Example

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<()> {
    // Build model with MatFormer E2.49B configuration
    let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
        .with_isq(IsqType::Q4K)
        .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
        .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
        .with_logging()
        .build()
        .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "Describe this image briefly.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    println!("Using E2.49B slice: 35 layers, 2.49B effective params");
    
    Ok(())
}

Choosing the Right Slice

  • Resource-constrained environments: Use “Config for official E2B Model” (1.91B params)
  • Balanced performance: Try E2.49B to E2.98B configurations (block-level configs offer better balance)
  • Maximum quality: Use “Main model” (3.98B params) or omit MatFormer configuration entirely

The slice selection allows you to:

  • Reduce memory usage proportionally to the parameter count
  • Speed up inference roughly linearly with the number of layers
  • Maintain acceptable quality for many use cases with smaller slices

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington


Prompt:

Please describe this image in detail.

Output:

The image captures a breathtaking, wide-angle view of a majestic mountain covered in a blanket of snow. The mountain dominates the frame, its peak reaching towards a partly cloudy sky. The snow cover is uneven, with patches of exposed dark rock and textured snow formations creating a visually interesting surface. 

A winding, snow-covered path or road snakes its way up the mountainside, appearing as a bright white line against the darker slopes. This path draws the eye upwards towards the summit, where a few structures, possibly communication towers or observation points, are visible. 

The lower slopes of the mountain are covered in a dense forest of evergreen trees, their dark green hues contrasting beautifully with the white snow. The forest extends down into a valley, hinting at a wider landscape beyond the frame. 

The sky above is a mix of pale blue and soft grey clouds, with some darker, more dramatic cloud formations near the top of the mountain. The lighting suggests it might be early morning or late afternoon, casting subtle shadows across the mountain's surface and highlighting its contours. 

The overall impression is one of grandeur, tranquility, and the raw beauty of a winter landscape. The scale of the mountain is impressive, and the winding path invites a sense of exploration and adventure.

  1. Start the server
mistralrs serve vision -p 1234 -m google/gemma-3n-E4B-it

# Or with MatFormer for balanced performance:
mistralrs serve vision -p 1234 -m google/gemma-3n-E4B-it \
  --matformer-config-path matformer_configs/gemma3n.csv \
  --matformer-slice-name "Config for E2.49B (block-level)"
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="ignore",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "Please describe this image in detail.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

This is a minimal example of running the Gemma 3n model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new("google/gemma-3n-E4B-it")
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "Please describe the image in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=VisionArchitecture.Gemma3n,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="ignore",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "Please describe this image in detail.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

OpenAI HTTP API

Audio is delivered with the audio_url content-type that mirrors OpenAI's official specification:

{
  "role": "user",
  "content": [
    {
      "type": "audio_url",
      "audio_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg" }
    },
    {
      "type": "image_url",
      "image_url": { "url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg" }
    },
    {
      "type": "text",
      "text": "Describe what is happening in this clip in as much detail as possible."
    }
  ]
}

Rust SDK

use anyhow::Result;
use mistralrs::{AudioInput, IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let audio_bytes = reqwest::blocking::get(
        "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg",
    )?
    .bytes()?
    .to_vec();
    let audio = AudioInput::from_bytes(&audio_bytes)?;

    let image_bytes = reqwest::blocking::get(
        "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg",
    )?
    .bytes()?
    .to_vec();
    let image = image::load_from_memory(&image_bytes)?;

    let messages = VisionMessages::new()
        .add_multimodal_message(
            TextMessageRole::User,
            "Describe in detail what is happening.",
            vec![image],
            vec![audio],
            &model,
        )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}

With this, you now have a single-call pipeline that fuses sound, vision, and text – all running locally through mistral.rs! 🔥

GLM4 Model

See the GLM4 model Collection

GLM4 is a series of open, multilingual, and multimodal large language models. The text-to-text LLM backbones in GLM4 are supported by mistral.rs.

HTTP API

import openai

client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="THUDM/GLM-4-9B-0414",
        arch=Architecture.GLM4,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

GLM-4.7-Flash (MoE): zai-org/GLM-4.7-Flash

GLM-4.7-Flash is a mixture of experts (MoE) model from the GLM family with MLA (Multi-head Latent Attention) architecture.

HTTP API

Start the server:

mistralrs serve --isq 4 -p 1234 -m zai-org/GLM-4.7-Flash

Send requests using an OpenAI-compatible client:

import openai

client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="zai-org/GLM-4.7-Flash",
        arch=Architecture.GLM4MoeLite,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find this example here.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("zai-org/GLM-4.7-Flash")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

GLM-4.7 (MoE): zai-org/GLM-4.7

GLM-4.7 is a mixture of experts (MoE) model from the GLM family with standard GQA attention and partial RoPE.

HTTP API

Start the server:

mistralrs serve --isq 4 -p 1234 -m zai-org/GLM-4.7

Send requests using an OpenAI-compatible client:

import openai

client = openai.Client(base_url="http://localhost:1234/v1", api_key="foobar")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="zai-org/GLM-4.7",
        arch=Architecture.GLM4Moe,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("zai-org/GLM-4.7")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

GPT-OSS

GPT-OSS is a Mixture of Experts (MoE) language model with specialized attention mechanisms and efficient quantization. Key features include:

  • MXFP4 quantized MoE experts for efficient inference
  • Per-head attention sinks for improved attention patterns
  • YARN RoPE scaling for extended context
  • Hybrid cache supporting both full and sliding window attention

mistralrs run -m openai/gpt-oss-20b

Note: GPT-OSS MoE experts are pre-quantized in MXFP4 format. ISQ can be applied to attention layers only.

Note: PagedAttention is not supported for GPT-OSS due to custom attention with sinks.

HTTP API

You can find a more detailed example here.

mistralrs serve -p 1234 -m openai/gpt-oss-20b
import openai

client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})

while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

You can find a more detailed example here.

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="openai/gpt-oss-20b",
        arch=Architecture.GptOss,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find a more detailed example here.

use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("openai/gpt-oss-20b")
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Technical Details

MXFP4 Quantization

GPT-OSS MoE experts use MXFP4 (4-bit microscaling floating point) quantization for compact and efficient storage:

  • gate_up_proj: Packed experts with MXFP4 weights
  • down_proj: Packed experts with MXFP4 weights
  • Scales stored at 1 byte per 32 elements
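As a back-of-the-envelope illustration of the layout above (not a dump of the real packing code), each 32-element block stores 32 four-bit values plus one byte of shared scale, which works out to 4.25 bits per weight:

# Rough estimate of MXFP4 storage cost per expert weight.
BLOCK_SIZE = 32      # elements sharing one scale
ELEMENT_BITS = 4     # 4-bit MXFP4 payload per element
SCALE_BITS = 8       # 1 byte of scale per 32-element block

def mxfp4_bits_per_weight() -> float:
    total_bits = BLOCK_SIZE * ELEMENT_BITS + SCALE_BITS
    return total_bits / BLOCK_SIZE

print(f"{mxfp4_bits_per_weight():.2f} bits per weight")  # 4.25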

Attention with Sinks

The model uses per-head attention sinks that are added to attention logits before softmax, helping to regularize attention patterns. This custom attention mechanism is incompatible with PagedAttention.
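As a rough sketch of the mechanism (not the kernel mistral.rs actually uses), the NumPy snippet below appends a learned per-head sink logit to each row of attention logits before the softmax, then drops the sink column, so it absorbs probability mass without contributing to the output:

import numpy as np

def attention_with_sink(q, k, v, sink_logit):
    """q, k, v: (T, d) for one head; sink_logit: learned scalar (illustrative)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                      # (T, T) attention logits
    sink = np.full((logits.shape[0], 1), sink_logit)   # extra "sink" column per row
    logits = np.concatenate([logits, sink], axis=-1)   # (T, T + 1)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys + sink
    return weights[:, :-1] @ v                         # drop the sink before mixing values

T, d = 4, 8
rng = np.random.default_rng(0)
out = attention_with_sink(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                          rng.normal(size=(T, d)), sink_logit=0.5)
print(out.shape)  # (4, 8)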

ISQ Support

In-situ quantization (ISQ) can be applied to attention projection layers:

  • q_proj, k_proj, v_proj, o_proj
  • lm_head

MoE expert layers are already MXFP4 quantized and excluded from ISQ.
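In practice you pass an ISQ setting as usual and only the projection layers listed above (plus lm_head) are requantized, while the MXFP4 experts are left untouched. For example, with an arbitrarily chosen quantization level:

mistralrs run --isq 8 -m openai/gpt-oss-20b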

Qwen 3: collection

The Qwen 3 family is a collection of hybrid reasoning MoE and non-MoE models ranging from 0.6B to 235B parameters.

mistralrs run --isq 4 -m Qwen/Qwen3-8B
mistralrs run --isq 4 -m Qwen/Qwen3-30B-A3B

Note: mistral.rs can load all FP8 pre-quantized versions natively! Simply replace the model ID.
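For example, assuming the FP8 variant of your chosen checkpoint is published under the usual -FP8 suffix:

mistralrs run -m Qwen/Qwen3-8B-FP8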

Note: tool calling support is fully implemented for the Qwen 3 models, including agentic web search.

Enabling thinking

The Qwen 3 models are hybrid reasoning models whose reasoning can be toggled at inference time. By default, reasoning is enabled. To control it dynamically, it is recommended to append /no_think or /think to your prompt. Alternatively, you can set the enable_thinking flag as detailed in the API-specific examples.
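For example, a minimal sketch of the prompt-based toggle against an already-running OpenAI-compatible server (started as in the HTTP API section below); the /no_think suffix is simply appended to the user message:

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Appending /no_think asks the model to answer directly, without a reasoning trace.
completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 2 + 2? /no_think"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)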

HTTP API

You can find a more detailed example demonstrating enabling/disabling thinking here.

mistralrs serve --isq 4 -p 1234 -m Qwen/Qwen3-8B
import openai

client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
        # enable_thinking=False,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

You can find a more detailed example demonstrating enabling/disabling thinking here.

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="Qwen/Qwen3-8B",
        arch=Architecture.Qwen3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
        # enable_thinking=False,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find a more detailed example demonstrating enabling/disabling thinking here.

use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("Qwen/Qwen3-8B")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        // .enable_thinking(false)
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

SmolLM3: HuggingFaceTB/SmolLM3-3B

SmolLM3 is a 3B-parameter, long-context hybrid reasoning language model that supports six languages. It is a fully open model that offers strong performance at the 3B–4B scale.

Default, easiest:

mistralrs run --isq 8 -m HuggingFaceTB/SmolLM3-3B

UQFF prequantized:

mistralrs run -m EricB/SmolLM3-3B-UQFF --from-uqff smollm33b-q4k-0.uqff

Note: tool calling support is fully implemented for the SmolLM3 models, including agentic web search.

Check out prequantized UQFF SmolLM3 here: https://huggingface.co/EricB/SmolLM3-3B-UQFF

Enabling thinking

The SmolLM3 models are hybrid reasoning models whose reasoning can be toggled at inference time. By default, reasoning is enabled. To control it dynamically, it is recommended to append /no_think or /think to your prompt. Alternatively, you can set the enable_thinking flag as detailed in the API-specific examples.

HTTP API

You can find a more detailed example demonstrating enabling/disabling thinking here.

mistralrs serve --isq 8 -p 1234 -m HuggingFaceTB/SmolLM3-3B
import openai

client = openai.OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
        # enable_thinking=False,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

You can find a more detailed example demonstrating enabling/disabling thinking here.

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="HuggingFaceTB/SmolLM3-3B",
        arch=Architecture.SmolLm3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
        # enable_thinking=False,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find a more detailed example demonstrating enabling/disabling thinking here.

use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("HuggingFaceTB/SmolLM3-3B")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        // .enable_thinking(false)
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Idefics 2 Model: HuggingFaceM4/idefics2-8b-chatty

The Idefics 2 model is supported in the Rust, Python, and HTTP APIs, and also supports ISQ for increased performance.

Note: Some of the examples use our Cephalo model series, but any compatible model ID can be used.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.
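As an illustration of the base64 option, the sketch below reads a local file (photo.jpg is a hypothetical path) and sends it as a data: URI through the OpenAI-compatible HTTP API; the data-URI form is an assumption here, based on common OpenAI-client conventions, rather than the only supported encoding:

import base64
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

with open("photo.jpg", "rb") as f:  # hypothetical local image path
    b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)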

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image:

Prompt:

What is shown in this image?

Output:

The image depicts a group of orange ants climbing over a black pole. The ants are moving in the same direction, forming a line as they ascend the pole.

  1. Start the server
mistralrs serve vision -p 1234 --isq 4 -m HuggingFaceM4/idefics2-8b-chatty
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown in this image?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)

Rust

You can find this example here.

This is a minimal example of running the Idefics 2 model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new(
        "HuggingFaceM4/idefics2-8b-chatty",
    )
    .with_isq(IsqType::Q4K)
    .with_logging()
    .build()
    .await?;

    let bytes = match reqwest::blocking::get(
        "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_idefics_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="lamm-mit/Cephalo-Idefics-2-vision-8b-beta",
        arch=VisionArchitecture.Idefics2,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is shown in this image?",
                    },
                ],
            },
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Idefics 3 Vision: HuggingFaceM4/Idefics3-8B-Llama3

Mistral.rs supports the Idefics 3 vision model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported to allow running the model with reduced memory requirements.

UQFF quantizations are also available.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. Check the Hugging Face text model config for more information or raise an issue.


Using the 🤗 Smol VLM models

Simply substitute the Idefics 3 model ID (HuggingFaceM4/Idefics3-8B-Llama3) with the Smol VLM one (HuggingFaceTB/SmolVLM-Instruct)!
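For example, assuming the same flags used for Idefics 3 below carry over unchanged:

mistralrs run vision --isq 4 -m HuggingFaceTB/SmolVLM-Instruct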

Interactive mode

Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.

  1. Start up interactive mode with the Idefics 3 model
mistralrs run vision --isq 4 -m HuggingFaceM4/Idefics3-8B-Llama3
  2. Ask a question
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image depicts a single, large, red rose in full bloom. The rose is positioned against a blurred background that suggests a natural setting, possibly outdoors. The petals of the rose are vividly red with a slight sheen, indicating that they are wet, likely from recent rainfall or dew. The petals are tightly packed and have a velvety texture, which is characteristic of roses. The edges of the petals are slightly curled and appear to be glistening with water droplets, enhancing the overall freshness and beauty of the flower.

The stem of the rose is visible and appears to be green, with a few small thorns scattered along its length. The stem is slender and supports the weight of the large, showy head of the rose. The leaves that accompany the stem are not fully visible in the image but are implied by the presence of the stem.

The background is out of focus, which helps to emphasize the rose as the main subject of the image. The blurred background suggests a natural environment, possibly a garden or a field, with hints of greenery and possibly other flowers or plants. The lighting in the image is natural, likely from sunlight, which casts soft shadows on the petals and adds depth to the scene.

The overall composition of the image focuses on the rose, making it the central point of interest. The wetness of the petals adds a dynamic element to the stillness of the flower, giving it a sense of life and vitality. This could symbolize themes of beauty, nature, and perhaps even passion or love.

In summary, this image captures a single red rose in full bloom with wet petals against a blurred natural background. The rose is the focal point, with its vibrant red color and glistening petals drawing attention. The natural lighting and out-of-focus background enhance the beauty and freshness of the flower.
  3. Continue the chat by passing another image.
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image depicts a single, large, red rose in full bloom. The rose is positioned against a blurred background that suggests a natural setting, possibly outdoors. The petals of the rose are vividly red with a slight sheen, indicating that they are wet, likely from recent rainfall or dew. The petals are tightly packed and have a velvety texture, which is characteristic of roses. The edges of the petals are slightly curled and appear to be glistening with water droplets, enhancing the overall freshness and beauty of the flower.

The stem of the rose is visible and appears to be green, with a few small thorns scattered along its length. The stem is slender and supports the weight of the large, showy head of the rose. The leaves that accompany the stem are not fully visible in the image but are implied by the presence of the stem.

The background is out of focus, which helps to emphasize the rose as the main subject of the image. The blurred background suggests a natural environment, possibly a garden or a field, with hints of greenery and possibly other flowers or plants. The lighting in the image is natural, likely from sunlight, which casts soft shadows on the petals and adds depth to the scene.

The overall composition of the image focuses on the rose, making it the central point of interest. The wetness of the petals adds a dynamic element to the stillness of the flower, giving it a sense of life and vitality. This could symbolize themes of beauty, nature, and perhaps even passion or love.

In summary, this image captures a single red rose in full bloom with wet petals against a blurred natural background. The rose is the focal point, with its vibrant red color and glistening petals drawing attention. The natural lighting and out-of-focus background enhance the beauty and freshness of the flower.
> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg What mountain is this?
The mountain is Mount Washington.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Credit

Prompt:

What is shown in this image? Write a detailed response analyzing the scene.

Output:

The image depicts a majestic mountain landscape under a partly cloudy sky, characterized by its rugged and snow-covered peaks. The mountain is prominently featured in the center of the image, showcasing its expansive and undulating terrain. The summit of the mountain is capped with snow, indicating that it might be winter or early springtime.

The slopes of the mountain are steep and uneven, covered with patches of snow that appear to have been recently fallen or freshly groomed for skiing or other winter activities. There are visible ski trails descending from the summit down towards what seems to be a valley below, suggesting that this location could be a popular ski resort area.

In addition to the main peak, there are smaller hills and ridges surrounding it on both sides. These secondary peaks also have varying degrees of snow cover but appear less prominent than the central peak.

The sky above is mostly overcast with clouds covering most parts but allowing some sunlight to peek through in certain areas, casting soft shadows on parts of the mountainside. This lighting suggests that it might not be midday yet as there isn't an intense brightness typical for noon hours.

On closer inspection near one side of this grandeur scene stands tall trees without leaves; their bare branches starkly contrasting against both white snow and blue sky create an interesting... (cut off)

  1. Start the server
mistralrs serve vision -p 1234 --isq 4 -m HuggingFaceM4/Idefics3-8B-Llama3
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)

Rust

You can find this example here.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

const MODEL_ID: &str = "HuggingFaceM4/Idefics3-8B-Llama3";

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new(MODEL_ID)
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .build()
        .await?;

    let bytes = match reqwest::blocking::get(
        "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="HuggingFaceM4/Idefics3-8B-Llama3",
        arch=VisionArchitecture.Idefics3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is shown in this image?",
                    },
                ],
            },
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

UQFF models

Coming soon!

LLaVA and LLaVANext Model: llava-hf model family

LLaVA and LLaVANext are multimodal models that can handle both text and vision inputs.

This implementation supports both LLaVA and LLaVANext (which adds multi-resolution image processing) and two types of LLM base models: Llama and Mistral. It is currently tested on:

  • llava-hf/llava-v1.6-mistral-7b-hf
  • llava-hf/llava-v1.6-vicuna-7b-hf
  • llava-hf/llava-1.5-7b-hf

The LLaVA and LLaVANext models are supported in the Rust, Python, and HTTP APIs, and also support ISQ for increased performance.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Credit

Prompt:

What is shown in this image?

Output:

Text: The image shows a steep, snow-covered hillside with a pine tree on the right side, close to the top. The landscape appears to be a mountainous area with winter conditions. There are no visible humans or permanent structures in the immediate vicinity that suggest this is a summer or recreational location. It's likely a cold, snowy day or season, and the slopes might be part of a mountainous region.

  1. Start the server
mistralrs serve vision -p 1234 --isq 4 -m llava-hf/llava-v1.6-mistral-7b-hf
# or for vicuna backend, specify the chat template:
mistralrs serve vision -p 1234 --isq 4 -c ./chat_templates/vicuna.json -m llava-hf/llava-v1.6-vicuna-7b-hf
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown in this image?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)

Rust

You can find this example here.

This is a minimal example of running the LLaVA and LLaVANext model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new(
        "llava-hf/llava-v1.6-mistral-7b-hf",
    )
    .with_isq(IsqType::Q4K)
    .with_logging()
    .build()
    .await?;

    let bytes = match reqwest::blocking::get(
        "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_llava_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="llava-hf/llava-v1.6-mistral-7b-hf",
        arch=VisionArchitecture.LLaVANext,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is shown in this image?",
                    },
                ],
            },
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Llama 3.2 Vision Model: meta-llama/Llama-3.2-11B-Vision-Instruct

Mistral.rs supports the Llama 3.2 vision model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported to allow running the model with reduced memory requirements.

UQFF quantizations are also available.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

Note: Some examples use the Cephalo Llama 3.2 model, a member of the Cephalo model collection. This model is a finetune of Llama 3.2 with enhanced capabilities for scientific images. To use the base Llama 3.2 Vision model, simply use the associated model ID.

Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. The text model has 40 layers.


Interactive mode

Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.

https://github.com/user-attachments/assets/4d11c35c-9ea2-42b8-8cab-5f7e8e2ee9ff

  1. Start up interactive mode with the Llama 3.2 model
mistralrs run vision --isq 4 -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
  2. Say hello!
> Hello!
How can I assist you today?
  3. Pass the model an image and ask a question.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
  4. Continue the chat by passing another image.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.
> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg What mountain is this?
The image appears to be of Mount Washington, which is the highest peak in the Northeastern United States. It is located in the White Mountains of New Hampshire and is known for its extreme weather conditions, including high winds and low temperatures. The mountain's summit reaches an elevation of approximately 6,288 feet (1,917 meters) above sea level.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Credit

Prompt:

What is shown in this image? Write a detailed response analyzing the scene.

Output:

The image shows Mount Washington, the highest peak in the Northeastern United States, located in the White Mountains of New Hampshire. The scene captures the mountain's rugged terrain and varied landscape features. 

In the foreground, there are dense forests of coniferous trees, primarily spruce and fir, which are typical of the region's boreal forest ecosystem. The trees are densely packed, indicating a high level of vegetation cover and biodiversity.

Moving upwards, the image reveals rocky outcroppings and boulders scattered across the slope, indicating the mountain's geological history of glacial activity. The presence of these rocks suggests that the area was once covered by ice sheets during the last ice age, which carved out the landscape and left behind a mix of boulders and talus slopes.

In the mid-ground, the image shows a series of ridges and valleys, which are characteristic of the mountain's glacially sculpted terrain. These features were formed by the movement of ice sheets that carved out U-shaped valleys and left behind a series of rounded hills and ridges.

At the summit, there is a prominent observation tower or weather station, which is likely used for scientific research and weather monitoring. The structure is situated at an elevation of approximately 6,288 feet (1,917 meters) above sea level, making it one of the highest points in the region.

The image also captures the atmospheric conditions on Mount Washington, with clouds and mist visible in the background. The mountain's unique location in a region where cold Arctic air meets warm moist air from the Gulf Stream creates a unique microclimate known as the "Home Rule," where extreme weather conditions can occur.

Overall, the image showcases the diverse geological and ecological features of Mount Washington, highlighting its role as a significant natural landmark in the Northeastern United States.

  1. Start the server
mistralrs serve vision -p 1234 --isq 4 -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

const MODEL_ID: &str = "lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k";

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new(MODEL_ID)
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

MODEL_ID = "lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k"

runner = Runner(
    which=Which.VisionPlain(
        model_id=MODEL_ID,
        arch=VisionArchitecture.VLlama,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

UQFF models

UQFF is a quantized file format similar to GGUF based on ISQ. It removes the memory and compute requirements that come with ISQ by providing ready-made quantizations! The key advantage over GGUF is the flexibility to store multiple quantizations in one file.

We provide UQFF files (EricB/Llama-3.2-11B-Vision-Instruct-UQFF) for this Llama 3.2 Vision model.

You can use these UQFF files to easily use quantized versions of Llama 3.2 Vision.

For example:

mistralrs run -m meta-llama/Llama-3.2-11B-Vision-Instruct --from-uqff EricB/Llama-3.2-11B-Vision-Instruct-UQFF/llama-3.2-11b-vision-q4k.uqff

Llama 4 Series: meta-llama/Llama-4-Scout-17B-16E-Instruct

🚧 We are preparing a collection of UQFF quantized models! 🚧


The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences.

Architecture:

  • Efficient inference: 17B activated parameters
  • Very sparse: 1 activated expert for both Scout (of 16) and Maverick (of 128)
  • RoPE enhancement: iRoPE enables high context-length functionality

Integration in mistral.rs:

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image:

Credit

Prompt:

Please describe this image in detail.

Output:

The image presents a breathtaking mountain landscape, with a snow-capped peak dominating the scene. The mountain's rugged terrain is characterized by numerous ridges and valleys, while its summit is adorned with several structures that appear to be communication towers or antennas.

**Key Features:**

* **Mountain:** The mountain is the central focus of the image, showcasing a mix of snow-covered and bare areas.
* **Sky:** The sky above the mountain features a dramatic display of clouds, with dark grey clouds at the top gradually giving way to lighter blue skies towards the bottom.
* **Valley:** In the foreground, a valley stretches out, covered in trees that are mostly bare, suggesting a winter setting.
* **Lighting:** The lighting in the image is striking, with the sun casting a warm glow on the mountain's snow-covered slopes while leaving the surrounding areas in shadow.

**Overall Impression:**

The image exudes a sense of serenity and majesty, capturing the beauty of nature in a dramatic and awe-inspiring way. The contrast between the snow-covered mountain and the bare trees in the valley creates a visually appealing scene that invites the viewer to appreciate the natural world.

  1. Start the server
mistralrs serve vision -p 1234 --isq 4 -m meta-llama/Llama-4-Scout-17B-16E-Instruct
  2. Send a request
from openai import OpenAI
import httpx
import textwrap
import json


client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")


completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "Please describe this image in detail.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

This is a minimal example of running the Llama 4 model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new(
        "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    )
    .with_isq(IsqType::Q4K)
    .with_logging()
    .build()
    .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is this?",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        arch=VisionArchitecture.Llama4,
    ),
    in_situ_quant="4",
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

MiniCPM-O 2.6 Model: openbmb/MiniCPM-o-2_6

Mistral.rs supports the MiniCPM-O 2.6 model, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported to allow running the model with reduced memory requirements.

UQFF quantizations are coming soon.

Note

Only the vision portion of this model has been implemented. No audio features are supported yet.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.


Interactive mode

Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.

  1. Start up interactive mode with the MiniCPM-O 2.6 model
mistralrs run vision --isq 4 -m openbmb/MiniCPM-o-2_6
  2. Say hello!
> Hello!
How can I assist you today?
  3. Pass the model an image and ask a question.
> Hello!
How can I assist you today?
> \image https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Rosa_Precious_platinum.jpg/220px-Rosa_Precious_platinum.jpg What is this image?
The image shows a close-up view of a rose flower with dew drops on its petals. The rose is in full bloom, with its petals unfolding and displaying vibrant pink coloration. The dew drops on the petals create a delicate, glistening effect, adding to the overall visual appeal of the flower. The background is blurred, focusing attention on the intricate details of the rose.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Credit

Prompt:

What is shown in this image? Write a detailed response analyzing the scene.

Output:

The image shows Mount Washington, the highest peak in the Northeastern United States, located in the White Mountains of New Hampshire. The scene captures the mountain's rugged terrain and varied landscape features. 

In the foreground, there are dense forests of coniferous trees, primarily spruce and fir, which are typical of the region's boreal forest ecosystem. The trees are densely packed, indicating a high level of vegetation cover and biodiversity.

Moving upwards, the image reveals rocky outcroppings and boulders scattered across the slope, indicating the mountain's geological history of glacial activity. The presence of these rocks suggests that the area was once covered by ice sheets during the last ice age, which carved out the landscape and left behind a mix of boulders and talus slopes.

In the mid-ground, the image shows a series of ridges and valleys, which are characteristic of the mountain's glacially sculpted terrain. These features were formed by the movement of ice sheets that carved out U-shaped valleys and left behind a series of rounded hills and ridges.

At the summit, there is a prominent observation tower or weather station, which is likely used for scientific research and weather monitoring. The structure is situated at an elevation of approximately 6,288 feet (1,917 meters) above sea level, making it one of the highest points in the region.

The image also captures the atmospheric conditions on Mount Washington, with clouds and mist visible in the background. The mountain's unique location in a region where cold Arctic air meets warm moist air from the Gulf Stream creates a unique microclimate known as the "Home Rule," where extreme weather conditions can occur.

Overall, the image showcases the diverse geological and ecological features of Mount Washington, highlighting its role as a significant natural landmark in the Northeastern United States.

  1. Start the server
mistralrs serve vision -p 1234 --isq 4 -m openbmb/MiniCPM-o-2_6
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

const MODEL_ID: &str = "openbmb/MiniCPM-o-2_6";

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new(MODEL_ID)
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

MODEL_ID = "openbmb/MiniCPM-o-2_6"

runner = Runner(
    which=Which.VisionPlain(
        model_id=MODEL_ID,
        arch=VisionArchitecture.MiniCpmO,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Mistral Small 3.1 Model: mistralai/Mistral-Small-3.1-24B-Instruct-2503

The Mistral Small 3.1 model is a multimodal (text+vision) model with a 128k context length, function calling, and strong visual understanding.

We support the Mistral 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

Tool calling with Mistral Small 3.1

The Mistral Small 3.1 model itself does not ship with a Jinja chat template that enables tool calling. We provide a chat template for tool calling with Mistral Small 3.1, which you can use by specifying the jinja_explicit parameter in the various APIs. For example:

mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
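Once the server is running with this template, tool calls can be requested through the standard OpenAI tools parameter. The function schema below is purely illustrative (a hypothetical get_weather tool), sketching how a request might look rather than prescribing mistral.rs behaviour:

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=tools,
)
print(completion.choices[0].message.tool_calls)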

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image:

Credit

Prompt:

What is this?

Output:

The image shows a close-up of a vibrant flower with pink petals and a central cluster of yellowish-brown stamens. This flower appears to be from the genus *Gazania*, commonly known as treasure flowers or gazanias. These flowers are known for their daisy-like appearance and bright colors.

Gazania flowers typically have ray florets (the petal-like structures) that can change color based on light conditions—often appearing more vibrant in direct sunlight. They are popular in gardens for their hardiness and ability to thrive in sunny locations with well-drained soil.

If there's anything specific about this flower or its care that interests you further, feel free to ask!

  1. Start the server
mistralrs serve vision -p 1234 -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
  2. Send a request
from openai import OpenAI
import httpx
import textwrap
import json


client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")


completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/f/fd/Pink_flower.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is this?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

This is a minimal example of running the Mistral 3 model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
        arch=VisionArchitecture.Mistral3,
    ),
    in_situ_quant="4"
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Phi 3.5 Model: microsoft/Phi-3.5-MoE-instruct

The Phi 3.5 MoE model is a 16x3.8B parameter decoder-only text-to-text mixture-of-experts LLM.

  • Context length of 128k tokens
  • Trained on 4.9T tokens
  • 16 experts (16x3.8B parameters) with 6.6B active parameters
  • Expect inference performance of a 7B model

About the MoE mechanism

  1. Compute router gating logits
  2. From the router gating logits, select the top-2 selected experts and the associated weights
  3. The hidden state for each token in the sequence is computed by applying each selected expert to that token and weighting its output.
    • If multiple experts are selected for a token, the result is a weighted sum (see the sketch below this list)
    • The design is flexible: 2 or 1 experts can be selected, enabling dense or sparse gating
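
To make the routing concrete, here is a small illustrative Python sketch of top-2 gating over dummy logits. It only mirrors the steps listed above; it is not the model's actual implementation, and the exact normalization of the expert weights is an assumption.

import math

def top2_route(router_logits, expert_outputs):
    # 1. Rank experts by their gating logits and keep the top 2.
    top2 = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:2]
    # 2. Turn the selected logits into weights (softmax over the top 2 only; assumed normalization).
    exps = [math.exp(router_logits[i]) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # 3. Weighted sum of the selected experts' outputs for this token.
    dim = len(expert_outputs[0])
    return [sum(w * expert_outputs[i][d] for w, i in zip(weights, top2)) for d in range(dim)]

# Dummy data: 4 experts, hidden size 3, one token.
logits = [0.1, 2.3, -0.5, 1.7]
outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
print(top2_route(logits, outputs))

To run the model itself with 4-bit in-situ quantization via the CLI: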
mistralrs run --isq 4 -m microsoft/Phi-3.5-MoE-instruct

Note

This model supports MoQE, which can be activated via the ISQ organization parameter in the various APIs, as demonstrated below:

mistralrs run --isq 4 -m microsoft/Phi-3.5-MoE-instruct --isq-organization moqe

HTTP API

mistralrs serve --isq 4 -p 1234 -m microsoft/Phi-3.5-MoE-instruct
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

messages = []
prompt = input("Enter system prompt >>> ")
if len(prompt) > 0:
    messages.append({"role": "system", "content": prompt})


while True:
    prompt = input(">>> ")
    messages.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(
        model="default",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print(resp)
    messages.append({"role": "assistant", "content": resp})

Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="microsoft/Phi-3.5-MoE-instruct",
        arch=Architecture.Phi3_5MoE,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find this example here.

use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-MoE-instruct")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Phi 3 Vision Model: microsoft/Phi-3.5-vision-instruct

The Phi 3 Vision Model has support in the Rust, Python, and HTTP APIs. The Phi 3 Vision Model supports ISQ for increased performance.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

Note: The Phi 3 Vision model works best with one image, although sending multiple images is supported.

Note: when sending multiple images, they will be resized to the minimum dimension by which all will fit without cropping. Aspect ratio is not preserved in that case.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Credit

Prompt:

What is shown in this image? Write a detailed response analyzing the scene.

Output:

The image captures a breathtaking view of a mountain peak, bathed in the soft glow of sunlight. The peak, dusted with a layer of snow, stands tall against the backdrop of a clear blue sky. A trail, etched into the mountain's side by countless hikers before it, winds its way up to the summit. The trail's white color contrasts sharply with the surrounding landscape, drawing attention to its path and inviting exploration.

The perspective from which this photo is taken offers an expansive view of the mountain and its surroundings. It seems as if one could look down from this vantage point and see miles upon miles of untouched wilderness stretching out into the distance. The colors in the image are predominantly blue and white, reflecting both sky and snow-covered mountains respectively. However, there are also hints of green from trees dotting lower parts of mountainsides or valleys below them - adding another layer to this picturesque scene. This serene landscape evokes feelings of tranquility and adventure at once - an invitation to explore nature's grandeur while respecting its majesty at all times!

  1. Start the server
mistralrs serve vision -p 1234 -m microsoft/Phi-3.5-vision-instruct
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

This is a minimal example of running the Phi 3 Vision model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new("microsoft/Phi-3.5-vision-instruct")
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/e/e7/ Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Phi 4 Multimodal Model: microsoft/Phi-4-multimodal-instruct

The Phi 4 Multimodal Model has support in the Rust, Python, and HTTP APIs. The Phi 4 Multimodal Model supports ISQ for increased performance.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

Note: The Phi 4 Multimodal model works best with one image, although sending multiple images is supported.

Note: when sending multiple images, they will be resized to the minimum dimension by which all will fit without cropping. Aspect ratio is not preserved in that case.

Phi 4 Multimodal supports audio inputs!

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Credit

Prompt:

What is shown in this image? Write a detailed response analyzing the scene.

Output:

A mountain with snow on it.

  1. Start the server
mistralrs serve vision -p 1234 -m microsoft/Phi-4-multimodal-instruct
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

This is a minimal example of running the Phi 4 Multimodal model with a dummy image.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new("microsoft/Phi-4-multimodal-instruct")
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://cdn.britannica.com/45/5645-050-B9EC0205/head-treasure-flower-disk-flowers-inflorescence-ray.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-4-multimodal-instruct",
        arch=VisionArchitecture.Phi4MM,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/e/e7/Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Audio input

Alongside vision, Phi 4 Multimodal in mistral.rs can accept audio as an additional modality. This unlocks fully-local pipelines such as text + speech + vision → text where the model can reason jointly over what it hears and what it sees.

mistral.rs automatically decodes the supplied audio (WAV/MP3/FLAC/OGG/… – anything Symphonia can handle) into 16-bit PCM.

OpenAI HTTP API

Audio is delivered with the audio_url content-type that mirrors OpenAI's official specification:

{
  "role": "user",
  "content": [
    {
      "type": "audio_url",
      "audio_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg" }
    },
    {
      "type": "image_url",
      "image_url": { "url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg" }
    },
    {
      "type": "text",
      "text": "Describe what is happening in this clip in as much detail as possible."
    }
  ]
}
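
For completeness, here is a minimal sketch that sends the message above with the OpenAI Python client, assuming the server was started as in the earlier section (mistralrs serve vision -p 1234 -m microsoft/Phi-4-multimodal-instruct):

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                # Audio clip to reason over.
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg"},
                },
                # Accompanying image.
                {
                    "type": "image_url",
                    "image_url": {"url": "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg"},
                },
                {
                    "type": "text",
                    "text": "Describe what is happening in this clip in as much detail as possible.",
                },
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)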

Rust SDK

use anyhow::Result;
use mistralrs::{AudioInput, IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new("microsoft/Phi-4-multimodal-instruct")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let audio_bytes = reqwest::blocking::get(
        "https://upload.wikimedia.org/wikipedia/commons/4/42/Bird_singing.ogg",
    )?
    .bytes()?
    .to_vec();
    let audio = AudioInput::from_bytes(&audio_bytes)?;

    let image_bytes = reqwest::blocking::get(
        "https://www.allaboutbirds.org/guide/assets/og/528129121-1200px.jpg",
    )?
    .bytes()?
    .to_vec();
    let image = image::load_from_memory(&image_bytes)?;

    let messages = VisionMessages::new()
        .add_multimodal_message(
            TextMessageRole::User,
            "Describe in detail what is happening.",
            vec![image],
            vec![audio],
            &model,
        )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}

With this, you now have a single-call pipeline that fuses sound, vision, and text – all running locally through mistral.rs! 🔥

Qwen 2 Vision Model: Qwen2-VL Collection

Mistral.rs supports the Qwen2-VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported to allow running the model with reduced memory requirements.

UQFF quantizations are also available.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. The text model has 28 layers.


Interactive mode

Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.

  1. Start up interactive mode with the Qwen2-VL model
mistralrs run vision -m Qwen/Qwen2-VL-2B-Instruct
  2. Say hello!
> Hello!
Hello! How can I assist you today?
  3. Pass the model an image and ask a question.
> Hello!
Hello! How can I assist you today?
> \image https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg What type of flower is this? Give some fun facts.
flowers are a type of flowering plant that produce flowers that are typically used for decoration, pollination, and reproduction. there are many different types of flowers, each with its own unique characteristics and uses. here are some fun facts about camellias:

  * camellias are native to china and have been cultivated for over 2,000 years.
  * camellias are known for their long blooming season, with some varieties blooming continuously for months.
  * camellias come in a wide variety of colors, including red, pink, white, and yellow.
  * camellias are also known for their fragrant blooms, which can be enjoyed by both humans and animals.
  * camellias are often used in gardens and parks as a decorative element, and are also popular in landscaping and horticulture.

camellias are also known for their resilience and ability to thrive in a variety of conditions, making them a popular choice for gardeners and landscapers. they require well-draining soil and full sun or partial shade, and can be grown in containers or in the ground. overall, camellias are a beautiful and versatile flower that can add beauty and interest to any garden or landscape.

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.


Image: Flower

Prompt:

What type of flower is this? Give some fun facts.

Output:

flowers are a beautiful addition to any garden or outdoor space. They come in many different colors and shapes, and can be used for both decorative purposes and as sources of pollination for bees and other insects.

One fun fact about camellias is that they are native to Japan, but were introduced to Europe in the 17th century by Portuguese sailors who brought them back from their voyages around the world. Camellias have been popular as ornamental plants since then, with many varieties available for cultivation.

Camellias also have interesting cultural significance in Japan, where they are often associated with good fortune and prosperity. In Chinese culture, camellias symbolize longevity and immortality.
In conclusion, camellias are beautiful flowers that add color and interest to gardens or outdoor spaces. They come in many different colors and shapes, making them a popular choice for gardeners everywhere!

  1. Start the server
mistralrs serve vision -p 1234 -m Qwen/Qwen2-VL-2B-Instruct
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What type of flower is this? Give some fun facts.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

const MODEL_ID: &str = "Qwen/Qwen2-VL-2B-Instruct";

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new(MODEL_ID)
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What type of flower is this? Give some fun facts.",
        image,
        &model
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"

runner = Runner(
    which=Which.VisionPlain(
        model_id=MODEL_ID,
        arch=VisionArchitecture.Qwen2VL,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What type of flower is this? Give some fun facts.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Qwen 3 Vision Model: Qwen3 VL Collection

The Qwen 3 VL models are the successors to the Qwen 2.5 VL models, featuring a diverse lineup with improved performance, flexible sizes, and reasoning-capable variants.

Note: Support for the MoE variants is not yet implemented. This is coming very soon!

Mistral.rs supports the Qwen 3 VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported to allow running the model with reduced memory requirements.

UQFF quantizations are also available.

The Python and HTTP APIs support sending images as:

  • URL
  • Path to a local image
  • Base64 encoded string

The Rust SDK takes an image from the image crate.

Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters.


Interactive mode

Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.

Start up interactive mode with the Qwen3 VL model:

mistralrs run vision -m Qwen/Qwen3-VL-4B-Instruct

HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The image_url may be either a path, URL, or a base64 encoded string.

  1. Start the server
mistralrs serve vision -p 1234 -m Qwen/Qwen3-VL-4B-Instruct
  2. Send a request
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What type of flower is this? Give some fun facts.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)


Rust

You can find this example here.

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new("Qwen/Qwen3-VL-4B-Instruct")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is this?",
        vec![image],
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the image_url may be either a path, URL, or a base64 encoded string.

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

MODEL_ID = "Qwen/Qwen3-VL-4B-Thinking"

runner = Runner(
    which=Which.VisionPlain(
        model_id=MODEL_ID,
        arch=VisionArchitecture.Qwen3VL,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.garden-treasures.com/cdn/shop/products/IMG_6245.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What type of flower is this? Give some fun facts.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

FLUX.1 Model: black-forest-labs/FLUX.1-schnell

The FLUX model is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions.

We support both the -schnell and -dev versions of the model.

Memory usage

The FLUX model itself is 12 billion parameters (~24GB), and the T5 XXL encoder model it uses requires ~9GB. We support loading the models fully onto the GPU, which allows much faster inference. If you do not have enough memory, try the offloaded (-offloaded or -Offloaded) model types. These will load the model on the CPU but perform computations on the GPU.

  • Normal: ~33GB memory, 9.4 s generation time on an A100
  • Offloaded: ~4GB memory, 92.7 s generation time on an A100

HTTP server

The OpenAI HTTP server provides a compatible way to easily use this implementation. As per the specification, output images can be returned either as local file paths or as base64-encoded data.

mistralrs serve diffusion -p 1234 -m black-forest-labs/FLUX.1-schnell -a flux

After this, you can send requests via the HTTP server:

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

result = client.images.generate(
    model="default",
    prompt="A vibrant sunset in the mountains, 4k, high quality.",
    n=1,
)
print(result.data[0].url)
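
The example above returns a local path in url. The OpenAI images API also defines a b64_json response format; assuming the server honors it (as the base64 support mentioned above suggests), the image can be received inline and saved locally, reusing the client from the snippet above:

import base64

result_b64 = client.images.generate(
    model="default",
    prompt="A vibrant sunset in the mountains, 4k, high quality.",
    n=1,
    response_format="b64_json",  # ask for inline base64 instead of a path/URL
)
# Decode and save the returned image; the PNG extension is an assumption.
with open("sunset.png", "wb") as f:
    f.write(base64.b64decode(result_b64.data[0].b64_json))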

Rust example

use std::time::Instant;

use anyhow::Result;
use mistralrs::{DiffusionLoaderType, DiffusionModelBuilder, ImageGenerationResponseFormat};

#[tokio::main]
async fn main() -> Result<()> {
    let model = DiffusionModelBuilder::new(
        "black-forest-labs/FLUX.1-schnell",
        DiffusionLoaderType::FluxOffloaded,
    )
    .with_logging()
    .build()
    .await?;

    let start = Instant::now();

    let response = model
        .generate_image(
            "A vibrant sunset in the mountains, 4k, high quality.".to_string(),
            ImageGenerationResponseFormat::Url,
        )
        .await?;

    let finished = Instant::now();

    println!(
        "Done! Took {} s. Image saved at: {}",
        finished.duration_since(start).as_secs_f32(),
        response.data[0].url.as_ref().unwrap()
    );

    Ok(())
}

Python example

from mistralrs import (
    Runner,
    Which,
    DiffusionArchitecture,
    ImageGenerationResponseFormat,
)

runner = Runner(
    which=Which.DiffusionPlain(
        model_id="black-forest-labs/FLUX.1-schnell",
        arch=DiffusionArchitecture.FluxOffloaded,
    ),
)

res = runner.generate_image(
    "A vibrant sunset in the mountains, 4k, high quality.",
    ImageGenerationResponseFormat.Url,
)
print(res.choices[0].url)

Dia 1.6b Model: nari-labs/Dia-1.6B

Dia is a 1.6B parameter text to speech model created by Nari Labs. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.

  • Generate dialogue via the [S1] and [S2] tags
  • Generate non-verbal sounds like (laughs), (coughs), etc.
  • The following non-verbal tags will be recognized, but might result in unexpected output: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)

Note: voice cloning support is coming!

HTTP server

The OpenAI HTTP server provides a drop-in compatible way to easily use Dia locally!

Note: we only support pcm and wav outputs.

mistralrs serve speech -p 1234 -m nari-labs/Dia-1.6B -a dia

After this, you can send requests via the HTTP server:

from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github."

response = client.audio.speech.create(
    model="default", voice="N/A", input=text_to_speak, response_format="wav"
)

output_path = Path("output.wav")
output_path.write_bytes(response.read())
print(f"WAV audio written to {output_path.resolve()}")

Rust example

use std::time::Instant;

use anyhow::Result;
use mistralrs::{speech_utils, SpeechLoaderType, SpeechModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = SpeechModelBuilder::new("nari-labs/Dia-1.6B", SpeechLoaderType::Dia)
        .with_logging()
        .build()
        .await?;

    let start = Instant::now();

    // let text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face.";
    let text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github.";

    let (pcm, rate, channels) = model.generate_speech(text_to_speak).await?;

    let finished = Instant::now();

    let mut output = std::fs::File::create("out.wav").unwrap();
    speech_utils::write_pcm_as_wav(&mut output, &pcm, rate as u32, channels as u16).unwrap();

    println!(
        "Done! Took {} s. Audio saved at `out.wav`.",
        finished.duration_since(start).as_secs_f32(),
    );

    Ok(())
}

Python example

from mistralrs import (
    Runner,
    Which,
    SpeechLoaderType,
)
from pathlib import Path
import wave, struct

# text_to_speak = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
text_to_speak = "[S1] mistral r s is a local LLM inference engine. [S2] You can run text and vision models, and also image generation and speech generation. [S1] There is agentic web search, tool calling, and a convenient Python SDK. [S2] Check it out on github."

runner = Runner(
    which=Which.Speech(
        model_id="nari-labs/Dia-1.6B",
        arch=SpeechLoaderType.Dia,
    ),
)

res = runner.generate_speech(text_to_speak)

pcm_data = res.pcm  # list of floats between -1.0 and 1.0
output_path = Path("output.wav")

# convert floats to 16-bit PCM ints
pcm_ints = [max(-32768, min(32767, int(sample * 32767))) for sample in pcm_data]
with wave.open(str(output_path), "wb") as wf:
    wf.setnchannels(res.channels)  # channel count from the response
    wf.setsampwidth(2)  # 2 bytes per sample (16-bit PCM)
    wf.setframerate(res.rate)  # sample rate from the response
    wf.writeframes(b"".join(struct.pack("<h", s) for s in pcm_ints))

print(f"WAV audio written to {output_path.resolve()}")

EmbeddingGemma

EmbeddingGemma was the first embedding model supported by mistral.rs. This guide walks through serving the model via the OpenAI-compatible HTTP server, running it from Python, and embedding text directly in Rust.

For a catalog of available embedding models and general usage tips, see EMBEDDINGS.md.

Prompt instructions

EmbeddingGemma can generate optimized embeddings for various use cases (such as document retrieval, question answering, and fact verification) and for specific input types (either a query or a document) using prompts that are prepended to the input strings.

  • Query prompts follow the form task: {task description} | query: where the task description varies by the use case, with the default task description being search result.
  • Document-style prompts follow the form title: {title | "none"} | text: where the title is either none (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.
Use cases (task type enum), descriptions, and recommended prompts:

  • Retrieval (Query): embeddings optimized for document search or information retrieval. Prompt: task: search result | query: {content}
  • Retrieval (Document): embeddings optimized for document search or information retrieval (document side). Prompt: title: {title | "none"} | text: {content}
  • Question Answering: embeddings optimized for answering natural language questions. Prompt: task: question answering | query: {content}
  • Fact Verification: embeddings optimized for verifying factual correctness. Prompt: task: fact checking | query: {content}
  • Classification: embeddings optimized to classify texts according to preset labels. Prompt: task: classification | query: {content}
  • Clustering: embeddings optimized to cluster texts based on their similarities. Prompt: task: clustering | query: {content}
  • Semantic Similarity: embeddings optimized to assess text similarity; not intended for retrieval use cases. Prompt: task: sentence similarity | query: {content}
  • Code Retrieval: used to retrieve a code block based on a natural language query, such as "sort an array" or "reverse a linked list"; embeddings of code blocks are computed using retrieval_document. Prompt: task: code retrieval | query: {content}
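
If you prefer not to hand-write these strings, the prompt formats from the table above can be produced with two small helpers; this is an illustrative sketch, and the function names are not part of any SDK:

def query_prompt(content: str, task: str = "search result") -> str:
    # Query-style prompt: task: {task description} | query:
    return f"task: {task} | query: {content}"

def document_prompt(content: str, title: str | None = None) -> str:
    # Document-style prompt: title: {title | "none"} | text:
    return f"title: {title or 'none'} | text: {content}"

print(query_prompt("What is graphene?"))                        # retrieval (query)
print(query_prompt("Is the sky green?", task="fact checking"))  # fact verification
print(document_prompt("Graphene is a single layer of carbon atoms."))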

HTTP server

Launch the server in embedding mode to expose an OpenAI-compatible /v1/embeddings endpoint:

mistralrs serve -p 1234 -m google/embeddinggemma-300m

Once running, call the endpoint with an OpenAI client or raw curl:

curl http://localhost:1234/v1/embeddings \
  -H "Authorization: Bearer EMPTY" \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "input": ["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"]}'

An example with the OpenAI client can be found here.
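
For reference, a minimal sketch of the same request with the OpenAI Python client (the API key is a placeholder, mirroring the curl example above):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:1234/v1")

resp = client.embeddings.create(
    model="default",
    input=[
        "task: search result | query: What is graphene?",
        "task: search result | query: What is an apple?",
    ],
)
# Number of embeddings returned and the dimensionality of the first one.
print(len(resp.data), len(resp.data[0].embedding))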

By default the server registers the model as default. To expose it under a custom name or alongside chat models, run in multi-model mode and assign an identifier in the selector configuration:

{
  "embed-gemma": {
    "Embedding": {
      "model_id": "google/embeddinggemma-300m",
      "arch": "embeddinggemma"
    }
  }
}

See docs/HTTP.md for the full request schema and response layout.

Python SDK

Instantiate Runner with the Which.Embedding selector and request EmbeddingGemma explicitly. The helper method send_embedding_request returns batched embeddings as Python lists.

from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which

runner = Runner(
    which=Which.Embedding(
        model_id="google/embeddinggemma-300m",
        arch=EmbeddingArchitecture.EmbeddingGemma,
    )
)

request = EmbeddingRequest(
    input=["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"],
    truncate_sequence=True,
)

embeddings = runner.send_embedding_request(request)
print(len(embeddings), len(embeddings[0]))

Refer to this example for a complete runnable script.

Rust SDK

Use the EmbeddingModelBuilder helper from the mistralrs crate to create the model and submit an EmbeddingRequest:

use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};

#[tokio::main]
async fn main() -> Result<()> {
    let model = EmbeddingModelBuilder::new("google/embeddinggemma-300m")
        .with_logging()
        .build()
        .await?;

    let embeddings = model
        .generate_embeddings(
            EmbeddingRequest::builder()
                .add_prompt("task: search result | query: What is graphene?")
        )
        .await?;

    println!("Returned {} vectors", embeddings.len());
    Ok(())
}

This example lives here, and can be run with:

cargo run --package mistralrs --example embedding_gemma

Qwen3 Embedding

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks.

For a catalog of all embedding backends, see EMBEDDINGS.md.

HTTP server

Serve the model with the OpenAI-compatible endpoint enabled:

mistralrs serve -p 1234 -m Qwen/Qwen3-Embedding-0.6B

Call the endpoint via curl or the OpenAI SDK:

curl http://localhost:1234/v1/embeddings \
  -H "Authorization: Bearer EMPTY" \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "input": ["Graphene conductivity", "Explain superconductors in simple terms."]}'

An example with the OpenAI client can be found here.

To expose the model alongside chat models, register it in your selector configuration using the qwen3embedding architecture tag:

{
  "embed-qwen3": {
    "Embedding": {
      "model_id": "Qwen/Qwen3-Embedding-0.6B",
      "arch": "qwen3embedding"
    }
  }
}

See docs/HTTP.md for the full request schema.

Python SDK

Instantiate Runner with the embedding selector and request Qwen3 explicitly. The output mirrors the OpenAI embeddings array shape:

from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which

runner = Runner(
    which=Which.Embedding(
        model_id="Qwen/Qwen3-Embedding-0.6B",
        arch=EmbeddingArchitecture.Qwen3Embedding,
    )
)

request = EmbeddingRequest(
    input=["Graphene conductivity", "Explain superconductors in simple terms."],
    truncate_sequence=True,
)

embeddings = runner.send_embedding_request(request)
print(len(embeddings), len(embeddings[0]))

A ready-to-run version can be found at examples/python/qwen3_embedding.py.
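
Because the embeddings come back as plain Python lists, ranking is straightforward. A small illustrative follow-up to the snippet above that compares the two inputs with cosine similarity (the cosine helper is not part of the SDK):

import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# `embeddings` comes from runner.send_embedding_request(request) above.
print("similarity:", cosine(embeddings[0], embeddings[1]))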

Rust SDK

Use the EmbeddingModelBuilder helper just like with EmbeddingGemma. The example below mirrors the repository sample:

use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};

#[tokio::main]
async fn main() -> Result<()> {
    let model = EmbeddingModelBuilder::new("Qwen/Qwen3-Embedding-0.6B")
        .with_logging()
        .build()
        .await?;

    let embeddings = model
        .generate_embeddings(
            EmbeddingRequest::builder()
                .add_prompt("What is graphene?")
                .add_prompt("Explain superconductors in simple terms.")
        )
        .await?;

    println!("Returned {} vectors", embeddings.len());
    Ok(())
}

You can find the full example at mistralrs/examples/qwen3_embedding/main.rs.

Quantization in mistral.rs

Mistral.rs supports the following quantization:

  • ⭐ ISQ (read more detail)
    • Supported in all plain/vision and adapter models
    • Works on all supported devices
    • Automatic selection to use the fastest and most accurate method
    • Supports:
      • Q, K type GGUF quants
      • AFQ
      • HQQ
      • FP8
  • GGUF/GGML
    • Q, K type
    • Supported in GGUF/GGML and GGUF/GGML adapter models
    • Supported in all plain/vision and adapter models
    • Imatrix quantization is supported
    • I quants coming!
    • CPU, CUDA, Metal (all supported devices)
    • 2, 3, 4, 5, 6, 8 bit
  • GPTQ (convert with this script)
    • Supported in all plain/vision and adapter models
    • CUDA only
    • 2, 3, 4, 8 bit
    • Marlin kernel support in 4-bit and 8-bit.
  • AWQ (convert with this script)
    • Supported in all plain/vision and adapter models
    • CUDA only
    • 4 and 8 bit
    • Marlin kernel support in 4-bit and 8-bit.
  • HQQ
    • Supported in all plain/vision and adapter models via ISQ
    • 4, 8 bit
    • CPU, CUDA, Metal (all supported devices)
  • FP8
    • Supported in all plain/vision and adapter models
    • CPU, CUDA, Metal (all supported devices)
  • BNB
    • Supported in all plain/vision and adapter models
    • bitsandbytes int8, fp4, nf4 support
  • AFQ
    • 2, 3, 4, 6, 8 bit
    • 🔥 Designed to be fast on Metal!
    • Only supported on Metal.
  • MLX prequantized
    • Supported in all plain/vision and adapter models

Using a GGUF quantized model

  • Use the gguf (cli) / GGUF (Python) model selector
  • Provide the GGUF file
mistralrs run --format gguf -f my-gguf-file.gguf

Using ISQ

See the docs

mistralrs run --isq 4 -m microsoft/Phi-3-mini-4k-instruct

Using a GPTQ quantized model

  • Provide the model ID for the GPTQ model
  • Mistral.rs will automatically detect and use GPTQ quantization for plain and vision models!
  • The Marlin kernel will automatically be used for 4-bit and 8-bit.
mistralrs run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit

You can create your own GPTQ model using scripts/convert_to_gptq.py:

pip install gptqmodel transformers datasets

python3 scripts/convert_to_gptq.py --src path/to/model --dst output/model/path --bits 4

Using a MLX prequantized model (on Metal)

  • Provide the model ID for the MLX prequantized model
  • Mistral.rs will automatically detect and use quantization for plain and vision models!
  • Specialized kernels will be used to accelerate inference!
mistralrs run -m mlx-community/Llama-3.2-1B-Instruct-8bit

In situ quantization

In situ quantization works by quantizing models in place, with the chief benefit being a reduced memory footprint when running the model. This enables larger models to be run on devices that could not fit the full weights, and may increase inference performance.

Quick start: Just use --isq 4 (or 2, 3, 5, 6, 8) and mistral.rs will pick the best quantization for your hardware:

mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct

An API is exposed on the Python and Rust SDKs which provides the ability to dynamically re-ISQ models at runtime.

To set the ISQ type for individual layers, use a model topology.

Note: 🔥 AFQ (affine) quantization is designed to be fast on Metal but is only supported on Metal.

Automatic ISQ (just use a number!)

Instead of specifying a quantization type like Q4K, you can just pass an integer (2, 3, 4, 5, 6, or 8) and mistral.rs will automatically select the best quantization method for your platform.

On Metal, this uses fast AFQ quantization (for 2, 3, 4, 6, or 8 bits). On other platforms, it falls back to Q/K quantization.
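
The selection rule can be summarized in a small sketch. This only illustrates the behavior described above; the exact bit-width-to-type mapping on non-Metal platforms is an assumption, not the actual implementation:

def auto_isq(bits: int, platform: str) -> str:
    # Metal prefers the fast AFQ kernels where an AFQ type exists for the bit width.
    if platform == "metal" and bits in (2, 3, 4, 6, 8):
        return f"AFQ{bits}"
    # Elsewhere, fall back to Q/K-style quantization (illustrative mapping only).
    return {2: "Q2K", 3: "Q3K", 4: "Q4K", 5: "Q5K", 6: "Q6K", 8: "Q8_0"}[bits]

print(auto_isq(4, "metal"), auto_isq(4, "cuda"))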

mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct

ISQ quantization types

  • AFQ2 (AFQ is only available on Metal)
  • AFQ3
  • AFQ4
  • AFQ6
  • AFQ8
  • Q4_0
  • Q4_1
  • Q5_0
  • Q5_1
  • Q8_0
  • Q8_1 (not available on CUDA)
  • Q2K
  • Q3K
  • Q4K
  • Q5K
  • Q6K
  • Q8K (not available on CUDA)
  • HQQ4
  • HQQ8
  • FP8
mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct

When using ISQ, it will automatically load ISQ-able weights into CPU memory before applying ISQ. The ISQ application process moves the weights to device memory. This process is implemented to avoid memory spikes from loading the model in full precision.

For Mixture of Expert models, a method called MoQE can be applied to only quantize MoE layers. This is configured via the ISQ “organization” parameter in all APIs. The following models support MoQE:

Accuracy

Accuracy of ISQ can be measured by the performance degradation versus the unquantized model. This is commonly measured with perplexity. Please see the perplexity example.

To improve the accuracy of a model with ISQ, use an imatrix file. These can be found online (for example, on Hugging Face), and should be passed with the --imatrix flag for plain models. This will increase the accuracy of the quantization significantly and bring the ISQ quantization up to par with the GGUF counterpart.

Check out the imatrix docs.

Python Example

runner = Runner(
    which=Which.Plain(
        model_id="Qwen/Qwen3-0.6B",
    ),
    in_situ_quant="4",
)

Rust Example

You can find this example here.

#![allow(unused)]
fn main() {
let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
    .with_isq(IsqType::Q8_0)
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;
}

Server example

mistralrs serve --port 1234 --isq 4 -m mistralai/Mistral-7B-Instruct-v0.1

Or with a specific quantization type:

mistralrs serve --port 1234 --isq Q4K -m mistralai/Mistral-7B-Instruct-v0.1

Universal Quantized File Format: UQFF

The uniquely powerful quantized file format.

  1. Flexible 🌀: Multiple quantization formats in one file format with one framework to run them all.
  2. Reliable 🔒: Compatibility ensured with embedded and checked semantic versioning information from day 1.
  3. Easy 🤗: Download UQFF models easily and quickly from Hugging Face, or use a local file.
  4. Customizable 🛠️: Make and publish your own UQFF files in minutes.


Motivation

UQFF builds on our ISQ feature by allowing serialization and deserialization for models.

While ISQ is a powerful feature enabling easy quantization of models, the key limitation has been the time required for requantization. While the process is relatively fast with parallelization and other techniques, multiple runs can make the experience slow.

Comparing UQFF to GGUF:

In contrast to GGUF, which only supports the GGUF quantizations, UQFF is designed with flexibility in mind. At its core, it extends the power and flexibility of ISQ. The ability to support multiple quantization types (more to come!) in one simple, easy-to-use file is a critical feature.

Additionally, users will no longer need to wait for GGUF support to begin using post-training quantized models. As we add new models and quantization schemes to mistral.rs, the feature set of UQFF will grow.

Support

The following quantization formats are supported in UQFF. These can, of course, be combined arbitrarily during UQFF generation or ISQ using a model topology. When loading a UQFF model, only the per-layer device mapping feature of the topology applies.

  • GGUF quantized:

    • Q4_0
    • Q4_1
    • Q5_0
    • Q5_1
    • Q8_0
    • Q8_1 (not available on CUDA)
    • Q2K
    • Q3K
    • Q4K
    • Q5K
    • Q6K
    • Q8K (not available on CUDA)
  • HQQ quantized:

    • HQQ4
    • HQQ8
  • FP8:

    • FP8 E4M3 (4-bit exponent, 3-bit mantissa)
  • AFQ quantized (🔥 AFQ is fast on Metal):

    • AFQ2
    • AFQ3
    • AFQ4
    • AFQ6
    • AFQ8

Loading a UQFF model

To load a UQFF model, specify the UQFF filename. It is resolved based on the model ID and can be loaded from a local path or from Hugging Face. For example:

  • phi3.5-mini-instruct-q4k.uqff
  • ../UQFF/phi3.5-mini-instruct-q4k.uqff

You can find a collection of UQFF models here, which each include a simple command to get started.

Note: when loading a UQFF model, any ISQ setting will be ignored.

Running with the CLI

mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-f8e4m3.uqff

Using with the Rust SDK

Check out the following examples:

Using the Python SDK

Modify the Which instantiation as follows:

Which.Plain(
    model_id="EricB/Phi-3.5-mini-instruct-UQFF",
+   from_uqff="phi3.5-mini-instruct-q4k.uqff"
),

Using topology for device mapping with UQFF

When loading a UQFF model, the quantization is already baked in, so ISQ settings in the topology are ignored. However, device mapping from a topology file still applies. This is useful for splitting a pre-quantized model across multiple GPUs or offloading layers to CPU.

CLI example:

mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-q4k.uqff --topology device_map.yml

Topology file for device mapping only (device_map.yml):

0-16:
  device: cuda[0]
16-32:
  device: cuda[1]

Rust SDK example:

#![allow(unused)]
fn main() {
use mistralrs::{UqffTextModelBuilder, Topology, LayerTopology, Device};

let model = UqffTextModelBuilder::new(
    "EricB/Phi-3.5-mini-instruct-UQFF",
    vec!["phi3.5-mini-instruct-q4k.uqff".into()],
)
.into_inner()
.with_topology(
    Topology::empty()
        .with_range(0..16, LayerTopology { isq: None, device: Some(Device::Cuda(0)) })
        .with_range(16..32, LayerTopology { isq: None, device: Some(Device::Cuda(1)) })
)
.build()
.await?;
}

Python SDK example:

runner = Runner(
    which=Which.Plain(
        model_id="EricB/Phi-3.5-mini-instruct-UQFF",
        from_uqff="phi3.5-mini-instruct-q4k.uqff",
        topology="device_map.yml",
    ),
)

Note: The isq field in topology entries is ignored when loading UQFF models since quantization is pre-applied.

Creating a UQFF model

Creating a UQFF model requires you to generate the UQFF file.

  • This means specifying a local path to a file ending in .uqff, where your new UQFF model will be created.
  • The quantization of a UQFF model is determined from the ISQ or model topology (see the topology docs for more details on how ISQ and the topology mix).

Along with the UQFF file, the generation process will also output several .json configuration files and residual.safetensors. All of these files are considered the UQFF model, and should be kept together or uploaded.

Note: Only the .uqff files are unique to the quantization level(s). If you are generating multiple UQFF files, it is OK for the others to be overwritten.

After creating the UQFF file, you can upload the model to Hugging Face. To do this:

  1. Create a new model.
  2. Upload the UQFF file.
  3. Locally, generate the model card file with this Python script.
  4. In the web interface, press the Create Model Card button and paste the generated model card.

⭐ Check out uqff_maker to make UQFF models with an easy CLI!

mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq 4 -o phi3.5-mini-instruct-q4k.uqff

Upload with Git

To upload a UQFF model using Git, you will most likely need to set up Git LFS:

  1. Install git-lfs
  2. Run git lfs install
  3. (If the files are larger than 5GB) Run huggingface-cli lfs-enable-largefiles . (you will need to pip install huggingface_hub)

After this, you can use Git to track, commit, and push files.

List of models

You can find a list of models in the Hugging Face model collection.

Have you created a UQFF model on Hugging Face? If so, please create an issue.

UQFF internal structure

The following describes the exact memory layout of UQFF tensors of version 0.1.0.


GGUF quantization

Each field is listed as name: element type, endianness.

  • UQFF version: u32, little endian
  • ISQ type (0): u8, little endian
  • Tensor data length in bytes: u32, little endian
  • Whether bias data is included (boolean): u8, little endian
  • Quantized dtype: u32, little endian
  • Num shape dims: u32, little endian
  • Array quantized weight shape dims: u32, little endian
  • Array quantized weight data: u8, little endian
  • [Optional] Array Bias tensor data: see docs

Unquantized layers

  • UQFF version: u32, little endian
  • ISQ type (1): u8, little endian
  • Whether bias data is included (boolean): u8, little endian
  • Array Weight tensor data: see docs
  • [Optional] Array Bias tensor data: see docs

FP8 layers

  • UQFF version: u32, little endian
  • ISQ type (1): u8, little endian
  • Whether bias data is included (boolean): u8, little endian
  • Array Weight tensor data: see docs
  • Dequant W scalar: f32, little endian
  • Dequant X scalar: f32, little endian
  • Quant scalar: f32, little endian
  • Quantization type: u32, little endian
  • [Optional] Array Bias tensor data: see docs

HQQ quantization

  • UQFF version: u32, little endian
  • ISQ type (2): u8, little endian
  • Whether bias data is included (boolean): u8, little endian
  • Array Q weight: see docs
  • Array Q scale: see docs
  • Array Q zeroes: see docs
  • Dequant weight num shape dims: u32, little endian
  • Array dequant weight shape dims: u32, little endian
  • CFG bits: u8, little endian
  • CFG group size: u32, little endian
  • CFG axis: u8, little endian
  • CFG optimization steps (0 means Option::None for now): u32, little endian
  • CFG round zeroes (boolean): u8, little endian
  • CFG channel wise (boolean): u8, little endian

FP8 layers

  • UQFF version: u32, little endian
  • ISQ type (3): u8, little endian
  • Whether bias data is included (boolean): u8, little endian
  • Array Weight tensor data: see docs
  • Dequant scale W: f32, little endian
  • Dequant scale X: f32, little endian
  • Quant scale: f32, little endian
  • Layer dtype: u32, little endian
  • [Optional] Array Bias tensor data: see docs

Standard tensors

  • Tensor data length in bytes: u32, little endian
  • Tensor dtype: u32, little endian
  • Num shape dims: u32, little endian
  • Array shape dims: u32, little endian
  • Array flattened (contiguous) tensor data: u8, little endian

Model topology configuration

Quantization and device mapping in one file.

Note

Manual device mapping flags are deprecated in favor of automatic placement because it is easy to misconfigure them. Topology files remain the preferred way to express per-layer quantization, and you can still provide device overrides here when you truly need to. Those overrides win over the automatic mapper, so apply them sparingly. See the device mapping documentation for guidance.

Use a simple model topology to configure per-layer ISQ and device mapping with a single YAML file (examples here)!

To support per-layer mix of ISQ, Mistral.rs supports loading a model topology YAML file. This YAML file is formatted as follows:

  1. Top-level keys are either:
    • A range of layers (start-end) where start < end. start is inclusive and end is exclusive
    • A single layer number
  2. The topology for the range or layer:
    • An optional key (isq) which maps to a single value, which can be any ISQ type. If not specified, no ISQ is applied to this range of layers.
    • An optional key (device) which maps to a single value, which is one of the below. If not specified, the default loading device will be used.
      • cpu
      • cuda[ORDINAL]
      • metal[ORDINAL]

Note that:

  • The topology for the range is expanded to fill the range
  • If ranges overlap, the range with the higher end layer takes precedence. When two ranges share the same end layer, the one that appears later in the topology file wins.
  • Any layers which are not covered by the topology have no topology mapping. They will inherit any other ISQ setting (e.g. with --isq/in_situ_quant).
  • If a layer is covered by the topology, the topology value will override any other ISQ setting (e.g. with --isq/in_situ_quant).
  • The topology device mapping will override any other device mapping.

Using topology with UQFF models

When loading a UQFF model, the quantization is already applied during UQFF creation. Therefore:

  • ISQ settings in the topology are ignored - the pre-quantized weights are used as-is
  • Device mapping still applies - you can split layers across GPUs or offload to CPU

This is useful for deploying pre-quantized models across multiple devices without re-quantizing.

Example topology for UQFF device mapping:

# Only device mapping is used; isq would be ignored
0-16:
  device: cuda[0]
16-32:
  device: cuda[1]

See the UQFF documentation for complete examples.

Regex selectors

Layer ranges are convenient when you know the numeric index, but you can also target weights by name. Keys wrapped in /.../ are interpreted as regular expressions that are matched against the fully qualified tensor name (for example, model.layers.3.attn.q_proj.weight). Regex selectors may override both isq and device.

'/attn\.q_proj$/':
  isq: Q4K
'/ffn_.*\.weight$/':
  isq: Q3K

Regex-based ISQ overrides are applied through the immediate ISQ system, so they quantize weights as they are loaded. Numeric layer ranges continue to be handled by the post-load topology pass. Regex selectors are evaluated top-to-bottom as they appear in the YAML file, so a selector that comes later in the file overrides earlier matches.

0-8:
  isq: Q3K
  device: cuda[0]
8-16:
  isq: Q4K
  device: cpu
16-24:
  isq: Q6K
# Skip 24-28
28-32:
  isq: Q8_0
  device: cuda[0]

Model topologies may be applied to all model types.

CLI example

mistralrs run -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml

HTTP server example

mistralrs serve -p 1234 -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml

Rust example

Example here.

Python example

Example here.

Enhancing ISQ with an imatrix

Mistral.rs supports enhancing the performance of models quantized with ISQ by collecting an imatrix from calibration data. The following quantizations are supported with an imatrix:

  • Q2K
  • Q3K
  • Q4K
  • Q5K
  • Q6K

What is an imatrix? An imatrix (importance matrix) is generated from data collected during the execution of the model on calibration data. This data is used to enhance the performance of the model by enabling a weighted RMSE minimization when quantizing the tensor. For more information, see the original PR.
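
As a toy illustration of the weighted-RMSE idea (this is not llama.cpp's or mistral.rs's actual quantization algorithm), an importance-weighted search for a quantization scale might look like:

import numpy as np

def weighted_rmse_scale(w: np.ndarray, importance: np.ndarray, n_bits: int = 4) -> float:
    """Pick a symmetric quantization scale that minimizes the importance-weighted
    squared error, instead of the plain (unweighted) squared error."""
    qmax = 2 ** (n_bits - 1) - 1
    w_absmax = float(np.abs(w).max())
    if w_absmax == 0.0:
        return 1.0  # nothing to quantize
    best_scale, best_err = w_absmax / qmax, np.inf
    # Search a small grid of candidate scales around the max-abs heuristic.
    for factor in np.linspace(0.7, 1.3, 61):
        scale = factor * w_absmax / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = float(np.sum(importance * (w - q * scale) ** 2))  # weighted objective
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale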

Using an imatrix causes the quantization process to take longer as the data must be collected, but there is no inference-time performance decrease.

Note: mistral.rs will automatically generate a .cimatrix file which can be used within mistral.rs as a replacement for a .imatrix file. The primary advantage is the in-situ generation within mistral.rs. The format is incompatible with llama.cpp.

To use this, simply specify the calibration data file in the various APIs as detailed below.

With the CLI

mistralrs run --isq 4 -m meta-llama/Llama-3.2-3B-Instruct --calibration-file calibration_data/calibration_datav3_small.txt

With the Rust SDK

You can find this example here.

#![allow(unused)]
fn main() {
let model = TextModelBuilder::new("meta-llama/Llama-3.2-3B-Instruct")
    .with_isq(IsqType::Q4K)
    .with_calibration_file("calibration_data/calibration_datav3_small.txt".into())
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;
}

With the Python SDK

You can find this example here.

runner = Runner(
    which=Which.Plain(
        model_id="meta-llama/Llama-3.2-3B-Instruct",
        calibration_file="calibration_data/calibration_datav3_small.txt"
    ),
    in_situ_quant="4",
)

Adapter model support

An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting an XLora* architecture, and LoRA support by selecting the Lora* architecture. For both X-LoRA and LoRA, an ordering file (see this section for preparing the ordering file) must be provided. The ordering file describes the ordering of layers and which adapters to use (and what order to use them in for X-LoRA).

When using an adapter model with a quantized base model, if the ordering file specifies unsupported layers you will receive an error.

Supported X-LoRA or LoRA quantized layers

Llama architecture:

  • model.layers.{layer_idx}.self_attn.q_proj
  • model.layers.{layer_idx}.self_attn.k_proj
  • model.layers.{layer_idx}.self_attn.v_proj
  • model.layers.{layer_idx}.self_attn.o_proj
  • model.layers.{layer_idx}.mlp.up_proj
  • model.layers.{layer_idx}.mlp.down_proj
  • model.layers.{layer_idx}.mlp.gate_proj
  • lm_head

Phi 3 architecture:

  • model.layers.{layer_idx}.self_attn.qkv_proj
  • model.layers.{layer_idx}.self_attn.o_proj
  • model.layers.{layer_idx}.mlp.gate_up_proj
  • model.layers.{layer_idx}.mlp.down_proj
  • lm_head

Adapter ordering file

The X-LoRA/LoRA ordering file must be prepared before inference with an X-LoRA or LoRA model. Fortunately, this is easy with the provided scripts!

X-LoRA case

An ordering JSON file for X-LoRA contains 2 major parts.

  1. The adapter names order
    • The order matters!
    • Should be an array of strings which are the adapter names corresponding to the order the adapters were specified during training. For example, if the adapters were specified as a dictionary:

      adapters = {
          "math": ...,
          "reasoning": ...,
          "biology": ...
      }

      The specified order would be ["math", "reasoning", "biology"].

  2. The layer ordering (layers)
    • Automatically generated and should not be manipulated as it controls the application of scalings.

We provide an ordering file which contains the ordering for the X-LoRA model associated with the paper and the Huggingface repository: https://huggingface.co/lamm-mit/x-lora.

LoRA case

An ordering JSON file for LoRA contains 2 major parts:

  1. The adapter names order (optional):
    • The order does not matter
    • This controls which adapters will be initially activated
    • If this key is not specified, then no adapters will be activated initially
  2. Preload adapter section preload_adapters (optional): see this section
    • Order does not matter
    • Specifies the adapter name and the model ID to find them, which may be a local path.

Preparing the ordering file (LoRA or X-LoRA cases)

There are 2 scripts to prepare the ordering file, and they work for both X-LoRA and LoRA. The ordering file is specific to each architecture and set of target modules. Therefore, if either is changed, it is necessary to create a new ordering file using the first option. If only the adapter order or the adapters changed, then the second option should be used.

  1. From scratch: No ordering file for the architecture and target modules

    A script create_ordering.py is provided which prompts the user for the model ID, target modules, and adapter names. The user is prompted for an output file location, relative to the working directory.

  2. Create a new ordering file from an existing ordering file for an architecture and target modules

    A script set_names.py is provided which prompts the user for the adapter names and the old ordering file. The user is prompted for an output file location, relative to the working directory.

Quantized X-LoRA or LoRA models

Mistral.rs supports running quantized models with X-LoRA or LoRA. The X-LoRA or LoRA adapter layers will not be quantized, only the base model.

In the X-LoRA case, please note that using a high quantization level (e.g., 4-bit) can distort the signal and prevent the classifier from acting properly. Therefore, it is better to use a less aggressive level such as 8-bit.

Avoiding the scaling pass with non-granular scalings

The X-LoRA implementation supports non-granular scalings. The scalings are cached after k completion tokens have been generated and are reused for the remaining forward passes, avoiding the scaling pass. The number of tokens to generate before caching is set with tgt_non_granular_index. Setting tgt_non_granular_index restricts the maximum number of running sequences to 1.

Please see this page for more details and examples.

Adapter model dynamic adapter activation

We support dynamic adapter activation for LoRA models, allowing you to activate a set of adapters at runtime. This is available through the Python, Rust, and HTTP APIs.

To use this feature, you should add a preload_adapters key to your ordering file:

{
    "order": ["..."],
    "layers": {"...": "123"},
    "base_model_id": "...",
    "preload_adapters": [{"name": "...", "adapter_model_id": "..."}]
}

This allows mistral.rs to preload the adapter and enable runtime activation.

Examples of LoRA and X-LoRA models

  • X-LoRA with no quantization

To start an X-LoRA server with the adapters exactly as presented in the paper:

mistralrs serve -p 1234 --xlora lamm-mit/x-lora --xlora-order orderings/xlora-paper-ordering.json
  • LoRA with a model from GGUF

To start a LoRA server with adapters from the X-LoRA paper (you should modify the ordering file to use only one adapter, as the adapter static scalings are all 1 and so the signal will become distorted):

mistralrs serve -p 1234 --format gguf -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q8_0.gguf --lora lamm-mit/x-lora

Normally with a LoRA model you would use a custom ordering file. However, for this example we use the ordering from the X-LoRA paper because we are using the adapters from the X-LoRA paper.

X-LoRA non-granular scalings

A key limitation of the X-LoRA architecture is the need for 2 forward passes of the model per generation step. To trade off model performance for speed, mistral.rs allows the user to reduce the granularity of the scalings by caching them in a technique we call Non Granular Scalings.

How it works

For the first $k$ generation steps, the scalings are calculated normally for each token. For the remaining tokens, the scalings are cached and re-used. In this way, we avoid the second forward pass and performance increases significantly. To maintain correctness, enabling non-granular scalings restricts the engine to processing one sequence at a time.

How to use it

Command line

This can be enabled by passing --tgt-non-granular-index followed by $k$:

mistralrs serve -p 1234 --xlora lamm-mit/x-lora --xlora-order orderings/xlora-paper-ordering.json --tgt-non-granular-index 5

Python

Set the tgt_non_granular_index attribute to a non-None value in the Which selection:

from mistralrs import Runner, Which

runner = Runner(
    which=Which.XLoraGGUF(
        tok_model_id=None,  # Automatically determine from ordering file
        quantized_model_id="TheBloke/zephyr-7B-beta-GGUF",
        quantized_filename="zephyr-7b-beta.Q4_0.gguf",
        xlora_model_id="lamm-mit/x-lora",
        order="orderings/xlora-paper-ordering.json",
        tgt_non_granular_index=5,
    )
)

...

Build a memory-efficient MoE model from anything, in seconds

AnyMoE is a technique to dynamically and efficiently create MoE models. By providing a set of experts and a small pretraining dataset, you can create an MoE locally!

It has the following features:

  • Apply AnyMoE to any supported model
    • plain
    • vision-plain
  • Specify the layers to apply AnyMoE to for efficient training

Paper: https://arxiv.org/abs/2405.19076

https://github.com/EricLBuehler/mistral.rs/assets/65165915/33593903-d907-4c08-a0ac-d349d7bf33de

Note: By default, this has the capability to create a CSV loss log and loss image. When building from source (for Python or CLI), you may pass --no-default-features on the command line to disable this. This may be necessary if networking is unavailable.

Dataset

Currently, AnyMoE expects a JSON dataset with one top-level key, rows, which is an array of objects with keys prompt (string), expert (integer), and image_urls (optional array of strings). For example:

{
    "rows": [
        {
            "prompt": "Discuss the impact of Renaissance art on modern aesthetics",
            "expert": 0
        },
        {
            "prompt": "Explain the significance of the theory of relativity in modern physics",
            "expert": 1
        }
    ]
}

For a vision model, image_urls may contain an array of image URLs/local paths or Base64 encoded images.
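For instance, a hypothetical vision dataset following this schema could be generated like so (the prompts, URLs, local paths, and output filename are placeholders):

import json

dataset = {
    "rows": [
        {
            "prompt": "Describe what is happening in this image.",
            "expert": 0,
            "image_urls": ["https://example.com/scene.jpg"],
        },
        {
            "prompt": "What species of bird is shown here?",
            "expert": 1,
            "image_urls": ["./images/bird.png"],
        },
    ]
}

with open("amoe_vision.json", "w") as f:
    json.dump(dataset, f, indent=2)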

Experts

AnyMoE experts can be either fine-tuned models or LoRA adapter models. Only the mlp layers will be loaded from each. The experts must be homogeneous: all fine-tuned or all adapters. Additionally, you can specify which layers AnyMoE is applied to.

Note: When using LoRA adapter experts, it may not be necessary to set the layers where AnyMoE will be applied due to the lower memory usage.

Example of TOML selector with fine-tuned experts

[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"

[anymoe]
dataset_json = "examples/amoe.json"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"

Example of TOML selector with LoRA adapter experts

[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"

[anymoe]
dataset_json = "examples/amoe.json"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096

[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]

Examples

CLI

CLI usage is via the TOML selector where you can also find docs on the required fields.

For example, to use the demo fine-tuned expert:

mistralrs from-config --file toml-selectors/anymoe.toml

To use the demo LoRA expert:

mistralrs from-config --file toml-selectors/anymoe_lora.toml

Python example

from mistralrs import (
    Runner,
    Which,
    ChatCompletionRequest,
    Architecture,
    AnyMoeConfig,
    AnyMoeExpertType,
)

runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
        arch=Architecture.Mistral,
    ),
    anymoe_config=AnyMoeConfig(
        hidden_size=4096,
        dataset_json="examples/amoe.json",
        prefix="model.layers",
        mlp="mlp",
        expert_type=AnyMoeExpertType.FineTuned(),
        lr=1e-3,
        epochs=100,
        batch_size=4,
        model_ids=["HuggingFaceH4/zephyr-7b-beta"],
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

You can find this example here.

use anyhow::Result;
use mistralrs::{
    AnyMoeConfig, AnyMoeExpertType, AnyMoeModelBuilder, IsqType, PagedAttentionMetaBuilder,
    TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let text_builder = TextModelBuilder::new("mistralai/Mistral-7B-Instruct-v0.1")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?;

    let model = AnyMoeModelBuilder::from_text_builder(
        text_builder,
        AnyMoeConfig {
            hidden_size: 4096,
            lr: 1e-3,
            epochs: 100,
            batch_size: 4,
            expert_type: AnyMoeExpertType::LoraAdapter {
                rank: 64,
                alpha: 16.,
                target_modules: vec!["gate_proj".to_string()],
            },
            gate_model_id: None, // Set this to Some("path/to/model/id") for the pretrained gating model id
            training: true,
            loss_csv_path: None,
        },
        "model.layers",
        "mlp",
        "examples/amoe.json",
        vec!["HuggingFaceH4/zephyr-7b-beta"],
        vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    )
    .build()
    .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Matformer (Matryoshka Transformer) Support

Matformer allows you to dynamically resize transformer models at runtime, trading compute/memory for quality. This enables deploying the same model across devices with different resource constraints - from edge devices to powerful GPUs.

Quick Start

Command Line

# Run Gemma 3n with the E2.49B configuration (2.49B params instead of 3.98B)
mistralrs run -m google/gemma-3n-E4B-it \
  --matformer-config-path matformer_configs/gemma3n.csv \
  --matformer-slice-name "Config for E2.49B (block-level)"

Python

from mistralrs import Runner, Which, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=VisionArchitecture.Gemma3n,
        matformer_config_path="matformer_configs/gemma3n.csv",
        matformer_slice_name="Config for E2.49B (block-level)",
    ),
)

Rust

#![allow(unused)]
fn main() {
use mistralrs::VisionModelBuilder;
use std::path::PathBuf;

let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
    .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
    .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
    .build()
    .await?;
}

How It Works

Matformer models are pre-trained with a special architecture that allows certain layers to be skipped at inference time while maintaining reasonable quality. When you select a “slice”:

  1. Layer Skipping: Specified layers are completely removed from computation
  2. FFN Resizing: Feed-forward network dimensions can be adjusted per layer
  3. Automatic Remapping: Remaining layers are renumbered sequentially

For example, the Gemma 3n E2.49B (block-level) slice:

  • Keeps all 35 layers (no layer skipping)
  • Uses mixed FFN dimensions: 8192 for layers 0-19, 16384 for layers 20-24, 8192 for layers 25-34
  • Cuts parameters from 3.98B to 2.49B (~37% reduction)
  • Maintains ~87% of the full model’s quality

Configuration Files

Matformer configurations are CSV files with these columns:

name,# Layers,# Effective Params (B),MMLU PT accuracy,FFN Hidden Dims,Layers Skipped
Main model,35,3.98,62.30%,"[16384, 16384, ...]",
Config for E2.49B (block-level),35,2.49,54.50%,"[8192, 8192, ..., 16384, 16384, ..., 8192, 8192, ...]",
  • name: Slice identifier used in matformer_slice_name
  • # Layers: Number of active layers after skipping
  • # Effective Params (B): Approximate parameter count in billions
  • MMLU PT accuracy: Benchmark score (informational)
  • FFN Hidden Dims: List of FFN dimensions for each layer
  • Layers Skipped: Which layers to remove (0-indexed)
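
As a schematic illustration (not mistral.rs's actual parser), a slice could be looked up from such a CSV like this:

import ast
import csv

def load_matformer_slice(path: str, slice_name: str) -> dict:
    """Return the FFN dims and skipped layers for one named slice."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["name"] == slice_name:
                skipped = row.get("Layers Skipped", "")
                return {
                    "ffn_hidden_dims": ast.literal_eval(row["FFN Hidden Dims"]),
                    "layers_skipped": ast.literal_eval(skipped) if skipped else [],
                }
    raise ValueError(f"Matformer slice {slice_name!r} not found in {path}")

cfg = load_matformer_slice(
    "matformer_configs/gemma3n.csv", "Config for E2.49B (block-level)"
)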

Supported Models

Currently supported:

  • Gemma 3n (google/gemma-3n-E4B-it) - Multimodal model with vision and audio

See matformer_configs/ for available configurations.

Performance Guide

Memory Usage

Memory scales approximately with parameter count:

  • Full model (3.98B): ~8GB VRAM
  • E2.49B slice: ~5GB VRAM
  • E2B slice (1.91B): ~4GB VRAM
  • Smaller slices: Proportionally less

Inference Speed

Speed improvement is roughly linear with layer count:

  • 30 layers vs 35 layers = ~14% faster
  • 20 layers vs 35 layers = ~43% faster

Quality Trade-offs

Example accuracy on MMLU benchmark:

  • Full model: 62.3%
  • E2.98B: 59.5% (-4.5%)
  • E2.49B: 54.5% (-12.5%)
  • E2B: 50.9% (-18.3%)

Choose based on your requirements:

  • Maximum quality: Use full model (omit matformer args)
  • Balanced: E2.49B to E2.98B configurations (block-level configs recommended)
  • Resource-constrained: E2B configuration (1.91B params)
  • Extreme efficiency: E1.96B configuration

Advanced Usage

With Quantization

Combine Matformer with ISQ for maximum efficiency:

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=VisionArchitecture.Gemma3n,
        matformer_config_path="matformer_configs/gemma3n.csv",
        matformer_slice_name="Config for E2.49B (block-level)",
    ),
    in_situ_quant="Q4K"  # 4-bit quantization
)

With Device Mapping

Matformer works seamlessly with automatic device mapping:

#![allow(unused)]
fn main() {
use mistralrs::{VisionModelBuilder, DeviceMapSetting, AutoDeviceMapParams};

let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
    .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
    .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
    .with_device_mapping(DeviceMapSetting::Auto(
        AutoDeviceMapParams::default_vision()
    ))
    .build()
    .await?;
}

Only active layers are loaded to GPU, saving memory.

Creating Custom Configurations

To create your own Matformer configuration:

  1. Start with the full model as baseline
  2. Identify skippable layers:
    • Middle layers (10-30) are often good candidates
    • Avoid early layers (feature extraction) and late layers (final representations)
    • Never skip special layers (KV-sharing, attention patterns)
  3. Test quality degradation at each configuration
  4. Create CSV file with your configurations

Example minimal configuration:

name,# Layers,# Effective Params (B),FFN Hidden Dims,Layers Skipped
Tiny,15,0.8,"[4096, 4096, ...]","[5,6,7,10,11,12,15,16,17,20,21,22,25,26,27,30,31,32,33,34]"

API Reference

Command Line Arguments

  • --matformer-config-path PATH: Path to CSV configuration file
  • --matformer-slice-name NAME: Exact name of slice from CSV

Python Parameters

Which.VisionPlain(
    model_id: str,
    arch: VisionArchitecture,
    matformer_config_path: str = None,  # Path to CSV
    matformer_slice_name: str = None,   # Slice name
    # ... other parameters
)

Rust Methods

#![allow(unused)]
fn main() {
// For VisionModelBuilder
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)

// For TextModelBuilder (when supported)
.with_matformer_config_path(path: PathBuf)  
.with_matformer_slice_name(name: String)
}

Troubleshooting

Common Issues

“Matformer slice ‘X’ not found”

  • Check slice name matches exactly (case-sensitive)
  • Verify CSV file path is correct

“Layers X and Y are reserved and cannot be skipped”

  • Some models have special layers that must not be skipped
  • Try different layer combinations

Memory not reduced as expected

  • Ensure you’re using the slice (check logs)
  • Skipped layers still need to be loaded initially
  • Consider combining with quantization

Debugging

Enable logging to see Matformer details:

RUST_LOG=mistralrs_core=info mistralrs ...

This shows:

  • Configuration file loaded
  • Selected slice details
  • Layers being skipped
  • Final layer count

Future Plans

  • Support for more model architectures
  • Dynamic slice switching during runtime
  • Automatic slice selection based on available resources
  • Fine-tuning tools for creating new Matformer models

Device mapping

In mistral.rs, device mapping is automatically managed to be as performant and easy as possible. Automatic device mapping is enabled by default in the CLI/server and Python SDK and does not make any changes when the model fits entirely on the GPU.

Note

If your system has more than one CUDA device, mistral.rs will automatically use tensor parallelism. If the model does not completely fit on the available GPUs, or you wish to use automatic device mapping, you can disable tensor parallelism by setting MISTRALRS_NO_NCCL=1.

Automatic device mapping works by prioritizing loading models into GPU memory; any remaining parts are loaded into CPU memory. Model architectures that greatly benefit from GPU acceleration, such as vision models, automatically prioritize keeping those components on the GPU.

To control the mapping across devices, you can set the following maximum parameters, which describe what the model should expect in a prompt.

  • maximum sequence length (default: 4096)
  • maximum batch size (default: 1)
  • (vision models) maximum image length (length refers to the edge length) (default: 1024)
  • (vision models) maximum number of images (default: 1)

These parameters do not translate to hard limits during runtime, they only control the mapping.

Note

The maximum sequence length is also used to ensure that a KV cache will fit, both with and without PagedAttention.

If you want to manually device map the model (not recommended), please continue reading.

Note

Manual device mapping is deprecated in favor of automatic device mapping due to the possibility of user error.

Manual device mapping

There are 2 ways to do device mapping:

  1. Specify the number of layers to put on the GPU - this uses the GPU with ordinal 0.
  2. Specify the ordinals and number of layers - this allows for cross-GPU device mapping.

The format for the ordinals and number of layers is ORD:NUM;... where ORD is the unique ordinal and NUM is the number of layers for that GPU. This may be repeated as many times as necessary.

Note: We refer to GPU layers as “device layers” throughout mistral.rs.

Example of specifying ordinals

mistralrs run -n "0:16;1:16" -m gradientai/Llama-3-8B-Instruct-262k

Note: In the Python SDK, the “0:16;1:16” string is passed as the list ["0:16", "1:16"].

Example of specifying the number of GPU layers

mistralrs run -n 16 -m gradientai/Llama-3-8B-Instruct-262k

PagedAttention in mistral.rs

Mistral.rs supports PagedAttention (paper here) to accelerate both normal inference and batched inference on:

  • CUDA (Unix-like platforms such as WSL, Linux)
  • Metal

Our PagedAttention implementation has 2 inputs: GPU KV cache memory size, and block size. This gives you fine-grained control over the available context length by configuring the memory available for the KV cache. When using a CUDA device, PagedAttention is activated by default, but it can be disabled with no_paged_attn for Python or no-paged-attn for the CLI tools.

KV Cache Quantization

PagedAttention now supports KV cache quantization to reduce memory usage and potentially improve performance. The KV cache can be quantized to FP8 (F8E4M3 format) instead of using the model’s native dtype, significantly reducing memory requirements while maintaining model quality.

Available cache types:

  • auto (default): Uses the model’s native dtype for KV cache
  • f8e4m3: Quantizes KV cache to 8-bit floating point (E4M3 format)

When using FP8 quantization, the memory usage for KV cache is approximately halved compared to FP16, allowing for longer context lengths with the same GPU memory allocation.
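
As a rough back-of-the-envelope sketch of the savings (the layer, head, and context sizes below are illustrative, not a specific model's configuration):

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # K and V caches, one entry per token, per layer, per KV head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Illustrative shapes only.
tokens, layers, kv_heads, head_dim = 8192, 32, 8, 128

fp16 = kv_cache_bytes(tokens, layers, kv_heads, head_dim, 2)  # native 16-bit cache
fp8 = kv_cache_bytes(tokens, layers, kv_heads, head_dim, 1)   # f8e4m3 cache

print(f"FP16 KV cache: {fp16 / 2**30:.2f} GiB")
print(f"FP8  KV cache: {fp8 / 2**30:.2f} GiB (about half)")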

Note: The default block size if not specified is 32.

Note: if OOM occurs (this can be caused by a variety of factors including adapter activation, re-ISQ, and others), it is likely because the PagedAttention KV cache has already been allocated. To counter this, either set the KV cache memory to a lower amount or usage percentage (recommended) or disable paged attention entirely for a dynamically allocated cache.

Note: Paged Attention is not enabled on Windows platforms, only Unix-based platforms.

Note: In the CLI and Python SDK, Paged Attention is disabled by default for Metal. It can be enabled with the --paged-attn/paged_attn flags.

There are more features being added to this:

  • GGML model support
  • Adapter model support
  • Speculative decoding

Prefix caching is now supported with PagedAttention. PagedAttention can leverage the prefix cacher to cache KV prefix states across iterations for faster multi-turn inference.

Block-Level Prefix Caching

Prefix caching is a technique to reuse computed KV cache blocks across requests that share common prefixes (like system prompts). This can significantly speed up inference when multiple requests use the same prefix.

How It Works

  1. Block Hashing: Each block of tokens is assigned a unique hash based on its contents and the hash of its parent block:

    hash(block) = hash(parent_hash, block_tokens)
    

    This creates a hash chain that uniquely identifies any prefix sequence (a minimal sketch of this hash chain follows the list below).

  2. Cache Lookup: When allocating blocks for a new request, the scheduler checks if any full blocks match existing cached blocks by comparing hashes.

  3. Block Reuse: Matched blocks are reused directly - their pre-computed KV cache values are used without recomputation. Only the non-matching suffix tokens need to be processed.

  4. LRU Eviction: When memory is needed, least recently used cached blocks are evicted first.
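
To make the hash-chain idea in step 1 concrete, here is a toy Python sketch. It is not the actual PrefixCacher implementation; the block size and hash function are illustrative.

from hashlib import blake2b

BLOCK_SIZE = 32  # illustrative; the real block size is configurable

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each *full* block, chaining in the parent block's hash."""
    hashes, parent = [], b""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start : start + BLOCK_SIZE]
        digest = blake2b(parent + repr(block).encode(), digest_size=16).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes

# Two prompts that share the same first block produce the same leading hash,
# so that block's KV cache can be reused; later blocks differ.
a = block_hashes(list(range(64)))
b = block_hashes(list(range(32)) + list(range(100, 132)))
assert a[0] == b[0] and a[1] != b[1]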

Benefits

  • Multi-turn conversations: System prompts and conversation history are cached and reused
  • Batched requests: Multiple requests with shared prefixes (e.g., same system prompt) benefit from caching
  • Reduced TTFT: Time-to-first-token is reduced by skipping prefix computation

How It’s Enabled

Prefix caching is enabled by default when using PagedAttention and controlled by the same prefix_cache_n setting that controls the sequence-level prefix cacher:

  • CLI: --prefix-cache-n <N> (default 16). Set to 0 to disable prefix caching.
  • Python SDK: prefix_cache_n=<N> (default 16). Set to None or 0 to disable.
  • Rust SDK: .with_prefix_cache_n(Some(N)) (default 16). Pass None to disable.

Important: The two prefix caching systems are mutually exclusive:

  • PagedAttention uses block-level prefix caching (handled by PrefixCacher in BlockEngine)
  • Non-PagedAttention uses sequence-level prefix caching (handled by PrefixCacheManagerV2)

The prefix_cache_n setting controls both systems, but only one is active depending on whether PagedAttention is enabled. You’ll see one of these log messages at startup indicating which system is active:

  • Prefix caching enabled (block-level, PagedAttention).
  • Prefix caching enabled (sequence-level, non-paged attention).

Implementation Details

The prefix cache operates at the block level (not token level) for efficiency:

  1. Full blocks only: Only complete blocks (block_size tokens) are cached. Partial blocks at the end of a sequence are not cached.

  2. Hash chain: The hash for each block depends on all preceding blocks, ensuring the entire prefix matches.

  3. Copy-on-Write: Cached blocks use reference counting. When a cached block needs modification, it’s copied first (CoW).

  4. Memory management: The cache uses LRU eviction when allocating new blocks. Evicted blocks are returned to the free pool.

Performance Considerations

  • Block size affects cache granularity: larger blocks = fewer cache entries but coarser matching
  • Cache hit rate improves with more repeated prefixes
  • Memory overhead is minimal (just hash-to-block mappings)

Supported models:

  • Normal models
  • GGUF models
  • Vision models

Note: Prefix caching is supported when using PagedAttention. Configure the number of cached sequences with the same --prefix-cache-n / prefix_cache_n / .with_prefix_cache_n(...) settings described above (default 16).

FlashAttention V2/V3 + PagedAttention in mistral.rs

If mistral.rs is compiled with FlashAttention and PagedAttention is enabled, then FlashAttention will be used in tandem to accelerate the prefill phase.

Using the CLI

Add the --pa-memory-mb/--pa-memory-fraction and --pa-block-size parameters before the model selector. The GPU memory is given in MBs (or as a usage fraction) and the block size is the number of tokens per block. These parameters may be passed for any supported model type.

To enable KV cache quantization, use the --pa-cache-type parameter with either auto (default) or f8e4m3.

mistralrs run --pa-memory-mb 8192 --pa-block-size 32 --isq 4 -m microsoft/Phi-3-mini-128k-instruct
mistralrs run --pa-memory-fraction 0.95 --pa-block-size 32 --format gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Example with FP8 KV cache quantization:

mistralrs run --paged-attn on --pa-memory-mb 4096 --pa-block-size 32 --pa-cache-type f8e4m3 -m microsoft/Phi-3-mini-128k-instruct

Using the Rust SDK

You can find this example here.

use anyhow::Result;
use mistralrs::{
    IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
    TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| {
            PagedAttentionMetaBuilder::default()
                .with_block_size(32)
                .with_gpu_memory(MemoryGpuConfig::ContextSize(1024))
                .build()
        })?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Example with FP8 KV cache quantization:

use anyhow::Result;
use mistralrs::{
    IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, PagedCacheType, 
    TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| {
            PagedAttentionMetaBuilder::default()
                .with_block_size(32)
                .with_gpu_memory(MemoryGpuConfig::ContextSize(1024))
                .with_cache_type(PagedCacheType::F8E4M3)
                .build()
        })?
        .build()
        .await?;

    // ... rest of the code remains the same
}

Using the Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
        arch=Architecture.Mistral,
    ),
    pa_gpu_mem = 4096,
    pa_blk_size = 32,
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Example with FP8 KV cache quantization:

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture, PagedCacheType

runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
        arch=Architecture.Mistral,
    ),
    pa_gpu_mem = 4096,
    pa_blk_size = 32,
    pa_cache_type = PagedCacheType.F8E4M3,
)

# ... rest of the code remains the same

Speculative Decoding

Speculative decoding is an inference acceleration technique that uses a smaller “draft” model to propose tokens, which are then validated in parallel by the larger “target” model. This can significantly speed up generation when the draft model frequently predicts tokens the target model would also choose.

Mistral.rs implements speculative decoding based on the paper: Fast Inference from Transformers via Speculative Decoding.

How It Works

  1. The draft model generates gamma candidate tokens autoregressively
  2. The target model evaluates all candidate tokens in a single forward pass
  3. Using rejection sampling, tokens are accepted or rejected:
    • Accept if the target model’s probability >= draft model’s probability
    • Otherwise, accept with probability p_target(x) / p_draft(x)
    • If rejected, sample from the normalized difference distribution

This approach guarantees the same output distribution as running the target model alone, while often achieving significant speedups.
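
As a toy illustration of the accept/reject rule above (not the engine's actual sampler), the per-token decision can be sketched as follows:

import numpy as np

def speculative_accept(p_target: np.ndarray, p_draft: np.ndarray, drafted: int,
                       rng: np.random.Generator) -> int:
    """Decide whether to keep one drafted token.

    p_target / p_draft are the two models' probability distributions over the
    vocabulary at this position; `drafted` is the token proposed by the draft
    model. Returns the token to emit.
    """
    ratio = p_target[drafted] / max(p_draft[drafted], 1e-12)
    if rng.random() < min(1.0, ratio):
        return drafted  # accepted (always accepted when p_target >= p_draft)
    # Rejected: sample from the normalized positive difference distribution.
    residual = np.clip(p_target - p_draft, 0.0, None)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual))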

Configuration

The key parameter is gamma - the number of draft tokens to generate per speculation step. Higher values can increase throughput when the draft model is accurate, but waste computation when predictions are frequently rejected.

Recommended values: Start with gamma = 12-32 and tune based on your models and workload.

Requirements

  • Same tokenizer: Both target and draft models must share the same tokenizer vocabulary
  • Same model category: Both must be the same type (e.g., both text models or both vision models)
  • KV cache enabled: Both models must have KV caching enabled (default behavior)

Limitations

Note: PagedAttention is not currently supported with speculative decoding.

Note: Prefix caching is not supported with speculative decoding.

Note: Hybrid KV caches are not supported with speculative decoding.

Using TOML Configuration

The recommended way to configure speculative decoding is via TOML. Create a config file (e.g., speculative.toml):

[model]
model_id = "meta-llama/Llama-3.1-8B-Instruct"

[speculative]
gamma = 12

[speculative.draft_model]
model_id = "meta-llama/Llama-3.2-1B-Instruct"

Then run with:

mistralrs run --from-toml speculative.toml

The draft model can use any supported format (Plain, GGUF, etc.) and can have different quantization than the target model.

TOML with GGUF Draft Model

[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

[speculative]
gamma = 16

[speculative.draft_model]
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
model_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"

TOML with ISQ Quantization

[model]
model_id = "meta-llama/Llama-3.1-8B-Instruct"

[speculative]
gamma = 16

[speculative.draft_model]
model_id = "meta-llama/Llama-3.2-1B-Instruct"
isq = "Q8_0"

Using the Python SDK

from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
        arch=Architecture.Mistral,
    ),
    which_draft=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    ),
    speculative_gamma=32,
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

Python SDK Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| which_draft | Which | Draft model specification (Plain, GGUF, etc.) |
| speculative_gamma | int | Number of draft tokens per step (default: 32) |

Using the Rust SDK

You can find this example at mistralrs/examples/speculative/main.rs.

use anyhow::Result;
use mistralrs::{
    IsqType, RequestBuilder, SpeculativeConfig, TextMessageRole, TextMessages,
    TextModelBuilder, TextSpeculativeBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let target = TextModelBuilder::new("meta-llama/Llama-3.1-8B-Instruct")
        .with_logging();
    let draft = TextModelBuilder::new("meta-llama/Llama-3.2-1B-Instruct")
        .with_logging()
        .with_isq(IsqType::Q8_0);
    let spec_cfg = SpeculativeConfig { gamma: 16 };

    let model = TextSpeculativeBuilder::new(target, draft, spec_cfg)?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Choosing Draft and Target Models

For best performance:

  1. Use the same model family - Draft models from the same family as the target (e.g., Llama 3.2-1B with Llama 3.1-8B) typically have higher acceptance rates
  2. Smaller is better for draft - The draft model should be significantly smaller than the target for meaningful speedup
  3. Quantize the draft model - Using ISQ or GGUF quantization on the draft model reduces memory and improves draft generation speed
  4. Tune gamma - Monitor acceptance rates and adjust gamma accordingly

Example Model Pairings

| Target Model | Draft Model | Notes |
| --- | --- | --- |
| Llama 3.1-8B | Llama 3.2-1B | Same family, good acceptance |
| Llama 3.1-70B | Llama 3.1-8B | Large speedup potential |
| Mistral-7B | Mistral-7B (Q4_K_M GGUF) | Same model, quantized draft |

Performance Considerations

  • Acceptance rate: Higher acceptance rates lead to better speedups. Monitor your logs for rejection statistics.
  • Draft model overhead: If the draft model is too large relative to the target, the overhead may negate speedup benefits.
  • Batch size: Speculative decoding is most beneficial for single-request scenarios. For high-throughput batch inference, standard decoding may be more efficient.
  • Memory usage: Both models must fit in memory simultaneously. Consider quantizing one or both models.

Combining with Other Features

Speculative decoding can be combined with:

  • ISQ quantization - Quantize target, draft, or both models
  • X-LoRA adapters - Use adapters on the target model
  • Device mapping - Distribute models across multiple GPUs

See examples/python/speculative_xlora.py for an example combining speculative decoding with X-LoRA.

FlashAttention in mistral.rs

Mistral.rs supports FlashAttention V2 and V3 on CUDA devices (V3 is only supported when CC >= 9.0).

Note: If compiled with FlashAttention and PagedAttention is enabled, then FlashAttention will be used in tandem to accelerate the prefill phase.

GPU Architecture Compatibility

| Architecture | Compute Capability | Example GPUs | Feature Flag |
| --- | --- | --- | --- |
| Ampere | 8.0, 8.6 | RTX 30*, A100, A40 | --features flash-attn |
| Ada Lovelace | 8.9 | RTX 40*, L40S | --features flash-attn |
| Hopper | 9.0 | H100, H800 | --features flash-attn-v3 |
| Blackwell | 10.0, 12.0 | RTX 50* | --features flash-attn |

Note: FlashAttention V2 and V3 are mutually exclusive.

Note: To use FlashAttention in the Python SDK, compile from source.

Multi-head Latent Attention (MLA) in mistral.rs

Multi-head Latent Attention (MLA) is an efficient attention mechanism that reduces KV cache memory usage by compressing key-value states into a low-rank latent space. This technique was introduced in DeepSeek V2 and is also used in DeepSeek V3 and GLM-4.7-Flash models.

How It Works

MLA compresses the key-value cache by:

  1. Projecting KV states into a compact latent representation (kv_lora_rank dimensions)
  2. Storing only the compressed latent vectors and rotary position embeddings in the KV cache
  3. Reconstructing full KV states on-the-fly during attention computation

This results in significant memory savings compared to standard multi-head attention, enabling longer context lengths with the same GPU memory.
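
As a rough sketch of the per-token cache footprint (the multi-head numbers below are illustrative rather than a specific model's configuration; the MLA dimensions match those listed under Supported Models):

# Per-token, per-layer KV cache elements.
def standard_mha_elems(num_kv_heads: int, head_dim: int) -> int:
    return 2 * num_kv_heads * head_dim  # K and V

def mla_elems(kv_lora_rank: int, kpe_head_dim: int) -> int:
    return kv_lora_rank + kpe_head_dim  # compressed latent + rotary part

full = standard_mha_elems(num_kv_heads=128, head_dim=128)  # 32768 elements (illustrative)
compressed = mla_elems(kv_lora_rank=512, kpe_head_dim=64)  # 576 elements
print(f"~{full / compressed:.0f}x fewer cached elements per token per layer")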

Supported Models

MLA is automatically enabled for the following model architectures when using PagedAttention on CUDA:

| Model | Architecture | MLA Dimensions |
| --- | --- | --- |
| DeepSeek V2 | deepseekv2 | kv_lora_rank varies |
| DeepSeek V3 | deepseekv3 | kv_lora_rank=512, kpe_head_dim=64 |
| GLM-4.7-Flash | glm4moelite | kv_lora_rank=512, kpe_head_dim=64 |

Requirements

MLA decode optimization requires:

  • CUDA on Unix-like platforms (Linux, WSL)
  • PagedAttention enabled
  • Compatible model architecture (see table above)

When these conditions are met, MLA is automatically used during the decode phase for optimal performance.

Performance Benefits

MLA provides two key optimizations:

  1. Reduced KV Cache Memory: The compressed latent representation uses significantly less memory than full key-value states, allowing for:

    • Longer context lengths
    • Larger batch sizes
    • More efficient memory utilization
  2. Optimized Decode Kernels: Custom FlashInfer-based MLA kernels accelerate single-token generation by:

    • Operating directly on compressed latent states
    • Avoiding repeated KV decompression
    • Leveraging efficient memory access patterns

Disabling MLA

If you encounter issues or want to compare performance, you can disable MLA by setting the environment variable:

MISTRALRS_NO_MLA=1 mistralrs ...

When disabled, the model falls back to standard PagedAttention with full KV cache storage.

Technical Details

KV Cache Layout

When MLA is enabled, PagedAttention uses a specialized cache layout:

  • Key cache: Stores compressed latent vectors (kv_lora_rank dimensions) + rotary position embeddings (kpe_head_dim dimensions)
  • Value cache: Shares the same block structure for efficient memory management

Decode Path

During single-token generation (decode phase):

  1. Query is projected to latent space
  2. Attention is computed directly on compressed KV states using FlashInfer MLA kernels
  3. Output is projected back from latent space

Prefill Path

During prompt processing (prefill phase):

  1. Full KV states are computed for the current chunk
  2. Compressed latents are stored in the PagedAttention cache
  3. For prefix-cached sequences, latents are retrieved and decompressed as needed

Distributed inference in mistral.rs

Mistral.rs supports distributed inference with a few strategies.

What backend is best?

  • For CUDA-only systems: NCCL
  • Anything else: Ring backend

The Ring backend is also heterogeneous! This means that you can use the Ring backend with any set of devices connected over TCP. For example, you can connect 2 Metal systems, or 2 Metal systems and 1 CPU system, with the Ring backend!

NCCL in mistral.rs

Mistral.rs supports distributed inference on CUDA with Tensor Parallelism via NCCL.

Note: Multi-node support is coming! Distributed inference on Apple hardware is also being investigated.

Tensor Parallelism (TP) is automatically used to accelerate distributed inference when more than one CUDA GPU is detected. The tensor parallelism size is always automatically set to the total number of GPUs.

TP splits the model into shards and benefits from fast single-node interconnects like NVLink. Both normal and vision models support tensor parallelism.

Important: The world size (total number of GPUs) must be a power of 2 (e.g., 1, 2, 4, 8, 16, 32, etc.). This is a requirement for optimal performance and correct operation of the distributed algorithms.

Note: In mistral.rs, if NCCL is enabled, then automatic device mapping will not be used.

Important: To build for NCCL, be sure to add the nccl feature flag (for example: --features nccl,cuda).

See the following environment variables:

| Name | Function | Usage |
| --- | --- | --- |
| MISTRALRS_NO_NCCL=1 | Disable TP and NCCL | If the model does not fit on the available CUDA devices, disabling NCCL will re-enable automatic device mapping |

Single-Node Support

Set the number of ranks using MISTRALRS_MN_LOCAL_WORLD_SIZE, e.g.,

MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -p 8000 -m Qwen/Qwen3-30B-A3B-Instruct-2507

If no MISTRALRS_MN_LOCAL_WORLD_SIZE is given, mistral.rs will split the model across all available devices.

Multi-node support

# Head node:
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_HEAD_NUM_WORKERS=1 MISTRALRS_MN_HEAD_PORT=<PORT> mistralrs run -m ...

# For the worker nodes:
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=0 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=1 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...
MISTRALRS_MN_GLOBAL_WORLD_SIZE=32 MISTRALRS_MN_WORKER_ID=2 MISTRALRS_MN_WORKER_SERVER_ADDR=<HEAD ADDR>:<PORT> mistralrs run -m ...

Multi-node support in mistral.rs divides the nodes into two groups: a “head” node, and multiple “worker” nodes. Head node choice is arbitrary. For example, if a system has 8 nodes, there will be 1 “head” node, and 7 “worker” nodes.

To enable multi-node, set the MISTRALRS_MN_GLOBAL_WORLD_SIZE=<number> environment variable to the total number of GPUs across all nodes, including the head node and all worker nodes. Note: This number must be a power of 2.

It is recommended to use server mode with mistral.rs when in multi-node. Currently, you must send requests to every node!

The following environment variables must be set for each node:

Head node:

| Name | Function | Usage |
| --- | --- | --- |
| MISTRALRS_MN_HEAD_NUM_WORKERS=<number> | The number of worker nodes which will be connected. | This should be the number of nodes in the system, minus 1 for the head node. |
| MISTRALRS_MN_HEAD_PORT=<PORT> | The port on which to communicate with the worker nodes. | Worker nodes will connect to this port via TCP sockets |

Worker node:

| Name | Function | Usage |
| --- | --- | --- |
| MISTRALRS_MN_WORKER_ID=<number> | The 0-indexed worker ID for this worker node. | If there are 4 nodes (1 head, 3 workers), then the worker ids will be 0, 1, and 2 |
| MISTRALRS_MN_WORKER_SERVER_ADDR=<ADDR>:<PORT> | The IP address and port to connect to the server. | This is used to establish communication with the head node. |

Ring backend in mistral.rs

Mistral.rs provides a TCP-based ring backend for distributed tensor-parallel inference. This backend is enabled by compiling with the ring feature and implements collective operations over a ring topology using TCP sockets.

Prerequisites

  • Build with the ring feature enabled, in addition to any others:
    cargo build --release --features ring
    
  • Ensure the specified TCP ports are open and reachable between processes.
  • The world_size must be a power of 2 (2, 4, 8, 16, etc.) for correct operation.

Configuration

Create one JSON configuration file per process with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| master_ip | string | Optional. IP address for master node. |
| master_port | integer | Optional. Port for master node. |
| port | integer | Local port to bind for incoming connections from the left neighbor. |
| right_port | integer | Port on which the right neighbor is listening (used to connect outgoing to the right). |
| right_ip | string | Optional. IP address of the right neighbor (defaults to 0.0.0.0). |
| rank | integer | Rank of this process in [0..world_size). |
| world_size | integer | Total number of processes in the ring. Must be a power of 2 (e.g., 2, 4, 8, 16, etc.). |

The addresses and ports should form a ring topology across the nodes. For example, the last node should point to the first node as its right neighbor.

Although all processes participate in collective communication, Rank 0 acts as the master node. For example, interactive mode or the server runs on Rank 0, while other ranks act as background workers.

Example ring topology:

+---------+         +---------+
| Rank 0  | ----->  | Rank 1  |
| IP: A   |         | IP: B   |
| Port: X |         | Port: Y |
+----+----+         +----+----+
     ^                   |
     |                   v
+----+----+         +----+----+
| Rank 3  | <-----  | Rank 2  |
| IP: D   |         | IP: C   |
| Port: W |         | Port: Z |
+---------+         +---------+

Each node connects to its right neighbor by IP and port, and the last node wraps around to the first.

Example for two processes:

  • ring_0.json:

    {
      "master_ip": "0.0.0.0",
      "master_port": 1234,
      "port": 12345,
      "right_port": 12346,
      "rank": 0,
      "world_size": 2
    }
    
  • ring_1.json:

    {
      "master_ip": "0.0.0.0",
      "master_port": 1234,
      "port": 12346,
      "right_port": 12345,
      "rank": 1,
      "world_size": 2
    }
    

Multi-Machine Example

To run on different machines, update the right_ip field in each config to the actual IP address of the neighbor process. For example, if you have two machines with IPs 192.168.1.10 and 192.168.1.11:

  • ring_0.json on Machine A (192.168.1.10):

    {
      "port": 12345,
      "right_port": 12346,
      "right_ip": "192.168.1.11",
      "rank": 0,
      "world_size": 2
    }
    
  • ring_1.json on Machine B (192.168.1.11):

    {
      "port": 12346,
      "right_port": 12345,
      "right_ip": "192.168.1.10",
      "rank": 1,
      "world_size": 2
    }
    

Make sure that the specified ports are open and that each machine can reach the other via TCP on those ports.

Usage

Set the RING_CONFIG environment variable to point to the JSON file for each process, then run your application built with the ring feature:

# Process 0 or computer 0
export RING_CONFIG=path/to/ring_0.json
cargo run --release --features ring -- ...

# Process 1 or computer 1
export RING_CONFIG=path/to/ring_1.json
cargo run --release --features ring -- ...

The ring backend will automatically handle collective communication for tensor-parallel inference.

Tool calling

Tool calling makes LLMs smarter.

LLMs use tool calling to interact with the outside world. Mistral.rs has OpenAI compatible support for tool calling in all APIs, HTTP, Python, and Rust.

Note that some models, such as Mistral Small/Nemo models, require a chat template to be specified. For example:

mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503

OpenAI docs: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models

We support OpenAI-compatible tool calling for the following models, and we parse their native tool-calling formats:

  • Llama 4
  • Llama 3.1/3.2/3.3
  • Mistral Small (including 3.1 + multimodal)
  • Mistral Nemo
  • Hermes 2 Pro
  • Hermes 3
  • DeepSeek V2/V3/R1
  • Qwen 3

All models that support tool calling will respond according to the OpenAI tool calling API.
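
Because the HTTP server is OpenAI-compatible, any OpenAI client can send tool definitions. Below is a minimal sketch using the openai Python package against a server started with mistralrs serve -p 1234 ...; the get_weather tool is purely hypothetical and stands in for your own function.

import json
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Hypothetical tool definition following the standard OpenAI schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=tools,
    tool_choice="auto",
)

tool_calls = completion.choices[0].message.tool_calls
if tool_calls:
    call = tool_calls[0].function
    print(call.name, json.loads(call.arguments))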

OpenAI compatible HTTP example

Please see our example here.

OpenAI docs: https://platform.openai.com/docs/api-reference/chat/create?lang=curl

Rust example

Please see our example here.

Python example

Please see our notebook here.

Tool callbacks

You can override tool execution using a tool callback. The callback receives the tool name and a dictionary of arguments and must return the tool output as a string.

Python

def tool_cb(name: str, args: dict) -> str:
    if name == "local_search":
        return json.dumps(local_search(args.get("query", "")))
    return ""

runner = Runner(
    which=Which.Plain(model_id="YourModel/ID", arch=Architecture.Llama),
    tool_callback=tool_cb,
)

See custom_search.py for a full example. In Rust pass .with_tool_callback(...) to the builder as demonstrated in custom_search/main.rs.

Search callbacks

Web search uses a DuckDuckGo-based callback by default. Provide your own search function with search_callback in Python or .with_search_callback(...) in Rust. Each callback should return a list of results with title, description, url and content fields. See WEB_SEARCH.md for more details and examples.

Web search tool in mistral.rs

mistral.rs is compatible with OpenAI’s web_search_options parameter! Once enabled, this allows web searching for models.

This works with all models that support tool calling. However, your mileage may vary depending on the specific model. The following models worked well during testing and are recommended:

  • Hermes 3 3b/8b
  • Mistral 3 24b
  • Llama 4 Scout/Maverick
  • Qwen 3 (⭐ Recommended!)

Web search is supported both in streaming and completion responses! This makes it easy to integrate and test out in interactive mode!

Besides tool calling and parsing of web content, we also use an embedding model to select the most relevant search results.

You can use the web search tool in all the APIs: Python, Rust, and server.

Selecting a search embedding model

Internally, we now use google/embeddinggemma-300m to embed documents for ranking. You can pick from the built-in reranker variants (currently just embedding_gemma) in every API:

  • Rust: with_search(SearchEmbeddingModel::EmbeddingGemma300M) in the builder
  • Python: search_embedding_model="embedding_gemma" in the Runner
  • Server: --search-embedding-model embedding_gemma flag
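
For instance, with the Python SDK this looks roughly like the following sketch (the model ID is just a placeholder):

from mistralrs import Runner, Which, Architecture

runner = Runner(
    which=Which.Plain(
        model_id="NousResearch/Hermes-3-Llama-3.1-8B",
        arch=Architecture.Llama,
    ),
    enable_search=True,
    # Select the built-in EmbeddingGemma reranker for search results
    search_embedding_model="embedding_gemma",
)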

Specifying a custom search callback

By default, mistral.rs uses a DuckDuckGo-based search callback. To override this, you can provide your own search function:

  • Rust: use .with_search_callback(...) on the model builder with an Arc<dyn Fn(&SearchFunctionParameters) -> anyhow::Result<Vec<SearchResult>> + Send + Sync>.
  • Python: pass the search_callback keyword argument to Runner, which should be a function def search_callback(query: str) -> List[Dict[str, str]] returning a list of results with keys "title", "description", "url", and "content".

Example in Python:

from mistralrs import Runner, Which, Architecture

def search_callback(query: str) -> list[dict[str, str]]:
    # Implement your custom search logic here, returning a list of result dicts
    return [
        {
            "title": "Example Result",
            "description": "An example description",
            "url": "https://example.com",
            "content": "Full text content of the page",
        },
        # more results...
    ]

runner = Runner(
    which=Which.Plain(model_id="YourModel/ID", arch=Architecture.Mistral),
    enable_search=True,
    search_callback=search_callback,
)

HTTP server

Be sure to add --enable-search!

Here are some examples using various models. This works for both streaming and completion requests, including interactive mode!

mistralrs run --enable-search --isq 4 -m Qwen/Qwen3-4B
mistralrs serve --enable-search -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503
mistralrs run --enable-search --isq 4 -m NousResearch/Hermes-3-Llama-3.1-8B

You can then query the server with an OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

messages = [
    {
        "role": "user",
        "content": "Can you show me some code using mistral.rs for running Llama 3.2 Vision?",
    }
]

completion = client.chat.completions.create(
    model="default",
    messages=messages,
    tool_choice="auto",
    max_tokens=1024,
    web_search_options={},
)

# print(completion.usage)
print(completion.choices[0].message.content)

if completion.choices[0].message.tool_calls is not None:
    # Should never happen.
    tool_called = completion.choices[0].message.tool_calls[0].function
    print(tool_called)

Python SDK

from mistralrs import (
    Runner,
    Which,
    ChatCompletionRequest,
    Architecture,
    WebSearchOptions,
)

# Define a custom search callback if desired
def my_search_callback(query: str) -> list[dict[str, str]]:
    # Fetch or compute search results here
    return [
        {
            "title": "Mistral.rs GitHub",
            "description": "Official mistral.rs repository",
            "url": "https://github.com/EricLBuehler/mistral.rs",
            "content": "mistral.rs is a Rust binding for Mistral models...",
        },
    ]

runner = Runner(
    which=Which.Plain(
        model_id="NousResearch/Hermes-3-Llama-3.1-8B",
        arch=Architecture.Llama,
    ),
    enable_search=True,
    search_callback=my_search_callback,
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": "Can you show me some code using mistral.rs for running Llama 3.2 Vision?",
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
        web_search_options=WebSearchOptions(
            search_context_size=None, user_location=None
        ),
    )
)
print(res.choices[0].message.content)
print(res.usage)

Rust SDK

use anyhow::Result;
use mistralrs::{
    SearchEmbeddingModel, IsqType, RequestBuilder, TextMessageRole, TextMessages, TextModelBuilder,
    WebSearchOptions,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("NousResearch/Hermes-3-Llama-3.1-8B")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .with_search(SearchEmbeddingModel::default())
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "What is the weather forecast for Boston?",
    );
    let messages =
        RequestBuilder::from(messages).with_web_search_options(WebSearchOptions::default());

    let response = model.send_chat_request(messages).await?;

    println!("What is the weather forecast for Boston?\n\n");
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

Chat templates and tokenizer customization

Some models do not come with support for tool calling or other features, and as such it might be necessary to specify your own chat template.

We provide some chat templates here, and it is easy to modify or create others to customize chat template behavior.

To use one of these, pass the --jinja-explicit parameter to the various APIs:

mistralrs serve -p 1234 --isq 4 --jinja-explicit chat_templates/mistral_small_tool_call.jinja -m mistralai/Mistral-Small-3.1-24B-Instruct-2503

Chat template overrides

Mistral.rs attempts to automatically load a chat template from the tokenizer_config.json file. This enables high flexibility across instruction-tuned models and ensures accurate chat templating. However, if the chat_template field is missing, a Jinja chat template should be provided. The Jinja chat template may use messages, add_generation_prompt, bos_token, eos_token, and unk_token as inputs.

For example, to use the chatml template, specify --chat-template before the model:

mistralrs serve -p 1234 --log output.log --chat-template ./chat_templates/chatml.json -m meta-llama/Llama-3.2-3B-Instruct

Note: For GGUF models, the chat template may be loaded directly from the GGUF file by omitting any other chat template sources.

Tokenizer

Some models do not provide a tokenizer.json file although mistral.rs expects one. To solve this, please run this script. It will output the tokenizer.json file for your specific model. This may be used by passing the --tokenizer-json flag after the model architecture. For example:

$ python3 scripts/get_tokenizers_json.py
Enter model ID: microsoft/Orca-2-13b
$ mistralrs serve -p 1234 --log output.log -m microsoft/Orca-2-13b --tokenizer-json tokenizer.json

Putting it all together, to run, for example, an Orca model (which does not come with a tokenizer.json or chat template):

  1. Generate the tokenizer.json by running the script at scripts/get_tokenizers_json.py. This will output some files including tokenizer.json in the working directory.
  2. Find and copy the correct chat template from chat_templates to the working directory (e.g., cp chat_templates/chatml.json .)
  3. Run mistralrs serve, specifying the tokenizer and chat template: mistralrs serve -p 1234 --log output.txt --chat-template chatml.json -m microsoft/Orca-2-13b -t tokenizer.json

Note: For GGUF models, the tokenizer may be loaded directly from the GGUF file by omitting the tokenizer model ID.

Sampling and penalty techniques in mistral.rs

mistral.rs supports a comprehensive set of sampling and penalty techniques to control text generation. These can be configured via the HTTP API, Python SDK, or Rust SDK.

Temperature

Controls the randomness of token selection. Lower values make output more deterministic, higher values increase creativity and randomness.

  • Range: 0.0 to 2.0 (typically 0.0 to 1.0)
  • Default: Model-dependent, usually around 0.7
  • Effect: At 0.0, always selects the most likely token (greedy). At higher values, sampling becomes more diverse.

Top K

Limits token selection to the K most likely tokens.

  • Range: 1 to vocabulary size
  • Effect: Lower values restrict choices to only the most probable tokens, reducing randomness.

Top P (Nucleus Sampling)

Limits token selection to the smallest set of tokens whose cumulative probability exceeds P.

  • Range: 0.0 to 1.0
  • Effect: At 0.1, only tokens comprising the top 10% probability mass are considered. More adaptive than Top K as it adjusts based on the probability distribution.

Min P

Filters out tokens with probability less than min_p * max_probability.

  • Range: 0.0 to 1.0
  • Effect: Removes low-probability tokens relative to the most likely token. Useful for preventing unlikely tokens from being selected.
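
As an illustration only (not the mistral.rs implementation), the Min P rule can be sketched as:

def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    # Keep only tokens whose probability is at least min_p * the top token's probability
    threshold = min_p * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= threshold}

# With min_p = 0.1, anything below 10% of the top probability (0.5 * 0.1 = 0.05) is dropped
print(min_p_filter({"the": 0.5, "a": 0.3, "zebra": 0.01}, min_p=0.1))
# -> {'the': 0.5, 'a': 0.3}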

Stop Sequences

Strings that, when generated, cause generation to stop immediately.

  • Type: Array of strings
  • Effect: Generation terminates as soon as any stop sequence is produced. Useful for controlling output boundaries.

Repetition Penalty

Applies a multiplicative penalty to tokens that have already appeared in the context.

  • Range: Typically 1.0 to 2.0
  • Effect: Values > 1.0 make repeated tokens less likely. This is distinct from frequency and presence penalties.

Frequency Penalty

Penalizes tokens based on how many times they’ve appeared in the generated text so far.

  • Range: -2.0 to 2.0
  • Effect: Positive values reduce repetition proportionally to token frequency. Negative values encourage repetition.

Presence Penalty

Penalizes tokens that have appeared at least once in the generated text.

  • Range: -2.0 to 2.0
  • Effect: Positive values discourage any repetition (binary penalty). Negative values encourage reusing tokens.
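
For intuition, OpenAI-style frequency and presence penalties are conventionally applied to the logits before sampling, roughly as in the sketch below (an illustration, not the mistral.rs internals):

from collections import Counter

def apply_penalties(
    logits: dict[str, float],
    generated: list[str],
    frequency_penalty: float,
    presence_penalty: float,
) -> dict[str, float]:
    counts = Counter(generated)
    adjusted = {}
    for tok, logit in logits.items():
        count = counts.get(tok, 0)
        # Frequency penalty scales with how often the token has appeared;
        # presence penalty is a one-off hit if it has appeared at all.
        adjusted[tok] = logit - frequency_penalty * count - presence_penalty * (1 if count else 0)
    return adjusted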

DRY (Don’t Repeat Yourself) Penalty

An advanced anti-repetition technique that detects and penalizes repeated sequences of tokens, not just individual tokens. See the original implementation for details.

DRY Parameters

  • dry_multiplier: Controls the strength of the penalty. Higher values more strongly discourage repetition.
  • dry_base: Base value for the exponential penalty calculation.
  • dry_allowed_length: Minimum sequence length before the penalty applies. Sequences shorter than this are not penalized.
  • dry_sequence_breakers: Array of tokens (like newlines, punctuation) that reset the sequence tracking. When these tokens appear, the DRY penalty starts fresh.

Example DRY Configuration (HTTP API)

{
  "dry_multiplier": 0.8,
  "dry_base": 1.75,
  "dry_allowed_length": 2,
  "dry_sequence_breakers": ["\n", ".", "!", "?", ";"]
}

API Usage

All sampling parameters can be set in API requests:

HTTP API

{
  "model": "default",
  "messages": [...],
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "min_p": 0.05,
  "repetition_penalty": 1.1,
  "frequency_penalty": 0.5,
  "presence_penalty": 0.5,
  "stop": ["END", "\n\n"],
  "dry_multiplier": 0.8,
  "dry_base": 1.75,
  "dry_allowed_length": 2,
  "dry_sequence_breakers": ["\n"]
}

Python SDK

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[...],
        temperature=0.7,
        top_p=0.9,
        top_k=40,
        min_p=0.05,
        repetition_penalty=1.1,
        frequency_penalty=0.5,
        presence_penalty=0.5,
        stop_seqs=["END", "\n\n"],
        dry_multiplier=0.8,
        dry_base=1.75,
        dry_allowed_length=2,
        dry_sequence_breakers=["\n"],
    )
)

Please suggest more sampling techniques by raising an issue!

Structured model loading with .toml files

Mistral.rs supports loading models from a .toml file, and the fields are the same as for the CLI. Please find some example toml selectors here.

There are a few cases which add functionality that cannot be found in the CLI.

Speculative decoding

What to specify

Under [speculative]

  • Specify the gamma parameter

Under [speculative.draft_model]

  • Choose a draft model, just like under [model] (only requirement is that they have the same tokenizer)

[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"

[speculative]
gamma = 32

[speculative.draft_model]
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
quantized_model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
quantized_filename = "mistral-7b-instruct-v0.1.Q2_K.gguf"

Run it with:

mistralrs from-config -f toml-selectors/speculative-gguf.toml

AnyMoE

What to specify

Under [anymoe] (all fields required unless noted otherwise)

  • Specify the dataset
  • Find and specify the prefix/mlp values
    • Go to https://huggingface.co/<MODEL ID>/tree/main?show_file_info=model.safetensors.index.json
    • Look for the mlp layers: For example model.layers.27.mlp.down_proj.weight means that the prefix is model.layers and the mlp is mlp.
  • Specify the expert or LoRA adapter model IDs
  • (Optional) Specify layers to apply AnyMoE to.

Under [anymoe.config]

  • Hidden size, typically found at https://huggingface.co/<BASE MODEL ID>/blob/main/config.json

(For LoRA experts) Under [anymoe.config.expert_type.lora_adapter]

  • Rank
  • Alpha
  • Target modules

Run it with:

mistralrs from-config -f toml-selectors/anymoe.toml

With fine-tuned experts

[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"

[anymoe]
dataset_json = "test.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"

With LoRA adapter experts

[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"

[anymoe]
dataset_json = "test.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096

[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]

Multi-Model Support

The mistralrs CLI supports loading and serving multiple models simultaneously, allowing you to switch between different models in the same server instance.

  • Each model runs in its own engine thread
  • Models can have different configurations (quantization, device layers, etc.)
  • Memory usage scales with the number of loaded models
  • All models share the same server configuration (port, logging, etc.)
  • Interactive mode uses the default model or the first model if no default is set
  • You can unload all models (including the last one) - they will auto-reload when accessed

Usage

Single-Model Mode (Default)

# Traditional usage - loads one model
mistralrs serve -p 1234 -m meta-llama/Llama-3.2-3B-Instruct

Multi-Model Mode

# Load multiple models from configuration file
mistralrs from-config --file config.json

Configuration File Format

Create a JSON file with model configurations as object keys:

{
  "llama3-3b": {
    "alias": "llama3-3b",
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "qwen3-4b": {
    "alias": "qwen3-4b",
    "Plain": {
      "model_id": "Qwen/Qwen3-4B"
    },
    "in_situ_quant": "Q4K"
  }
}

Configuration Structure

  • Object keys (e.g., "llama3-3b", "qwen3-4b"): Organizational labels (for human readability)
  • API identifiers: By default the pipeline name (usually the model_id inside the model spec). You can override this with alias.
  • Model specification: The model type and configuration (same format as CLI subcommands)
  • Optional fields:
    • alias: Custom model ID (nickname) used in API requests
    • chat_template: Custom chat template
    • jinja_explicit: JINJA template file
    • num_device_layers: Device layer configuration
    • in_situ_quant: In-situ quantization setting

How API identifiers work:

  • ✅ Object keys are organizational only (for config readability)
  • ✅ If alias is set, it becomes the API model ID
  • ✅ Otherwise, the pipeline name (usually the model_id field) is used
  • ✅ The canonical pipeline name remains accepted as an alias for compatibility

API Usage

Selecting Models in Requests

Use the model field in your requests to specify which model to use:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-3b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Default Model Behavior

  • Explicit model: Use the alias if configured (e.g., "llama3-3b"), otherwise the full pipeline name (e.g., "meta-llama/Llama-3.2-3B-Instruct")
  • Default model: Use "default" to explicitly request the default model
  • Auto-fallback: If the model field is omitted entirely, the default model will be used

# Use default model explicitly
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The default model is either:

  1. The model specified with --default-model-id when starting the server
  2. The first model loaded (if no default is explicitly set)
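
The same selection works from any OpenAI-compatible client. A short Python sketch, where the "llama3-3b" alias comes from the example configuration above:

from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Target a specific model by its alias...
resp = client.chat.completions.create(
    model="llama3-3b",
    messages=[{"role": "user", "content": "Hello!"}],
)

# ...or fall back to the server's default model
resp_default = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(resp.choices[0].message.content)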

List Available Models

curl http://localhost:1234/v1/models

Returns:

{
  "object": "list",
  "data": [
    {
      "id": "default",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "llama3-3b",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "qwen3-4b", 
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    }
  ]
}

Note: The "default" model is always listed first and represents the server’s default model. If aliases are configured, they will appear in the list while the canonical pipeline names remain accepted.

CLI Arguments

Use the from-config subcommand with these options:

  • --file <PATH> (required): Path to the JSON configuration file
  • --default-model-id <ID> (optional): Default model ID for requests that don’t specify a model (alias or pipeline name)

Syntax:

mistralrs from-config --file <CONFIG>

Examples

Example 1: Text Models

{
  "llama3-3b": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "qwen3-4b": {
    "Plain": {
      "model_id": "Qwen/Qwen3-4B"
    },
    "in_situ_quant": "Q4K"
  }
}

Example 2: Mixed Model Types

{
  "text-model": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "vision-model": {
    "VisionPlain": {
      "model_id": "google/gemma-3-4b-it"
    }
  }
}

Example 3: GGUF Models

{
  "llama-gguf": {
    "GGUF": {
      "tok_model_id": "meta-llama/Llama-3.2-3B-Instruct",
      "quantized_model_id": "bartowski/Llama-3.2-3B-Instruct-GGUF",
      "quantized_filename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
    }
  }
}

Model Unloading and Reloading

You can dynamically unload models to free memory and reload them on demand. This is useful for managing GPU memory when working with multiple large models.

Unload a Model

Unload a model from memory while preserving its configuration for later reload:

curl -X POST http://localhost:1234/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'

Response:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "unloaded"
}

Reload a Model

Manually reload a previously unloaded model:

curl -X POST http://localhost:1234/v1/models/reload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'

Response:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "loaded"
}

Check Model Status

Get the current status of a specific model:

curl -X POST http://localhost:1234/v1/models/status \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'

Response:

{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "loaded"
}

Possible status values:

  • loaded: Model is loaded and ready
  • unloaded: Model is unloaded but can be reloaded
  • reloading: Model is currently being reloaded
  • not_found: Model ID not recognized
  • no_loader_config: Model cannot be reloaded (missing loader configuration)
  • internal_error: An internal error occurred

Auto-Reload

When a request is sent to an unloaded model, it will automatically reload before processing the request. This enables a “lazy loading” pattern where models are only loaded when needed.
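
A quick sketch of this flow using the requests library and the endpoints documented above (the port and model ID are placeholders):

import requests

BASE = "http://localhost:1234/v1"
MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# Free the memory held by the model
requests.post(f"{BASE}/models/unload", json={"model_id": MODEL}).raise_for_status()

# The status endpoint should now report "unloaded"
print(requests.post(f"{BASE}/models/status", json={"model_id": MODEL}).json())

# Sending a request to the unloaded model triggers an automatic reload first
resp = requests.post(
    f"{BASE}/chat/completions",
    json={"model": MODEL, "messages": [{"role": "user", "content": "Hello!"}]},
)
print(resp.json()["choices"][0]["message"]["content"])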

List Models with Status

The /v1/models endpoint now includes status information:

curl http://localhost:1234/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "default",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local",
      "status": "loaded"
    },
    {
      "id": "Qwen/Qwen3-4B",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local",
      "status": "unloaded"
    }
  ]
}

Rust SDK Usage

The mistralrs crate provides MultiModelBuilder for loading multiple models and Model methods for multi-model management.

Loading Multiple Models

By default, model IDs are the pipeline names (usually the HuggingFace model path, e.g., "google/gemma-3-4b-it"). You can provide custom aliases with add_model_with_alias for shorter IDs.

use mistralrs::{IsqType, MultiModelBuilder, TextModelBuilder, VisionModelBuilder, TextMessages, TextMessageRole};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Build a multi-model instance with a vision model and a text model
    // Use aliases for shorter model IDs in requests
    let model = MultiModelBuilder::new()
        .add_model_with_alias(
            "gemma-vision",
            VisionModelBuilder::new("google/gemma-3-4b-it")  // Vision model
                .with_isq(IsqType::Q4K)
                .with_logging(),
        )
        .add_model_with_alias(
            "qwen-text",
            TextModelBuilder::new("Qwen/Qwen3-4B")  // Text model
                .with_isq(IsqType::Q4K),
        )
        .with_default_model("gemma-vision")
        .build()
        .await?;

    // Send request to default model
    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello!");
    let response = model.send_chat_request(messages).await?;

    // Send request to specific model using its alias
    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello from Qwen!");
    let response = model.send_chat_request_with_model(messages, Some("qwen-text")).await?;

    Ok(())
}

Model Management Methods

#![allow(unused)]
fn main() {
// List all models (returns aliases if configured, otherwise pipeline names)
let models = model.list_models()?;

// Get/set default model
let default = model.get_default_model_id()?;
model.set_default_model_id("qwen-text")?;

// List models with status
let status = model.list_models_with_status()?;
// Returns Vec<(String, ModelStatus)> where ModelStatus is Loaded, Unloaded, or Reloading

// Check if a model is loaded
let is_loaded = model.is_model_loaded("gemma-vision")?;

// Unload a model to free memory
model.unload_model("gemma-vision")?;

// Reload when needed
model.reload_model("gemma-vision").await?;
}

Available _with_model Methods

All request methods have _with_model variants that accept an optional model ID:

  • send_chat_request_with_model(request, model_id: Option<&str>)
  • stream_chat_request_with_model(request, model_id: Option<&str>)
  • generate_image_with_model(..., model_id: Option<&str>)
  • generate_speech_with_model(prompt, model_id: Option<&str>)
  • generate_embeddings_with_model(request, model_id: Option<&str>)
  • tokenize_with_model(..., model_id: Option<&str>)
  • detokenize_with_model(..., model_id: Option<&str>)
  • config_with_model(model_id: Option<&str>)
  • max_sequence_length_with_model(model_id: Option<&str>)
  • re_isq_model_with_model(isq_type, model_id: Option<&str>)

When model_id is None, the default model is used. If aliases are configured, you can pass either the alias or the canonical pipeline name.

Python SDK Usage

The Python Runner class supports multi-model operations directly.

Basic Usage

from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture, Architecture

# Create a runner with a vision model (Gemma 3 4B)
runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-4b-it",
        arch=VisionArchitecture.Gemma3,
    ),
    in_situ_quant="Q4K",
)

# Or create a runner with a text model (Qwen3 4B)
# runner = Runner(
#     which=Which.Plain(
#         model_id="Qwen/Qwen3-4B",
#         arch=Architecture.Qwen3,
#     ),
#     in_situ_quant="Q4K",
# )

# List models
models = runner.list_models()
print(f"Available models: {models}")

# Get/set default model
default = runner.get_default_model_id()
runner.set_default_model_id("google/gemma-3-4b-it")

# Send request with specific model_id
request = ChatCompletionRequest(
    messages=[{"role": "user", "content": "Hello!"}]
)
response = runner.send_chat_completion_request(request, model_id=models[0])

If aliases are configured (for example via the server config or Rust MultiModelBuilder), list_models() will return those aliases and you can pass them in model_id. The canonical pipeline names remain accepted.

Model Management

# List models with their status
status = runner.list_models_with_status()
# Returns list of (model_id, status) tuples

# Check if a model is loaded
is_loaded = runner.is_model_loaded("google/gemma-3-4b-it")

# Unload a model to free memory
runner.unload_model("google/gemma-3-4b-it")

# Reload when needed
runner.reload_model("google/gemma-3-4b-it")

Request Methods with model_id

All request methods accept an optional model_id parameter:

# Chat completion
response = runner.send_chat_completion_request(request, model_id="model-id")

# Completion
response = runner.send_completion_request(request, model_id="model-id")

# Embeddings
embeddings = runner.send_embedding_request(request, model_id="model-id")

# Image generation
image = runner.generate_image(prompt, response_format, model_id="model-id")

# Speech generation
audio = runner.generate_audio(prompt, model_id="model-id")

# Tokenization
tokens = runner.tokenize_text(text, add_special_tokens=True, model_id="model-id")
text = runner.detokenize_text(tokens, skip_special_tokens=True, model_id="model-id")

When model_id is None or omitted, the default model is used.

Migration Guide

From MultiModel (Rust)

The MultiModel struct has been removed. Use Model directly with MultiModelBuilder:

#![allow(unused)]
fn main() {
// Old (deprecated)
let multi = MultiModel::new(...);
multi.send_chat_request_to_model(request, "model-id").await?;

// New - model IDs are pipeline names by default (aliases optional)
let model = MultiModelBuilder::new()
    .add_model(VisionModelBuilder::new("google/gemma-3-4b-it"))
    .add_model(TextModelBuilder::new("Qwen/Qwen3-4B"))
    .build()
    .await?;
model.send_chat_request_with_model(request, Some("Qwen/Qwen3-4B")).await?;
}

From MultiModelRunner (Python)

The MultiModelRunner class has been removed. Use Runner directly:

# Old (deprecated)
multi_runner = MultiModelRunner(runner)
multi_runner.send_chat_completion_request_to_model(request, "model-id")

# New - model IDs are the registered IDs (aliases if configured)
runner = Runner(which=Which.Plain(model_id="google/gemma-3-4b-it", ...))
runner.send_chat_completion_request(request, model_id="google/gemma-3-4b-it")

MCP (Model Context Protocol) Client

mistral.rs includes a built-in MCP client that allows models to connect to external tools and services through the Model Context Protocol. This enables automatic tool discovery and usage from any MCP-compatible server.

Quick Start

Examples below show HTTP (Hugging Face), Process (filesystem), and WebSocket transports. Replace hf_xxx with your actual Hugging Face token for HTTP examples.

Rust SDK

use mistralrs::{
    TextModelBuilder, McpClientConfig, McpServerConfig, McpServerSource,
    TextMessages, TextMessageRole,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Process example (filesystem server - recommended for getting started)
    let mcp_config = McpClientConfig {
        servers: vec![McpServerConfig {
            name: "Filesystem Tools".to_string(),
            source: McpServerSource::Process {
                command: "npx".to_string(),
                args: vec!["@modelcontextprotocol/server-filesystem".to_string(), ".".to_string()],
                work_dir: None,
                env: None,
            },
            ..Default::default()
        }],
        auto_register_tools: true,
        ..Default::default()
    };

    // Alternative HTTP example (Hugging Face MCP server)
    let _mcp_config_http = McpClientConfig {
        servers: vec![McpServerConfig {
            id: "hf_server".to_string(),
            name: "Hugging Face MCP".to_string(),
            source: McpServerSource::Http {
                url: "https://hf.co/mcp".to_string(),
                timeout_secs: Some(30),
                headers: None,
            },
            enabled: false, // Disabled by default
            tool_prefix: Some("hf".to_string()),
            resources: None,
            bearer_token: Some("hf_xxx".to_string()), // Your HF token
        }],
        auto_register_tools: true,
        tool_timeout_secs: Some(30),
        max_concurrent_calls: Some(5),
    };

    // Alternative WebSocket example
    let _mcp_config_websocket = McpClientConfig {
        servers: vec![McpServerConfig {
            name: "WebSocket Example".to_string(),
            source: McpServerSource::WebSocket {
                url: "wss://api.example.com/mcp".to_string(),
                timeout_secs: Some(30),
                headers: None,
            },
            enabled: false, // Disabled by default
            ..Default::default()
        }],
        auto_register_tools: true,
        ..Default::default()
    };

    // Build model with MCP support
    let model = TextModelBuilder::new("Qwen/Qwen3-4B")
        .with_mcp_client(mcp_config)
        .build()
        .await?;

    // Use the model - tools are automatically available
    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::User,
            "List the files in the current directory and create a test.txt file"
        );

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    
    Ok(())
}

Python SDK

import mistralrs

# Process example (filesystem server - recommended for getting started)
filesystem_server = mistralrs.McpServerConfigPy(
    name="Filesystem Tools",
    source=mistralrs.McpServerSourcePy.Process(
        command="npx",
        args=["@modelcontextprotocol/server-filesystem", "."],
        work_dir=None,
        env=None
    )
)

# Alternative HTTP example (Hugging Face MCP server)
hf_server = mistralrs.McpServerConfigPy(
    id="hf_server",
    name="Hugging Face MCP",
    source=mistralrs.McpServerSourcePy.Http(
        url="https://hf.co/mcp",
        timeout_secs=30,
        headers=None
    ),
    enabled=False,  # Disabled by default
    tool_prefix="hf",
    resources=None,
    bearer_token="hf_xxx"  # Your HF token
)

# Alternative WebSocket example
websocket_server = mistralrs.McpServerConfigPy(
    name="WebSocket Example",
    source=mistralrs.McpServerSourcePy.WebSocket(
        url="wss://api.example.com/mcp",
        timeout_secs=30,
        headers=None
    ),
    enabled=False  # Disabled by default
)

# Create MCP client config using filesystem server (others are disabled)
mcp_config = mistralrs.McpClientConfigPy(
    servers=[filesystem_server], # hf_server, websocket_server can be added when enabled
    auto_register_tools=True,
    tool_timeout_secs=30,
    max_concurrent_calls=5
)

# Build model with MCP support
runner = mistralrs.Runner(
    which=mistralrs.Which.Plain(
        model_id="Qwen/Qwen3-4B",
        arch=mistralrs.Architecture.Qwen3,
    ),
    mcp_client_config=mcp_config
)

# Use the model - tools are automatically available
res = runner.send_chat_completion_request(
    mistralrs.ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "List the files in the current directory and create a test.txt file"}
        ],
        max_tokens=500,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)

HTTP API

  1. Create mcp-config.json:

Process Example (Recommended for getting started):

{
  "servers": [{
    "name": "Filesystem Tools",
    "source": {
      "type": "Process",
      "command": "npx",
      "args": ["@modelcontextprotocol/server-filesystem", "."]
    }
  }],
  "auto_register_tools": true
}

Note: npx fetches the filesystem server automatically on first use; to fetch and run it manually, use: npx -y @modelcontextprotocol/server-filesystem .

HTTP Example (Hugging Face MCP Server):

{
  "servers": [
    {
      "name": "Hugging Face MCP",
      "source": {
        "type": "Http",
        "url": "https://hf.co/mcp",
        "timeout_secs": 30
      },
      "bearer_token": "hf_xxx",
      "tool_prefix": "hf",
      "enabled": false
    },
    {
      "name": "Filesystem Tools",
      "source": {
        "type": "Process",
        "command": "npx",
        "args": ["@modelcontextprotocol/server-filesystem", "."]
      }
    }
  ],
  "auto_register_tools": true,
  "tool_timeout_secs": 30,
  "max_concurrent_calls": 5
}

WebSocket Example:

{
  "servers": [
    {
      "name": "WebSocket Example",
      "source": {
        "type": "WebSocket",
        "url": "wss://api.example.com/mcp",
        "timeout_secs": 30
      },
      "enabled": false
    },
    {
      "name": "Filesystem Tools",
      "source": {
        "type": "Process",
        "command": "npx",
        "args": ["@modelcontextprotocol/server-filesystem", "."]
      }
    }
  ],
  "auto_register_tools": true
}

  2. Start server with MCP:

mistralrs serve \
  -p 1234 \
  --mcp-config mcp-config.json \
  -m Qwen/Qwen3-4B

  3. Use the API:

curl -X POST http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [
      {"role": "user", "content": "List the files in the current directory and create a test.txt file"}
    ],
    "max_tokens": 500,
    "temperature": 0.1
  }'

Key Features

  • Automatic Tool Discovery: Tools are discovered from MCP servers at startup
  • Multi-Server Support: Connect to multiple MCP servers simultaneously
  • Transport Flexibility: HTTP, WebSocket, and Process transports supported
  • Authentication: Bearer token support for secure connections
  • Tool Prefixing: Avoid naming conflicts between servers
  • Concurrency Control: Limit parallel tool executions
  • Timeout Management: Control individual tool execution timeouts

Next Steps

Common MCP Servers

  • Filesystem: @modelcontextprotocol/server-filesystem - Local file operations (Process)
  • Hugging Face: https://hf.co/mcp - Access HF models, datasets, and spaces (HTTP)
  • Postgres: @modelcontextprotocol/server-postgres - Database operations (Process)

Additional servers (install separately):

Replace placeholder tokens and URLs with actual values for your use case.

Troubleshooting

Common Issues

“MCP server failed to start” or “npx command not found”

  • Install Node.js and npm: curl -fsSL https://deb.nodesource.com/setup_lts.x | sudo -E bash - && sudo apt-get install -y nodejs
  • Fetch the filesystem server: npx -y @modelcontextprotocol/server-filesystem .

“No tools available” or “tools_available: false”

  • Check server logs for MCP connection errors
  • Verify the MCP config file path is correct
  • Ensure the MCP server process is running: ps aux | grep mcp

“Tool call failed” or timeout errors

  • Increase tool_timeout_secs in your config (default: 30)
  • Check max_concurrent_calls setting (start with 1-5)
  • Verify file permissions for filesystem operations

Authentication errors with HTTP servers

  • Double-check bearer_token values (e.g., HF tokens start with hf_)
  • Verify API endpoints are accessible: curl -H "Authorization: Bearer YOUR_TOKEN" https://hf.co/mcp

Need help?

MCP protocol support

mistralrs serve can speak MCP (the Model Context Protocol) in addition to the regular OpenAI-compatible REST API.

At a high level, MCP is an opinionated, tool-based JSON-RPC 2.0 protocol that lets clients interact with models through structured tool calls instead of specialised HTTP routes.
The implementation in Mistral.rs is powered by rust-mcp-sdk and automatically registers tools based on the modalities supported by the loaded model (text, vision, …).

Exposed tools:

Tool | Minimum input -> output modalities | Description
chat | Text -> Text | Wraps the OpenAI /v1/chat/completions endpoint

Running

Start the normal HTTP server and add the --mcp-port flag to expose an MCP endpoint in parallel on a separate port:

mistralrs serve \
  -p 1234 \
  --mcp-port 4321 \
  -m mistralai/Mistral-7B-Instruct-v0.3

Check if it’s working

The following curl command lists the tools advertised by the server and therefore serves as a quick smoke-test:

curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}'      

Example clients

Python

The reference Python SDK can be installed via:

pip install --upgrade mcp

Here is a minimal end-to-end example that initialises a session, lists the available tools and finally sends a chat request:

import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


SERVER_URL = "http://localhost:4321/mcp"


async def main() -> None:
    # The helper creates an SSE (Server-Sent-Events) transport under the hood
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:

            # --- INITIALIZE ---
            init_result = await session.initialize()
            print("Server info:", init_result.serverInfo)

            # --- LIST TOOLS ---
            tools = await session.list_tools()
            print("Available tools:", [t.name for t in tools.tools])

            # --- CALL TOOL ---
            resp = await session.call_tool(
                "chat",
                arguments={
                    "messages": [
                        {"role": "user", "content": "Hello MCP 👋"},
                        {"role": "assistant", "content": "Hi there!"}
                    ],
                    "maxTokens": 50,
                    "temperature": 0.7,
                },
            )
            # resp.content is a list[CallToolResultContentItem]; extract text parts
            text = "\n".join(c.text for c in resp.content if c.type == "text")
            print("Model replied:", text)

if __name__ == "__main__":
    asyncio.run(main())

Rust

use anyhow::Result;
use rust_mcp_sdk::{
    mcp_client::client_runtime,
    schema::{
        CallToolRequestParams, ClientCapabilities, CreateMessageRequest,
        Implementation, InitializeRequestParams, Message, LATEST_PROTOCOL_VERSION,
    },
    ClientSseTransport, ClientSseTransportOptions,
};

struct Handler;
#[async_trait::async_trait]
impl rust_mcp_sdk::mcp_client::ClientHandler for Handler {}

#[tokio::main]
async fn main() -> Result<()> {
    let transport = ClientSseTransport::new(
        "http://localhost:4321/mcp",
        ClientSseTransportOptions::default(),
    )?;

    let details = InitializeRequestParams {
        capabilities: ClientCapabilities::default(),
        client_info: Implementation { name: "mcp-client".into(), version: "0.1".into() },
        protocol_version: LATEST_PROTOCOL_VERSION.into(),
    };

    let client = client_runtime::create_client(details, transport, Handler);
    client.clone().start().await?;

    let req = CreateMessageRequest {
        model: "mistralai/Mistral-7B-Instruct-v0.3".into(),
        messages: vec![Message::user("Explain Rust ownership.")],
        ..Default::default()
    };

    let result = client
        .call_tool(CallToolRequestParams::new("chat", req.into()))
        .await?;

    println!("{}", result.content[0].as_text_content()?.text);
    client.shut_down().await?;
    Ok(())
}

HTTP

Call a tool:

curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "chat",
    "arguments": {
    "messages": [
      { "role": "system",    "content": "You are a helpful assistant." },
      { "role": "user",      "content": "Hello, what’s the time?" }
    ],
    "maxTokens": 50,
    "temperature": 0.7
  }
  }
}'

Initialize:

curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}'         

List tools:

curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}'      

Limitations & roadmap

The MCP support that ships with the current Mistral.rs release focuses on the happy-path. A few niceties have not yet been implemented and PRs are more than welcome:

  1. Streaming token responses (similar to the stream=true flag in the OpenAI API).
  2. An authentication layer; if you are exposing the MCP port publicly, run it behind a reverse proxy that handles auth (e.g. nginx + OIDC).
  3. Additional tools for other modalities such as vision or audio once the underlying crates stabilise.

If you would like to work on any of the above please open an issue first so the work can be coordinated.

MCP Configuration Reference

This page provides a complete reference for configuring the MCP client in mistral.rs.

Quick Start - Minimal Configuration

For simple use cases, you can now use a minimal configuration that leverages smart defaults:

{
  "servers": [{
    "name": "Hugging Face MCP Server",
    "source": {
      "type": "Http",
      "url": "https://hf.co/mcp"
    },
    "bearer_token": "hf_xxx"
  }]
}

This automatically provides:

  • UUID-based server ID: Unique identifier generated automatically
  • Enabled by default: Server is active without explicit enabled: true
  • UUID-based tool prefix: Prevents naming conflicts automatically
  • No timeouts: Tools and connections don’t timeout by default
  • Sequential execution: Only 1 concurrent tool call to prevent overwhelming servers
  • Auto-registration: Tools are automatically discovered and registered
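
The same minimal setup can be written with the Python SDK classes shown earlier; everything omitted falls back to the defaults listed above (the token is a placeholder):

import mistralrs

mcp_config = mistralrs.McpClientConfigPy(
    servers=[
        mistralrs.McpServerConfigPy(
            name="Hugging Face MCP Server",
            source=mistralrs.McpServerSourcePy.Http(
                url="https://hf.co/mcp",
                timeout_secs=None,  # No timeout by default
                headers=None,
            ),
            bearer_token="hf_xxx",  # Your HF token
        )
    ],
    auto_register_tools=True,
)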

Configuration Structure

McpClientConfig

The top-level configuration for the MCP client:

{
  "servers": [...],                    // Array of MCP server configurations
  "auto_register_tools": true,         // Automatically register discovered tools (default: true)
  "tool_timeout_secs": null,           // Timeout for individual tool calls, null = no timeout (default: null)
  "max_concurrent_calls": 1            // Maximum concurrent tool executions (default: 1)
}

McpServerConfig

Configuration for each MCP server:

{
  "id": "unique_id",                  // Unique identifier (default: UUID if not specified)
  "name": "Display Name",             // Human-readable name
  "source": {...},                    // Transport configuration (see below)
  "enabled": true,                    // Enable/disable this server (default: true)
  "tool_prefix": "mcp_abc123",         // Prefix for tool names (default: UUID-based if not specified)
  "resources": ["pattern"],           // Optional resource patterns
  "bearer_token": "token"             // Optional authentication token
}

Transport Source Configuration

HTTP Transport

{
  "type": "Http",
  "url": "https://api.example.com/mcp",
  "timeout_secs": null,               // Optional, null = no timeout (default)
  "headers": {                        // Optional custom headers
    "X-API-Version": "v1",
    "User-Agent": "mistral-rs/0.6.0"
  }
}

WebSocket Transport

{
  "type": "WebSocket", 
  "url": "wss://realtime.example.com/mcp",
  "timeout_secs": null,               // Optional, null = no timeout (default)
  "headers": {                        // Optional WebSocket headers
    "Origin": "https://mistral.rs",
    "Sec-WebSocket-Protocol": "mcp"
  }
}

Process Transport

{
  "type": "Process",
  "command": "mcp-server-filesystem",
  "args": ["--root", "/tmp"],         // Command arguments
  "work_dir": "/home/user",           // Optional working directory
  "env": {                            // Optional environment variables
    "MCP_LOG_LEVEL": "info"
  }
}

Field Reference

McpClientConfig Fields

Field | Type | Required | Default | Description
servers | Array | Yes | - | List of MCP server configurations
auto_register_tools | Boolean | No | true | Automatically discover and register tools at startup
tool_timeout_secs | Integer | No | null | Timeout in seconds for individual tool calls (null = no timeout)
max_concurrent_calls | Integer | No | 1 | Maximum number of concurrent tool executions

McpServerConfig Fields

Field | Type | Required | Default | Description
id | String | No | UUID | Unique identifier for the server (UUID generated if not provided)
name | String | Yes | - | Human-readable server name
source | Object | Yes | - | Transport configuration
enabled | Boolean | No | true | Whether to connect to this server
tool_prefix | String | No | UUID-based | Prefix to add to all tool names (UUID-based if not provided)
resources | Array | No | None | Resource URI patterns to subscribe to
bearer_token | String | No | None | Bearer token for authentication

Transport Source Fields

HTTP Source

Field | Type | Required | Default | Description
type | String | Yes | - | Must be “Http”
url | String | Yes | - | HTTP/HTTPS URL of the MCP server
timeout_secs | Integer | No | null | Request timeout in seconds (null = no timeout)
headers | Object | No | None | Additional HTTP headers

WebSocket Source

Field | Type | Required | Default | Description
type | String | Yes | - | Must be “WebSocket”
url | String | Yes | - | WS/WSS URL of the MCP server
timeout_secs | Integer | No | null | Connection timeout in seconds (null = no timeout)
headers | Object | No | None | WebSocket handshake headers

Process Source

Field | Type | Required | Default | Description
type | String | Yes | - | Must be “Process”
command | String | Yes | - | Executable command to run
args | Array | No | [] | Command line arguments
work_dir | String | No | Current dir | Working directory
env | Object | No | None | Environment variables

Authentication

Bearer Token

The bearer_token field is automatically added as an Authorization: Bearer <token> header for HTTP and WebSocket connections.

{
  "bearer_token": "hf_AbCdEfGhIjKlMnOpQrStUvWxYz"
}

Custom Headers

For other authentication schemes, use the headers field:

{
  "source": {
    "type": "Http",
    "url": "https://api.example.com/mcp",
    "headers": {
      "X-API-Key": "your-api-key",
      "X-Client-ID": "your-client-id"
    }
  }
}

Tool Naming

Without Prefix

Tools are registered with their original names:

  • MCP tool: search → Registered as: search

With Prefix

When tool_prefix is set, all tools from that server get prefixed:

  • MCP tool: search with prefix web → Registered as: web_search

This prevents conflicts when multiple servers provide tools with the same name.

Resource Patterns

The resources field accepts glob-like patterns:

{
  "resources": [
    "file://**/*.txt",      // All .txt files
    "file://data/**",       // Everything under data/
    "db://users/*",         // All user records
    "api://v1/metrics"      // Specific endpoint
  ]
}

Environment Variables

Using Environment Variables in Configuration

While JSON doesn’t support environment variables directly, you can use them when building configurations programmatically:

#![allow(unused)]
fn main() {
McpServerConfig {
    bearer_token: std::env::var("HF_TOKEN").ok(),
    source: McpServerSource::Http {
        url: std::env::var("MCP_SERVER_URL")
            .unwrap_or_else(|_| "https://hf.co/mcp".to_string()),
        // ...
    },
    // ...
}
}

import os

McpServerConfigPy(
    bearer_token=os.getenv("HF_TOKEN"),
    source=McpServerSourcePy.Http(
        url=os.getenv("MCP_SERVER_URL", "https://hf.co/mcp")
    )
)

Variable | Description
MCP_CONFIG_PATH | Path to MCP configuration file
MCP_LOG_LEVEL | Logging level for MCP operations
MCP_POOL_SIZE | Connection pool size for HTTP/WebSocket

Validation Rules

  1. Unique Server IDs: All server id values must be unique
  2. Valid URLs: HTTP URLs must start with http:// or https://
  3. Valid WebSocket URLs: Must start with ws:// or wss://
  4. Executable Commands: Process commands must be executable
  5. Tool Name Conflicts: Use tool_prefix to avoid conflicts

Example Configurations

Single Server (Hugging Face) - Minimal

{
  "servers": [{
    "name": "Hugging Face MCP Server",
    "source": {
      "type": "Http",
      "url": "https://hf.co/mcp"
    },
    "bearer_token": "hf_xxx"
  }]
}

Single Server (Hugging Face) - Full Configuration

{
  "servers": [{
    "id": "hf",
    "name": "Hugging Face MCP",
    "source": {
      "type": "Http",
      "url": "https://hf.co/mcp",
      "timeout_secs": 30
    },
    "enabled": true,
    "tool_prefix": "hf",
    "bearer_token": "hf_xxx"
  }],
  "auto_register_tools": true,
  "tool_timeout_secs": 30,
  "max_concurrent_calls": 5
}

Multi-Server Setup

{
  "servers": [
    {
      "id": "hf",
      "name": "Hugging Face",
      "source": {"type": "Http", "url": "https://hf.co/mcp"},
      "tool_prefix": "hf",
      "bearer_token": "hf_xxx"
    },
    {
      "id": "github",
      "name": "GitHub API",
      "source": {"type": "Http", "url": "https://api.github.com/mcp"},
      "tool_prefix": "gh",
      "bearer_token": "ghp_xxx"
    },
    {
      "id": "local_fs",
      "name": "Filesystem",
      "source": {
        "type": "Process",
        "command": "mcp-server-filesystem",
        "args": ["--root", "/data", "--readonly"]
      },
      "tool_prefix": "fs"
    }
  ],
  "auto_register_tools": true,
  "tool_timeout_secs": 30,
  "max_concurrent_calls": 10
}

MCP Transport Types

mistral.rs supports three transport types for connecting to MCP servers, each optimized for different use cases.

HTTP Transport

Best for public APIs, RESTful services, and servers behind load balancers.

Configuration

{
  "source": {
    "type": "Http",
    "url": "https://api.example.com/mcp",
    "timeout_secs": 30,
    "headers": {
      "X-API-Version": "v1",
      "User-Agent": "mistral-rs/0.6.0"
    }
  },
  "bearer_token": "your-api-token"
}

Features

  • Server-Sent Events (SSE) support for streaming responses
  • Custom headers for API versioning or client identification
  • Bearer token authentication (added as Authorization: Bearer <token>)
  • Configurable timeouts
  • Standard HTTP semantics

Example: Hugging Face MCP

#![allow(unused)]
fn main() {
McpServerSource::Http {
    url: "https://hf.co/mcp".to_string(),
    timeout_secs: Some(30),
    headers: None,
}
}

WebSocket Transport

Best for real-time applications, bidirectional communication, and low-latency requirements.

Configuration

{
  "source": {
    "type": "WebSocket",
    "url": "wss://realtime.example.com/mcp",
    "timeout_secs": 60,
    "headers": {
      "Origin": "https://mistral.rs",
      "Sec-WebSocket-Protocol": "mcp"
    }
  },
  "bearer_token": "your-websocket-token"
}

Features

  • Persistent connections reduce handshake overhead
  • Server-initiated notifications
  • Lower latency for frequent tool calls
  • Automatic reconnection handling
  • WebSocket-specific headers support

Example: Real-time Data Feed

#![allow(unused)]
fn main() {
McpServerSource::WebSocket {
    url: "wss://data.example.com/mcp".to_string(),
    timeout_secs: Some(60),
    headers: Some(headers),
}
}

Process Transport

Best for local tools, development servers, and sandboxed environments.

Configuration

{
  "source": {
    "type": "Process",
    "command": "mcp-server-filesystem",
    "args": ["--root", "/tmp", "--readonly"],
    "work_dir": "/home/user/workspace",
    "env": {
      "MCP_LOG_LEVEL": "info",
      "MCP_TIMEOUT": "30"
    }
  }
}

Features

  • No network overhead
  • Process isolation for security
  • Direct stdin/stdout communication
  • Environment variable configuration
  • Working directory control
  • No authentication needed (process inherits permissions)

Example: Filesystem Server

#![allow(unused)]
fn main() {
McpServerSource::Process {
    command: "mcp-server-filesystem".to_string(),
    args: vec!["--root".to_string(), "/tmp".to_string()],
    work_dir: None,
    env: None,
}
}

Transport Selection Guide

Use Case | Recommended Transport | Why
Public APIs | HTTP | Standard auth, caching, load balancing
Local tools | Process | No network, process isolation
Real-time data | WebSocket | Low latency, server push
Corporate proxies | HTTP | Proxy support, standard ports
Development | Process | Easy debugging, no network setup
Interactive apps | WebSocket | Bidirectional, persistent connection

Security Considerations

HTTP

  • Always use HTTPS in production
  • Bearer tokens transmitted with each request
  • Consider token rotation strategies

WebSocket

  • Use WSS (WebSocket Secure) in production
  • Bearer token sent during handshake
  • Connection persists with authenticated state

Process

  • Inherits user permissions
  • Sandboxing via work_dir and env
  • No network exposure

Performance Tips

  1. HTTP: Enable keep-alive, use connection pooling
  2. WebSocket: Reuse connections, handle reconnection gracefully
  3. Process: Minimize startup time, use long-running processes

Error Handling

All transports implement automatic retry with exponential backoff:

  • Initial retry: 1 second
  • Max retry: 60 seconds
  • Max attempts: 5

Custom retry behavior can be configured per server.
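
Purely as an illustration of the schedule described above (not the actual client code), the backoff looks like:

import time

def retry_with_backoff(connect, max_attempts=5, initial_delay=1.0, max_delay=60.0):
    """Retry `connect` with exponential backoff: 1s, 2s, 4s, ... capped at 60s."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)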

Advanced MCP Usage

This guide covers advanced MCP client configurations and usage patterns.

Multi-Server Configuration

Connect to multiple MCP servers simultaneously to access different tool sets:

#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
    servers: vec![
        // Hugging Face for ML tools
        McpServerConfig {
            id: "hf_server".to_string(),
            name: "Hugging Face MCP".to_string(),
            source: McpServerSource::Http {
                url: "https://hf.co/mcp".to_string(),
                timeout_secs: Some(30),
                headers: None,
            },
            enabled: true,
            tool_prefix: Some("hf".to_string()),
            resources: None,
            bearer_token: Some("hf_xxx".to_string()),
        },
        // Local filesystem access
        McpServerConfig {
            id: "fs_server".to_string(),
            name: "Filesystem MCP".to_string(),
            source: McpServerSource::Process {
                command: "mcp-server-filesystem".to_string(),
                args: vec!["--root".to_string(), "/data".to_string()],
                work_dir: None,
                env: None,
            },
            enabled: true,
            tool_prefix: Some("fs".to_string()),
            resources: Some(vec!["file://**".to_string()]),
            bearer_token: None,
        },
        // GitHub API access
        McpServerConfig {
            id: "github_server".to_string(),
            name: "GitHub MCP".to_string(),
            source: McpServerSource::Http {
                url: "https://api.github.com/mcp".to_string(),
                timeout_secs: Some(45),
                headers: Some(HashMap::from([
                    ("Accept".to_string(), "application/vnd.github.v3+json".to_string()),
                ])),
            },
            enabled: true,
            tool_prefix: Some("gh".to_string()),
            resources: None,
            bearer_token: Some("ghp_xxx".to_string()),
        },
    ],
    auto_register_tools: true,
    tool_timeout_secs: Some(30),
    max_concurrent_calls: Some(10),
};
}

Tool Prefixing Strategy

When using multiple servers, tool prefixes prevent naming conflicts:

{
  "servers": [
    {
      "id": "server1",
      "tool_prefix": "s1",
      // Tool "search" becomes "s1_search"
    },
    {
      "id": "server2", 
      "tool_prefix": "s2",
      // Tool "search" becomes "s2_search"
    }
  ]
}

Custom Headers and Authentication

API Key in Headers

#![allow(unused)]
fn main() {
use std::collections::HashMap;

let mut headers = HashMap::new();
headers.insert("X-API-Key".to_string(), "your-api-key".to_string());
headers.insert("X-Client-Version".to_string(), "1.0.0".to_string());

let source = McpServerSource::Http {
    url: "https://api.example.com/mcp".to_string(),
    timeout_secs: Some(30),
    headers: Some(headers),
};
}

OAuth2 Bearer Token

#![allow(unused)]
fn main() {
McpServerConfig {
    // ...
    bearer_token: Some("your-oauth2-token".to_string()),
    // Automatically added as: Authorization: Bearer your-oauth2-token
}
}

Resource Subscriptions

Subscribe to specific resource patterns from MCP servers:

#![allow(unused)]
fn main() {
McpServerConfig {
    id: "data_server".to_string(),
    // ...
    resources: Some(vec![
        "file://data/**/*.json".to_string(),  // All JSON files in data/
        "db://users/*".to_string(),            // All user records
        "api://v1/metrics".to_string(),        // Specific API endpoint
    ]),
    // ...
}
}

Concurrency and Rate Limiting

Global Concurrency Control

#![allow(unused)]
fn main() {
McpClientConfig {
    // ...
    max_concurrent_calls: Some(5),  // Max 5 tools executing simultaneously
}
}
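
max_concurrent_calls bounds how many tool calls are in flight at once. Conceptually this behaves like a semaphore with that many permits; the sketch below uses plain tokio (it is not the client's internal implementation) to show what a cap of 5 means when ten calls arrive:

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // A cap of 5, as in max_concurrent_calls: Some(5).
    let limiter = Arc::new(Semaphore::new(5));
    let mut handles = Vec::new();

    for i in 0..10 {
        // Waits here whenever five "tool calls" are already running.
        let permit = limiter.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            println!("tool call {i} running");
            tokio::time::sleep(Duration::from_millis(100)).await; // stand-in for real work
            drop(permit); // permit returns when the call finishes
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}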

Per-Tool Timeouts

#![allow(unused)]
fn main() {
McpClientConfig {
    // ...
    tool_timeout_secs: Some(30),  // Each tool call times out after 30s
}
}

Custom Rate Limiting

# Python example with custom rate limiting
import time
from collections import deque

class RateLimitedMcpRunner:
    def __init__(self, runner, max_calls_per_minute=60):
        self.runner = runner
        self.max_calls = max_calls_per_minute
        self.call_times = deque()
    
    def send_chat_completion_request(self, request):
        # Drop calls that happened more than a minute ago
        now = time.time()
        while self.call_times and self.call_times[0] < now - 60:
            self.call_times.popleft()
        
        # If we are at the limit, wait until the oldest call ages out
        if len(self.call_times) >= self.max_calls:
            sleep_time = 60 - (now - self.call_times[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
        
        # Record the actual call time and forward the request
        self.call_times.append(time.time())
        return self.runner.send_chat_completion_request(request)

Environment-Specific Configuration

Development vs Production

#![allow(unused)]
fn main() {
let mcp_config = if cfg!(debug_assertions) {
    McpClientConfig {
        servers: vec![/* development servers */],
        tool_timeout_secs: Some(60),  // Longer timeouts for debugging
        max_concurrent_calls: Some(1), // Sequential execution for debugging
        // ...
    }
} else {
    McpClientConfig {
        servers: vec![/* production servers */],
        tool_timeout_secs: Some(10),   // Strict timeouts
        max_concurrent_calls: Some(20), // Higher concurrency
        // ...
    }
};
}

Environment Variables

#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
    servers: vec![
        McpServerConfig {
            // ...
            bearer_token: std::env::var("HF_TOKEN").ok(),
            source: McpServerSource::Http {
                url: std::env::var("MCP_SERVER_URL")
                    .unwrap_or_else(|_| "https://hf.co/mcp".to_string()),
                // ...
            },
            // ...
        },
    ],
    // ...
};
}

Error Handling and Fallbacks

Graceful Degradation

#![allow(unused)]
fn main() {
let mcp_config = McpClientConfig {
    servers: vec![
        // Primary server
        McpServerConfig {
            id: "primary".to_string(),
            enabled: true,
            // ...
        },
        // Fallback server
        McpServerConfig {
            id: "fallback".to_string(),
            enabled: check_primary_health().is_err(),
            // ...
        },
    ],
    // ...
};
}
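
check_primary_health() above is your own helper, not part of mistral.rs. A minimal sketch, assuming the primary server exposes an HTTP endpoint you can probe and using the reqwest crate (the URL and crate choice are illustrative):

// Hypothetical health probe for the primary MCP server. Any failure
// (timeout, connection error, non-2xx status) enables the fallback above.
fn check_primary_health() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::builder()
        .timeout(std::time::Duration::from_secs(5))
        .build()?;
    let resp = client.get("https://primary.example.com/health").send()?;
    if resp.status().is_success() {
        Ok(())
    } else {
        Err(format!("primary unhealthy: {}", resp.status()).into())
    }
}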

Tool-Specific Error Handling

# Handle specific tool errors
try:
    response = runner.send_chat_completion_request(request)
except Exception as e:
    if "tool_timeout" in str(e):
        print("Tool execution timed out, trying with longer timeout...")
        # Retry with extended timeout
    elif "tool_not_found" in str(e):
        print("Tool not available, falling back to built-in response...")
        # Fallback logic
    else:
        raise  # Unknown error: re-raise instead of silently swallowing it

Monitoring and Debugging

Enable Debug Logging

#![allow(unused)]
fn main() {
std::env::set_var("RUST_LOG", "mistralrs_mcp=debug");
env_logger::init();
}

Tool Call Inspection

#![allow(unused)]
fn main() {
let response = model.send_chat_request(messages).await?;

// Check if tools were called
if let Some(tool_calls) = &response.choices[0].message.tool_calls {
    for call in tool_calls {
        println!("Tool: {}", call.function.name);
        println!("Args: {}", call.function.arguments);
        println!("ID: {}", call.id);
    }
}
}

Performance Optimization

Connection Pooling

HTTP and WebSocket transports automatically use connection pooling. Configure pool size:

#![allow(unused)]
fn main() {
// Set via environment variable
std::env::set_var("MCP_POOL_SIZE", "10");
}

Caching Tool Responses

from functools import lru_cache
import json

@lru_cache(maxsize=100)
def cached_tool_call(tool_name, args_json):
    args = json.loads(args_json)
    result = ...  # your tool execution logic goes here
    return result

# Use with MCP tools that have deterministic outputs

Security Best Practices

  1. Token Rotation: Implement automatic token refresh for long-running applications
  2. Least Privilege: Only enable required tools and resources
  3. Audit Logging: Log all tool calls for security monitoring (see the sketch after this list)
  4. Network Isolation: Use Process transport for sensitive local operations
  5. Input Validation: MCP servers should validate all tool inputs
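
As one way to approach item 3, the same response fields used in the Tool Call Inspection snippet above can be written to a dedicated audit log. The log target and format below are illustrative; model and messages are assumed to be set up as in the earlier examples.

// Audit every tool call the model made for this request.
let response = model.send_chat_request(messages).await?;
if let Some(tool_calls) = &response.choices[0].message.tool_calls {
    for call in tool_calls {
        log::info!(
            target: "mcp_audit",
            "tool={} id={} args={}",
            call.function.name,
            call.id,
            call.function.arguments
        );
    }
}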

Configuration Reference

This document covers environment variables and server configuration for mistral.rs.

Runtime Environment Variables

| Variable | Description |
|----------|-------------|
| MISTRALRS_DEBUG=1 | Enable debug mode: outputs tensor info files for GGUF/GGML models, increases logging verbosity |
| MISTRALRS_NO_MMAP=1 | Disable memory-mapped file loading, forcing all tensor data into memory |
| MISTRALRS_NO_MLA=1 | Disable MLA (Multi-head Latent Attention) optimization for DeepSeek V2/V3 and GLM-4.7-Flash |
| MISTRALRS_ISQ_SINGLETHREAD=1 | Force ISQ (In-Situ Quantization) to run single-threaded |
| MCP_CONFIG_PATH | Fallback path for MCP client configuration (used if --mcp-config not provided) |
| KEEP_ALIVE_INTERVAL | SSE keep-alive interval in milliseconds (default: 10000) |
| HF_HUB_CACHE | Override Hugging Face Hub cache directory |

Build-Time Environment Variables

| Variable | Description |
|----------|-------------|
| MISTRALRS_METAL_PRECOMPILE=0 | Skip Metal kernel precompilation (useful for CI) |
| NVCC_CCBIN | Set CUDA compiler path |
| CUDA_NVCC_FLAGS=-fPIE | Required on some Linux distributions |
| CUDA_COMPUTE_CAP | Override CUDA compute capability (e.g., “80” for RTX 3090) |

Server Defaults

When running the HTTP server with mistralrs serve, these defaults apply:

| Setting | Default Value |
|---------|---------------|
| Server IP | 0.0.0.0 (all interfaces) |
| Max request body | 50 MB |
| Max running sequences | 16 |
| Prefix cache count | 16 |
| SSE keep-alive | 10 seconds |
| PagedAttention (CUDA) | Enabled |
| PagedAttention (Metal) | Disabled |
| PA GPU memory usage | 90% of free memory |
| PA block size | 32 tokens |

Multi-Node Distributed Configuration

For multi-node setups, configure the head node and workers using environment variables.

Head Node

| Variable | Description |
|----------|-------------|
| MISTRALRS_MN_GLOBAL_WORLD_SIZE | Total number of devices across all nodes |
| MISTRALRS_MN_HEAD_NUM_WORKERS | Number of worker nodes |
| MISTRALRS_MN_HEAD_PORT | Port for head node communication |

Worker Nodes

| Variable | Description |
|----------|-------------|
| MISTRALRS_MN_WORKER_SERVER_ADDR | Address of head server to connect to |
| MISTRALRS_MN_WORKER_ID | This worker’s ID |
| MISTRALRS_MN_LOCAL_WORLD_SIZE | Number of GPUs on this node |
| MISTRALRS_NO_NCCL=1 | Disable NCCL (use alternative backend) |
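
As a worked example, take two nodes with 8 GPUs each: the head node sets MISTRALRS_MN_GLOBAL_WORLD_SIZE=16, MISTRALRS_MN_HEAD_NUM_WORKERS=1, and a MISTRALRS_MN_HEAD_PORT, while the single worker sets the three worker variables below. The sketch only illustrates those values by spawning the worker process from Rust; in practice you would export the variables in your shell or service manager, and the address, port, and model name are placeholders.

use std::process::Command;

// Illustrative only: worker side of a 2-node, 8-GPU-per-node (16 device) setup.
// The head address, worker ID, and model are placeholders.
fn main() {
    Command::new("mistralrs")
        .args(["serve", "-m", "Qwen/Qwen3-4B"])
        .env("MISTRALRS_MN_WORKER_SERVER_ADDR", "10.0.0.1:12345")
        .env("MISTRALRS_MN_WORKER_ID", "0")
        .env("MISTRALRS_MN_LOCAL_WORLD_SIZE", "8")
        .spawn()
        .expect("failed to start worker node");
}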

See Also

Engine Internals

This document describes internal engine behaviors in mistral.rs.

Overview

The mistral.rs engine manages model inference through a background thread pool. Each loaded model runs in its own engine thread, which handles request queuing, batching, and execution.

Warmup Run

When a text or vision model is loaded in a multi-threaded runtime, mistral.rs automatically performs a warmup (“dummy”) run:

  • Sends a short completion request (“hello” with max 1 token) to initialize CUDA kernels and caches
  • Logs “Beginning dummy run.” when starting and “Dummy run completed in Xs.” when finished
  • Helps ensure more consistent performance for the first real user request
  • Only runs for text and vision models (not diffusion/speech)

This warmup ensures that CUDA kernel compilation and memory allocation happen during model loading rather than during the first user request.
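
The warmup runs automatically, so nothing needs to be called by hand. If you want to issue a similar warm request yourself from the Rust SDK (for instance after an engine reboot), a minimal sketch looks like the following, assuming model is a loaded model handle and that the TextMessages / TextMessageRole helpers from the Rust SDK are available:

use mistralrs::{TextMessageRole, TextMessages};

// Mirrors the engine's own dummy run: a tiny "hello" request sent before real
// traffic. `model` is assumed to have been built with the Rust SDK.
let warmup = TextMessages::new().add_message(TextMessageRole::User, "hello");
let _ = model.send_chat_request(warmup).await?;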

Automatic Engine Recovery

If the inference engine thread dies unexpectedly (e.g., due to a panic), mistral.rs can automatically recover:

  • Detects dead engine threads when sending requests
  • Automatically reboots the engine using saved configuration
  • Logs “Engine {model_id} is dead, rebooting” followed by “Successfully rebooted engine {model_id}”
  • Preserves all original configuration including KV cache settings, prefix cache, and tool callbacks

This ensures high availability without manual intervention.

Thread Model

Each model loaded in mistral.rs runs in its own dedicated engine thread:

  1. Main Thread: Handles HTTP requests, CLI interaction, and dispatches work to engine threads
  2. Engine Threads: Each loaded model has a dedicated thread for inference
  3. Background Workers: Tokenization and other preprocessing can run in parallel

For multi-model setups, each model gets its own engine thread, allowing true parallel inference across different models.
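
For example, with the Rust SDK, requests to two different models can be awaited concurrently. In the sketch below, model_a, model_b, messages_a, and messages_b are assumed to be handles and messages built as in the earlier examples; each request is executed on its own model's engine thread:

// Hypothetical handles for two models loaded via the Rust SDK. Each request is
// queued on its own model's engine thread, so the two calls run in parallel.
let (resp_a, resp_b) = tokio::join!(
    model_a.send_chat_request(messages_a),
    model_b.send_chat_request(messages_b),
);
let (resp_a, resp_b) = (resp_a?, resp_b?);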

See Also