
## Quick Links
| I want to… | Go to… |
|---|---|
| Install mistral.rs | Installation Guide |
| Understand cargo features | Cargo Features |
| Run a model | CLI Reference |
| Use the HTTP API | HTTP Server |
| Fix an error | Troubleshooting |
| Configure environment | Configuration |
| Check model support | Supported Models |
## Getting Started
- Installation Guide - Install mistral.rs on your system
- Cargo Features - Complete cargo features reference
- CLI Reference - Complete CLI command reference
- CLI TOML Configuration - Configure via TOML files
- Troubleshooting - Common issues and solutions
## SDKs & APIs
- Python SDK - Python package documentation
- Python Installation - Python SDK installation guide
- Rust SDK - Rust crate documentation
- HTTP Server - OpenAI-compatible HTTP API
- OpenResponses API - Stateful conversation API
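The HTTP server speaks the OpenAI chat-completions wire format, so any OpenAI-compatible client can talk to it. A minimal sketch of a request body (the model id, sampling fields, and the host/port you POST to are illustrative; see the HTTP Server doc for the real defaults):

```python
import json

# Illustrative OpenAI-style chat-completion request body. POST the encoded
# JSON to http://<host>:<port>/v1/chat/completions on your running server.
payload = {
    "model": "default",  # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

body = json.dumps(payload)
```

Because the format matches OpenAI's, the official `openai` Python client can be pointed at the local server by overriding its base URL.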
## Models

### By Category
- Supported Models - Complete model list and compatibility
- Vision Models - Vision model overview
- Image Generation - Diffusion models
- Embeddings - Embedding model overview
### Model-Specific Guides
**Text Models:**
- DeepSeek V2 | DeepSeek V3
- Gemma 2 | Gemma 3 | Gemma 3n
- GLM4 | GLM-4.7-Flash | GLM-4.7
- Qwen 3 | SmolLM3 | GPT-OSS
**Vision Models:**
- Idefics 2 | Idefics 3
- LLaVA | Llama 3.2 Vision | Llama 4
- MiniCPM-o 2.6 | Mistral 3
- Phi 3.5 MoE | Phi 3.5 Vision | Phi 4 Multimodal
- Qwen2-VL | Qwen3-VL
**Other Models:**
## Quantization & Optimization
- Quantization Overview - All supported quantization methods
- ISQ (In-Situ Quantization) - Quantize models at load time
- UQFF Format - Pre-quantized model format | Layout
- Topology - Per-layer quantization and device mapping
- Importance Matrix - Improve ISQ accuracy
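To make the quantization docs above concrete, here is a toy sketch of symmetric int8 weight quantization, the kind of per-tensor transform that load-time (ISQ-style) quantization applies. This is pure-Python concept code, not mistral.rs internals:

```python
# Concept sketch: symmetric int8 quantization of one weight tensor.
# Each weight is mapped to an integer in [-128, 127] via a per-tensor scale.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale 0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Per-element reconstruction error is bounded by scale / 2.
```

An importance matrix refines the idea by weighting which values matter most when choosing quantization parameters, which is why it improves ISQ accuracy.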
## Adapters & Model Customization
- Adapter Models - LoRA and X-LoRA support
- LoRA/X-LoRA Examples
- Non-Granular Scalings - X-LoRA optimization
- AnyMoE - Create MoE models from dense models
- MatFormer - Dynamic model sizing
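The LoRA math behind the adapter docs above is compact enough to sketch: the adapted weight is the frozen base weight plus a scaled low-rank product, `W + (alpha / r) * B @ A`. A toy pure-Python illustration (sizes and values are arbitrary):

```python
# Concept sketch of a LoRA update: W_adapted = W + (alpha / r) * B @ A,
# where A is (r x k) and B is (d x r), so the update has rank at most r.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, k, r, alpha = 2, 2, 1, 2.0
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d x k)
B = [[0.5], [0.25]]            # learned factor (d x r)
A = [[1.0, 2.0]]               # learned factor (r x k)

delta = matmul(B, A)           # rank-1 update (d x k)
scale = alpha / r
W_adapted = [[W[i][j] + scale * delta[i][j] for j in range(k)]
             for i in range(d)]
```

X-LoRA extends this by mixing several such adapters with learned, token-dependent scalings.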
## Performance & Hardware
- Device Mapping - Multi-GPU and CPU offloading
- PagedAttention - Efficient KV cache management
- Speculative Decoding - Accelerate generation with draft models
- Flash Attention - Accelerated attention
- MLA - Multi-head Latent Attention
- Distributed Inference
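PagedAttention's core idea is to store the KV cache in fixed-size blocks handed out from a shared pool, instead of one contiguous buffer per sequence. A toy sketch of that bookkeeping (block and pool sizes are arbitrary; this illustrates the idea, not the engine's implementation):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    """Toy free-list allocator mirroring PagedAttention-style bookkeeping."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids

    def blocks_needed(self, seq_len):
        return -(-seq_len // BLOCK_SIZE)  # ceiling division

    def allocate(self, seq_len):
        n = self.blocks_needed(seq_len)
        if n > len(self.free):
            raise MemoryError("KV cache exhausted")
        # The returned block table maps logical block index -> physical block.
        return [self.free.pop() for _ in range(n)]

alloc = BlockAllocator(num_blocks=64)
table = alloc.allocate(seq_len=40)  # 40 tokens fit in 3 blocks of 16
```

Because sequences only hold whole blocks, memory fragmentation is bounded by one partially filled block per sequence, which is what makes high batch counts affordable.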
## Features
- Tool Calling - Function calling support
- Web Search - Integrated web search
- Chat Templates - Template customization
- Sampling Options - Generation parameters
- TOML Selector - Model selection syntax
- Multi-Model Support - Load multiple models
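Tool calling uses the OpenAI function-calling schema: requests carry JSON tool definitions, and the model may answer with a structured tool call. A sketch of one tool definition (the `get_weather` function and its parameters are made up for illustration; see the Tool Calling doc for specifics):

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
            },
            "required": ["city"],
        },
    },
}

# This fragment is merged into a chat-completion request body.
request_fragment = {"tools": [get_weather_tool], "tool_choice": "auto"}
encoded = json.dumps(request_fragment)
```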
## MCP (Model Context Protocol)
- MCP Client - Connect to external tools
- MCP Server - Serve models over MCP
- MCP Configuration
- MCP Transports
- MCP Advanced Usage
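Whatever transport is used, MCP traffic is JSON-RPC 2.0, so the client and server docs above ultimately reduce to exchanging envelopes like this one. The `tools/list` method comes from the MCP specification; the request id is arbitrary:

```python
import json

# JSON-RPC 2.0 envelope for an MCP request asking a server for its tools.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

wire = json.dumps(request)
```

The server replies with a matching-id envelope whose result lists the tools it exposes; `tools/call` then invokes one of them.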
## Reference
- Configuration - Environment variables and server defaults
- Engine Internals - Engine behaviors and recovery
- Supported Models - Complete compatibility tables
## Contributing
See the main README for contribution guidelines.