
## Quick Links
| I want to… | Go to… |
|---|---|
| Install mistral.rs | Installation Guide |
| Understand cargo features | Cargo Features |
| Run a model | CLI Reference |
| Use the HTTP API | HTTP Server |
| Fix an error | Troubleshooting |
| Configure environment | Configuration |
| Check model support | Supported Models |
## Getting Started
- Installation Guide - Install mistral.rs on your system
- Cargo Features - Complete cargo features reference
- CLI Reference - Complete CLI command reference
- CLI TOML Configuration - Configure via TOML files
- Troubleshooting - Common issues and solutions
## SDKs & APIs
- Python SDK - Python package documentation
- Python Installation - Python SDK installation guide
- Rust SDK - Rust crate documentation
- HTTP Server - OpenAI-compatible HTTP API
- OpenResponses API - Stateful conversation API
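The HTTP server speaks the OpenAI chat-completions wire format, so any OpenAI-compatible client can talk to it. A minimal sketch of a request body (the model id, sampling fields, and the host/port you POST to are illustrative; see the HTTP Server doc for the real defaults):

```python
import json

# Illustrative OpenAI-style chat-completion request body. POST the encoded
# JSON to http://<host>:<port>/v1/chat/completions on your running server.
payload = {
    "model": "default",  # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

body = json.dumps(payload)
```

Because the format matches OpenAI's, the official `openai` Python client can be pointed at the local server by overriding its base URL.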
## Models

### By Category
- Supported Models - Complete model list and compatibility
- Vision Models - Vision model overview
- Image Generation - Diffusion models
- Embeddings - Embedding model overview
### Model-Specific Guides
**Text Models:**
- DeepSeek V2 | DeepSeek V3
- Gemma 2 | Gemma 3 | Gemma 3n
- GLM4 | GLM-4.7-Flash | GLM-4.7
- Qwen 3 | SmolLM3 | GPT-OSS
**Vision Models:**
- Idefics 2 | Idefics 3
- LLaVA | Llama 3.2 Vision | Llama 4
- MiniCPM-o 2.6 | Mistral 3
- Phi 3.5 MoE | Phi 3.5 Vision | Phi 4 Multimodal
- Qwen2-VL | Qwen3-VL
**Other Models:**
## Quantization & Optimization
- Quantization Overview - All supported quantization methods
- ISQ (In-Situ Quantization) - Quantize models at load time
- UQFF Format - Pre-quantized model format | Layout
- Topology - Per-layer quantization and device mapping
- Importance Matrix - Improve ISQ accuracy
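To make the quantization docs above concrete, here is a toy sketch of symmetric int8 weight quantization, the kind of per-tensor transform that load-time (ISQ-style) quantization applies. This is pure-Python concept code, not mistral.rs internals:

```python
# Concept sketch: symmetric int8 quantization of one weight tensor.
# Each weight is mapped to an integer in [-128, 127] via a per-tensor scale.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale 0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Per-element reconstruction error is bounded by scale / 2.
```

An importance matrix refines the idea by weighting which values matter most when choosing quantization parameters, which is why it improves ISQ accuracy.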
## Adapters & Model Customization
- Adapter Models - LoRA and X-LoRA support
- LoRA/X-LoRA Examples
- Non-Granular Scalings - X-LoRA optimization
- AnyMoE - Create MoE models from dense models
- MatFormer - Dynamic model sizing
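The LoRA math behind the adapter docs above is compact enough to sketch: the adapted weight is the frozen base weight plus a scaled low-rank product, `W + (alpha / r) * B @ A`. A toy pure-Python illustration (sizes and values are arbitrary):

```python
# Concept sketch of a LoRA update: W_adapted = W + (alpha / r) * B @ A,
# where A is (r x k) and B is (d x r), so the update has rank at most r.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, k, r, alpha = 2, 2, 1, 2.0
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d x k)
B = [[0.5], [0.25]]            # learned factor (d x r)
A = [[1.0, 2.0]]               # learned factor (r x k)

delta = matmul(B, A)           # rank-1 update (d x k)
scale = alpha / r
W_adapted = [[W[i][j] + scale * delta[i][j] for j in range(k)]
             for i in range(d)]
```

X-LoRA extends this by mixing several such adapters with learned, token-dependent scalings.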
## Performance & Hardware
- Device Mapping - Multi-GPU and CPU offloading
- PagedAttention - Efficient KV cache management
- Speculative Decoding - Accelerate generation with draft models
- Flash Attention - Accelerated attention
- MLA - Multi-head Latent Attention
- Distributed Inference
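PagedAttention's core idea is to store the KV cache in fixed-size blocks handed out from a shared pool, instead of one contiguous buffer per sequence. A toy sketch of that bookkeeping (block and pool sizes are arbitrary; this illustrates the idea, not the engine's implementation):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    """Toy free-list allocator mirroring PagedAttention-style bookkeeping."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids

    def blocks_needed(self, seq_len):
        return -(-seq_len // BLOCK_SIZE)  # ceiling division

    def allocate(self, seq_len):
        n = self.blocks_needed(seq_len)
        if n > len(self.free):
            raise MemoryError("KV cache exhausted")
        # The returned block table maps logical block index -> physical block.
        return [self.free.pop() for _ in range(n)]

alloc = BlockAllocator(num_blocks=64)
table = alloc.allocate(seq_len=40)  # 40 tokens fit in 3 blocks of 16
```

Because sequences only hold whole blocks, memory fragmentation is bounded by one partially filled block per sequence, which is what makes high batch counts affordable.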
## Features
- Tool Calling - Function calling support
- Web Search - Integrated web search
- Chat Templates - Template customization
- Sampling Options - Generation parameters
- TOML Selector - Model selection syntax
- Multi-Model Support - Load multiple models
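Tool calling uses the OpenAI function-calling schema: requests carry JSON tool definitions, and the model may answer with a structured tool call. A sketch of one tool definition (the `get_weather` function and its parameters are made up for illustration; see the Tool Calling doc for specifics):

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
            },
            "required": ["city"],
        },
    },
}

# This fragment is merged into a chat-completion request body.
request_fragment = {"tools": [get_weather_tool], "tool_choice": "auto"}
encoded = json.dumps(request_fragment)
```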
## MCP (Model Context Protocol)
- MCP Client - Connect to external tools
- MCP Server - Serve models over MCP
- MCP Configuration
- MCP Transports
- MCP Advanced Usage
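Whatever transport is used, MCP traffic is JSON-RPC 2.0, so the client and server docs above ultimately reduce to exchanging envelopes like this one. The `tools/list` method comes from the MCP specification; the request id is arbitrary:

```python
import json

# JSON-RPC 2.0 envelope for an MCP request asking a server for its tools.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

wire = json.dumps(request)
```

The server replies with a matching-id envelope whose result lists the tools it exposes; `tools/call` then invokes one of them.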
## Reference
- Configuration - Environment variables and server defaults
- Engine Internals - Engine behaviors and recovery
- Supported Models - Complete compatibility tables
## Contributing
See the main README for contribution guidelines.