Skip to content
mistral.rs
Search
Ctrl
K
Cancel
GitHub
Discord
Select theme
Dark
Light
Auto
Quickstart
User Guide
Serving
Serve an OpenAI-compatible API
Anthropic Messages API
Get structured output
Use the built-in web UI
Serve multiple models from one process
Use Codex and Claude Code
Models
Run any model
Model family notes
Send images, audio, and video
Set up video input
Speech models
Generate images with diffusion models
Use embedding models
Block-diffusion models
Quantization
Quantize a model
Use pre-quantized UQFF models
Online calibration
Agents & tools
Agents & tools
Build an agent
Tool calling
Enable code execution
Web search
Permissions & approvals
Agentic runtime for apps
Sessions
Connect to an MCP server
Expose mistralrs as an MCP server
Python SDK
Python SDK
Python SDK getting started
Stream tokens from Python
Rust SDK
Rust SDK
Rust SDK getting started
Stream chat responses from Rust
Embed mistralrs inside an Axum application
Customize
Customize
Chat templates
Sampling parameters
LoRA and X-LoRA adapters
Performance & scaling
Paged attention
Speculative decoding (MTP)
Distributed inference
Configure model topology
Throughput tuning
Deploy
Run mistralrs in Docker
Production checklist
Examples
Examples
python
agentic_tools
anymoe
anymoe_inference
anymoe_lora
code_execution
code_execution_approval
custom_search
custom_tool_call
deepseekr1
deepseekv2
dia
diffusion_gemma
embedding_gemma
flux
gemma3
gemma3n
gemma4
gguf
glm4_moe_lite
gpt_oss
granite
idefics2
idefics3
imatrix
isq
json_schema
lark
lark_llg
llama_vision
llama4
llava_next
llguidance
lora_zephyr
mcp_client
minicpmo_2_6
mistral3
mixture_of_quant_experts
multi_model_example
multimodal_auto_device_map
online_calibration
paged_attention
phi3v
phi3v_base64
phi3v_local_img
phi4mm
phi4mm_audio
plain
pydantic_schema
qwen2vl
qwen3
qwen3_5
qwen3_embedding
qwen3_next
qwen3_vl
regex
smollm3
smolvlm
streaming
test_multi_model
text_auto_device_map
token_source
tool_call
topology
web_search
xlora_gemma
xlora_zephyr
rust
advanced
agent
agent_streaming
anymoe
anymoe_lora
auto_device_map
batching
batching_embeddings
code_execution
code_execution_approval
code_execution_files
embeddings
error_handling
file_logging
grammar
json_schema
llguidance
logits_processor
lora
mcp_client
multi_model
paged_attn
perplexity
search_callback
tool_callback
tools
web_search
xlora
cookbook
agent
multiturn
rag
structured
getting-started
embedding
gguf
gguf_locally
multimodal
streaming
text_generation
models
asr
audio
diffusion
diffusion_gemma
multimodal
multimodal_models
multimodal_multiturn
speech
text_models
quantization
imatrix
isq
mixture_of_quant_experts
online_calibration
topology
uqff
uqff_multimodal
server
adapter_chat
agentic_tool_rounds
anthropic_agentic
anthropic_chat
anthropic_streaming
anthropic_tool_calling
chat
code_execution_approval
completion
dia
embedding
flux
gemma3
gemma3n
gemma4
gemma4_video
gpt_oss
idefics2
idefics3
json_schema
lark
llama_vision
llama4
llava
llava_next
llguidance
mcp_chat
minicpmo_2_6
mistral3
multi_model_chat
openai_response_format
phi3v
phi3v_base64
phi3v_local_img
phi4mm
phi4mm_audio
qwen2vl
qwen3
qwen3_5
qwen3_next
qwen3_vl
regex
responses
responses_audio
responses_vision
smollm3
stream_completion_bench
streaming
streaming_completion
streaming_tool_calling
tool_calling
tool_dispatch
web_search
Reference
cli
CLI reference
mistralrs serve
mistralrs run
mistralrs completions
mistralrs quantize
mistralrs doctor
mistralrs tune
mistralrs login
mistralrs cache
mistralrs bench
mistralrs from-config
mistralrs update
mistralrs uninstall
python
Runner
Which
Requests
Responses
Python API
Enums
Search
AnyMoE
Code execution
Agent approvals
Files
MCP
Auto-mapping
Reference
Cargo features
TOML configuration
Environment variables
Hardware support
HTTP API semantics
MCP configuration schema
OpenAI compatibility
Quantization types
Rust SDK reference
Sandbox
Supported models
Troubleshooting
UQFF format
HTTP API (generated)
Overview
Mistral.rs
Overview
calibration_apply
calibration_start
calibration_status
health
Axum handler for `GET /metrics`. Renders the Prometheus exposition format.
re_isq
resolve_agent_approval
Speech generation endpoint handler.
OpenAI-compatible chat completions endpoint handler.
OpenAI-compatible completions endpoint handler.
embeddings
list_files
get_file
delete_file
get_file_content
Image generation endpoint handler.
anthropic_messages
anthropic_count_tokens
models
reload_model
get_model_status
tune_model
unload_model
Create response endpoint - OpenResponses API
Get response by ID endpoint
Delete response by ID endpoint
Cancel response endpoint
GET `/v1/sessions/{session_id}`. 404 if the session doesn't exist.
PUT `/v1/sessions/{session_id}`. Replaces any existing session.
DELETE `/v1/sessions/{session_id}`. Idempotent: returns 200 either way.
system_doctor
system_info
Developer Guide
Developer Guide
Architecture
Build from source
MoE expert backends
The multimodal pipeline
Session memory
GitHub
Discord
Select theme
Dark
Light
Auto
Overview
Mistral.rs
0.8.3
Section titled “Mistral.rs 0.8.3”
Fast, flexible LLM inference.
Information
License: MIT
OpenAPI version:
3.1.0
Operations
Section titled “ Operations ”
POST
/calibration/apply
POST
/calibration/start
GET
/calibration/status
GET
/health
GET
/metrics
POST
/re_isq
POST
/v1/agent/approvals/{approval_id}
POST
/v1/audio/speech
POST
/v1/chat/completions
POST
/v1/completions
POST
/v1/embeddings
GET
/v1/files
GET
/v1/files/{id}
DELETE
/v1/files/{id}
GET
/v1/files/{id}/content
POST
/v1/images/generations
POST
/v1/messages
POST
/v1/messages/count_tokens
GET
/v1/models
POST
/v1/models/reload
POST
/v1/models/status
POST
/v1/models/tune
POST
/v1/models/unload
POST
/v1/responses
GET
/v1/responses/{response_id}
DELETE
/v1/responses/{response_id}
POST
/v1/responses/{response_id}/cancel
GET
/v1/sessions/{session_id}
PUT
/v1/sessions/{session_id}
DELETE
/v1/sessions/{session_id}
POST
/v1/system/doctor
GET
/v1/system/info