Serve multiple models from one process
mistralrs serve -m <model> loads exactly one model. To host multiple models in one server, use a TOML config and mistralrs from-config.
Starting a multi-model server
Create models.toml:

    command = "serve"
    default_model_id = "Qwen/Qwen3-4B"

    [server]
    host = "0.0.0.0"
    port = 1234

    [[models]]
    model_id = "Qwen/Qwen3-4B"

    [models.quantization]
    quant = "4"

    [[models]]
    model_id = "google/gemma-4-E4B-it"

    [models.quantization]
    quant = "4"

quant = "4" is the front door (same as --quant 4): it prefers a prebuilt UQFF and falls back to 4-bit ISQ (in-situ quantization) at load time. Use isq = "4" (same as --isq 4) for explicit ISQ; drop both to load full precision.
Start with from-config:
    mistralrs from-config -f models.toml

Routing rules:

- Each [[models]] entry is one loaded model, running on its own engine.
- A request targets a model by its model_id.
- default_model_id (if set) must match one of the model_ids. It is used when a request omits model or sends "default".
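The routing rules can be pictured with a small sketch. This is an illustration of the documented behavior, not the server's implementation: resolve_model is a hypothetical name, and rejecting an id that is not loaded is an assumption about how such a request would be handled.

```python
def resolve_model(requested, loaded_ids, default_model_id=None):
    """Return the model id a request targets, following the routing rules above."""
    # A missing model field and the literal "default" both select the default.
    if requested is None or requested == "default":
        if default_model_id is None:
            raise LookupError("no model given and no default_model_id configured")
        return default_model_id
    # Assumption: an id that matches no loaded model is rejected.
    if requested not in loaded_ids:
        raise LookupError(f"unknown model: {requested!r}")
    return requested
```

With the two models from models.toml loaded and Qwen/Qwen3-4B as the default, resolve_model(None, ...) and resolve_model("default", ...) both return Qwen/Qwen3-4B, while an explicit id is returned unchanged.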
kind is optional and defaults to auto; set it only when you need to force a loader.
Routing a request to a specific model
Target a model by id. Omitting it (or passing "default") selects the default_model_id, or fails if none is configured.

The model field selects the target:

    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "google/gemma-4-E4B-it",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

Against a running multi-model server, an OpenAI client sets model to the target id:
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")

    response = client.chat.completions.create(
        model="google/gemma-4-E4B-it",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

The in-process Python Runner loads a single model; from Python, multi-model loading is available only through the server.
In Rust, every request method on Model has a _with_model variant taking an optional id; None selects the default:

    let response = model
        .send_chat_request_with_model(messages, Some("gemma-vision"))
        .await?;

The id is the model's model_id, or its alias when the model was added with add_model_with_alias. See the MultiModelBuilder example below for how the models are loaded.
Listing loaded models
    curl http://localhost:1234/v1/models

Each entry includes id, object, created, and owned_by, plus optional status (loaded/unloaded/reloading), tools_available, and mcp_tools_count/mcp_servers_connected for MCP (Model Context Protocol) servers.
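As a rough sketch of consuming that listing, the helper below pulls the id and status out of a response body. It assumes the entries sit under a top-level data key, as in the OpenAI list format; summarize_models is a hypothetical name, and SAMPLE is hand-written illustrative data, not captured server output.

```python
import json


def summarize_models(payload: dict) -> list[tuple[str, str]]:
    """Return (id, status) pairs; status is optional, so fall back to "unknown"."""
    # Assumption: entries are wrapped in a "data" list, OpenAI-style.
    return [(m["id"], m.get("status", "unknown")) for m in payload.get("data", [])]


# Illustrative payload only; the second entry omits the optional status field.
SAMPLE = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "Qwen/Qwen3-4B", "object": "model", "created": 0,
     "owned_by": "local", "status": "loaded"},
    {"id": "google/gemma-4-E4B-it", "object": "model", "created": 0,
     "owned_by": "local"}
  ]
}
""")
```

Because status is optional, the helper reads it with a fallback rather than indexing it directly.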
Unloading and reloading on demand
POST /v1/models/unload frees a model's memory, POST /v1/models/reload brings it back, and POST /v1/models/status queries its state; each takes {"model_id": "..."}. Request and response schemas are in the generated HTTP API reference.
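A minimal sketch of building these calls with Python's standard library follows. model_request and BASE_URL are hypothetical names, and the base URL assumes the port from models.toml. The helper only constructs the request; response handling is left out because the response schema is not reproduced here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234"  # assumption: the [server] port from models.toml


def model_request(action: str, model_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /v1/models/{unload,reload,status}."""
    if action not in {"unload", "reload", "status"}:
        raise ValueError(f"unsupported action: {action!r}")
    # All three endpoints take the same JSON body.
    body = json.dumps({"model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/models/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Against a running server, the request can be sent with urllib.request.urlopen(model_request("unload", "google/gemma-4-E4B-it")), and the same call with "reload" brings the model back.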
Multi-model from code (Rust)
In-process multi-model loading is Rust-only (MultiModelBuilder); the Python Runner and CLI load a single model. Route per request with the _with_model methods shown above.
    use mistralrs::{IsqType, MultiModelBuilder, TextModelBuilder, MultimodalModelBuilder};

    let model = MultiModelBuilder::new()
        .add_model(TextModelBuilder::new("Qwen/Qwen3-4B").with_isq(IsqType::Q4K))
        .add_model_with_alias(
            "gemma-vision",
            MultimodalModelBuilder::new("google/gemma-4-E4B-it").with_isq(IsqType::Q4K),
        )
        .with_default_model("Qwen/Qwen3-4B")
        .build()
        .await?;

When setting [models.device] in TOML, cpu must be consistent across every model entry.