Serve multiple models from one process

mistralrs serve -m <model> loads exactly one model. To host multiple models in one server, use a TOML config and mistralrs from-config.

Starting a multi-model server

Create models.toml:

command = "serve"
default_model_id = "Qwen/Qwen3-4B"

[server]
host = "0.0.0.0"
port = 1234

[[models]]
model_id = "Qwen/Qwen3-4B"

[models.quantization]
quant = "4"

[[models]]
model_id = "google/gemma-4-E4B-it"

[models.quantization]
quant = "4"

quant = "4" (same as --quant 4) is the general-purpose entry point: it prefers a prebuilt UQFF when one is available and otherwise falls back to 4-bit ISQ (in-situ quantization) at load time. Use isq = "4" (same as --isq 4) to request ISQ explicitly; omit both to load at full precision.

Start with from-config:

mistralrs from-config -f models.toml

Routing rules:

Each [[models]] entry is one loaded model, running on its own engine.
A request targets a model by its model_id.
default_model_id (if set) must match one of the model_ids; it is used when a request omits model or sends "default".

kind is optional and defaults to auto; set it only when you need to force a loader.

Routing a request to a specific model

The model field of a request selects the target by id. Omitting it (or passing "default") selects the default_model_id; if none is configured, the request fails.

With curl:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

With the OpenAI Python client pointed at a running multi-model server, set model to the target id:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

The in-process Python Runner loads a single model; from Python, multiple models are available only through the server.

In Rust, every request method on Model has a _with_model variant taking an optional id; None selects the default:

let response = model
    .send_chat_request_with_model(messages, Some("gemma-vision"))
    .await?;

The id is the model's model_id, or its alias when added with add_model_with_alias. See the MultiModelBuilder below for loading the models.

Listing loaded models

curl http://localhost:1234/v1/models

Each entry includes id, object, created, owned_by, plus optional status (loaded/unloaded/reloading), tools_available, and mcp_tools_count/mcp_servers_connected for MCP (Model Context Protocol) servers.
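As a rough sketch of consuming that listing, the snippet below maps each model id to its status. It assumes the entries arrive in an OpenAI-style envelope under a top-level "data" key, which the text above does not state; the sample body and its field values are made up for illustration, and summarize_models is a hypothetical helper name.

```python
import json

# Hypothetical response body; values are illustrative, not real server output.
SAMPLE = """
{
  "object": "list",
  "data": [
    {"id": "Qwen/Qwen3-4B", "object": "model", "created": 0,
     "owned_by": "local", "status": "loaded"},
    {"id": "google/gemma-4-E4B-it", "object": "model", "created": 0,
     "owned_by": "local", "status": "unloaded"},
    {"id": "no-status-model", "object": "model", "created": 0,
     "owned_by": "local"}
  ]
}
"""


def summarize_models(body: str) -> dict[str, str]:
    """Map each model id to its status, or "unknown" when status is absent."""
    payload = json.loads(body)
    # status is optional, so fall back to a placeholder instead of raising.
    return {entry["id"]: entry.get("status", "unknown") for entry in payload["data"]}


for model_id, status in summarize_models(SAMPLE).items():
    print(f"{model_id}: {status}")
```

Because status is optional, the lookup uses a fallback rather than indexing the field directly.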

Unloading and reloading on demand

POST /v1/models/unload frees a model's memory, POST /v1/models/reload brings it back, and POST /v1/models/status queries its state; each takes {"model_id": "..."}. Request and response schemas are in the generated HTTP API reference.
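A minimal standard-library sketch of calling these endpoints follows. The host and port come from the models.toml example; build_model_request and send are hypothetical helper names, and treating the reply as JSON is an assumption, since the response schemas live in the HTTP API reference. The snippet only builds a request; send needs a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234"  # host and port from the models.toml example


def build_model_request(action: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/models/<action>."""
    if action not in ("unload", "reload", "status"):
        raise ValueError(f"unsupported action: {action}")
    body = json.dumps({"model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/models/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the reply (assumed to be JSON)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


req = build_model_request("unload", "google/gemma-4-E4B-it")
print(req.get_method(), req.full_url)
# With a server running, one could then call: send(req)
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.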

Multi-model from code (Rust)

In-process multi-model loading is Rust-only (MultiModelBuilder); the Python Runner loads a single model, as does mistralrs serve -m. Route per request with the _with_model methods shown above.

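use mistralrs::{IsqType, MultiModelBuilder, MultimodalModelBuilder, TextModelBuilder};

// Load two models behind one handle; each is quantized to Q4K with ISQ at load time.
let model = MultiModelBuilder::new()
    // Addressed by its model_id, "Qwen/Qwen3-4B".
    .add_model(TextModelBuilder::new("Qwen/Qwen3-4B").with_isq(IsqType::Q4K))
    // Addressed by the alias "gemma-vision".
    .add_model_with_alias(
        "gemma-vision",
        MultimodalModelBuilder::new("google/gemma-4-E4B-it").with_isq(IsqType::Q4K),
    )
    // Used when a _with_model call passes None.
    .with_default_model("Qwen/Qwen3-4B")
    .build()
    .await?;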

When setting [models.device] in TOML, the cpu setting must be the same in every model entry.