Serve multiple models from one process

mistralrs serve -m <model> loads exactly one model. To host multiple models in one server, use a TOML config and mistralrs from-config.

Create models.toml:

command = "serve"
default_model_id = "Qwen/Qwen3-4B"

[server]
host = "0.0.0.0"
port = 1234

[[models]]
model_id = "Qwen/Qwen3-4B"

[models.quantization]
quant = "4"

[[models]]
model_id = "google/gemma-4-E4B-it"

[models.quantization]
quant = "4"

quant = "4" (same as --quant 4) is the high-level option: it prefers a prebuilt UQFF and falls back to 4-bit ISQ (in-situ quantization) at load time. Use isq = "4" (same as --isq 4) to request ISQ explicitly; omit both to load at full precision.
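
The fallback order described above can be sketched as a small decision function. This is only an illustration of the documented behavior, not mistral.rs code: QuantSetting, LoadPlan, plan_load, and the uqff_available flag are hypothetical names introduced here.

```rust
/// Hypothetical model of the quantization keys described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum QuantSetting {
    Quant(u8), // quant = "4" / --quant 4
    Isq(u8),   // isq = "4" / --isq 4
    Unset,     // neither key present
}

/// Hypothetical description of how the weights end up being loaded.
#[derive(Debug, PartialEq)]
enum LoadPlan {
    PrebuiltUqff { bits: u8 },
    InSituQuant { bits: u8 },
    FullPrecision,
}

/// Assumption: `uqff_available` says whether a prebuilt UQFF exists for the model.
fn plan_load(setting: QuantSetting, uqff_available: bool) -> LoadPlan {
    match setting {
        // quant prefers a prebuilt UQFF when one exists...
        QuantSetting::Quant(bits) if uqff_available => LoadPlan::PrebuiltUqff { bits },
        // ...and otherwise falls back to ISQ; isq always asks for ISQ.
        QuantSetting::Quant(bits) | QuantSetting::Isq(bits) => LoadPlan::InSituQuant { bits },
        // With neither key set, the model loads at full precision.
        QuantSetting::Unset => LoadPlan::FullPrecision,
    }
}
```

Modeling the setting as one enum keeps the sketch from having to guess what happens when both keys are set, which the text above does not specify.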

Start with from-config:

mistralrs from-config -f models.toml

Routing rules:

  • Each [[models]] entry is one loaded model, running on its own engine.
  • A request targets a model by its model_id.
  • default_model_id (if set) must match one of the model_ids.
  • The default model is used when a request omits model or sends "default".

kind is optional and defaults to auto; set it only when you need to force a loader.

Target a model by id. Omitting it (or passing "default") selects the default_model_id, or fails if none is configured.
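
The routing rules can be sketched as a standalone function. This is a simplified illustration, not the server's implementation: resolve_model and default_is_valid are hypothetical names, and rejecting an unknown id is an assumption, since the rules above do not cover that case.

```rust
/// Check the rule that default_model_id, if set, matches one of the model_ids.
fn default_is_valid(loaded_ids: &[&str], default_model_id: Option<&str>) -> bool {
    default_model_id.map_or(true, |d| loaded_ids.contains(&d))
}

/// Resolve which model a request targets.
/// `requested` is the request's `model` field, if it was sent.
fn resolve_model<'a>(
    requested: Option<&'a str>,
    loaded_ids: &[&'a str],
    default_model_id: Option<&'a str>,
) -> Result<&'a str, String> {
    match requested {
        // An omitted model or the literal "default" selects default_model_id,
        // and fails if none is configured.
        None | Some("default") => default_model_id
            .ok_or_else(|| "no model given and no default_model_id configured".to_string()),
        // Otherwise the request targets a model by its id.
        Some(id) if loaded_ids.contains(&id) => Ok(id),
        // Assumption: an id that matches no entry is rejected.
        Some(id) => Err(format!("unknown model id: {}", id)),
    }
}
```

For example, resolve_model(None, &ids, Some("Qwen/Qwen3-4B")) yields the default model, while the same call with no default configured returns an error.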

The model field selects the target:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

List the models the server exposes:

curl http://localhost:1234/v1/models

Each entry includes id, object, created, owned_by, plus optional status (loaded/unloaded/reloading), tools_available, and mcp_tools_count/mcp_servers_connected for MCP (Model Context Protocol) servers.

POST /v1/models/unload frees a model's memory, POST /v1/models/reload brings it back, and POST /v1/models/status queries its state; each takes {"model_id": "..."}. Request and response schemas are in the generated HTTP API reference.
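
As a rough sketch of calling these endpoints from Rust with only the standard library, the helpers below build and send a POST carrying {"model_id": "..."}. The names build_model_request and post_model_action are hypothetical, the response is returned raw because its schema is not described here, and a real client would more likely use an HTTP crate.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Build a raw HTTP/1.1 POST whose JSON body is {"model_id": "..."}.
/// Assumption: `model_id` needs no JSON escaping (no quotes or backslashes).
fn build_model_request(host: &str, path: &str, model_id: &str) -> String {
    let body = format!("{{\"model_id\": \"{}\"}}", model_id);
    format!(
        "POST {} HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        path,
        host,
        body.len(),
        body
    )
}

/// Send the request to `addr` (for example "localhost:1234") and return the
/// raw response text: status line, headers, and body, unparsed.
fn post_model_action(addr: &str, path: &str, model_id: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(build_model_request(addr, path, model_id).as_bytes())?;
    let mut response = String::new();
    // "Connection: close" lets us read until the server closes the socket.
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

With the server from models.toml running, post_model_action("localhost:1234", "/v1/models/unload", "Qwen/Qwen3-4B") would free that model, and the same call with /v1/models/reload or /v1/models/status would reload or query it.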

In-process multi-model loading is Rust-only (MultiModelBuilder); the Python Runner and CLI load a single model. Route each request with the _with_model methods.

use mistralrs::{IsqType, MultiModelBuilder, MultimodalModelBuilder, TextModelBuilder};

// Must run inside an async function that returns a Result.
let model = MultiModelBuilder::new()
    .add_model(TextModelBuilder::new("Qwen/Qwen3-4B").with_isq(IsqType::Q4K))
    // The second model is registered under the alias "gemma-vision".
    .add_model_with_alias(
        "gemma-vision",
        MultimodalModelBuilder::new("google/gemma-4-E4B-it").with_isq(IsqType::Q4K),
    )
    .with_default_model("Qwen/Qwen3-4B")
    .build()
    .await?;

When setting [models.device] in TOML, cpu must be consistent across every model entry.
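
The consistency rule can be checked before starting the server with something like the sketch below. ModelEntry and check_cpu_consistent are hypothetical names, and treating cpu as a boolean that every entry sets is an assumption made for illustration.

```rust
/// Hypothetical stand-in for one [[models]] entry, with only the fields needed here.
struct ModelEntry {
    model_id: String,
    /// Assumption: mirrors the `cpu` key of [models.device] as a boolean.
    cpu: bool,
}

/// Return Ok if every entry agrees on `cpu`; otherwise report the first
/// entry whose value differs from the first entry's.
fn check_cpu_consistent(models: &[ModelEntry]) -> Result<(), String> {
    let first = match models.first() {
        Some(m) => m,
        None => return Ok(()), // nothing to compare
    };
    match models.iter().find(|m| m.cpu != first.cpu) {
        Some(m) => Err(format!(
            "{} sets cpu = {}, but {} sets cpu = {}",
            m.model_id, m.cpu, first.model_id, first.cpu
        )),
        None => Ok(()),
    }
}
```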