Skip to content

Serve multiple models from one process

mistralrs serve -m <model> loads exactly one model. To host multiple models in one server, use a TOML config and mistralrs from-config.

Create models.toml:

command = "serve"
default_model_id = "Qwen/Qwen3-4B"
[server]
host = "0.0.0.0"
port = 1234
[[models]]
kind = "text"
model_id = "Qwen/Qwen3-4B"
[models.quantization]
in_situ_quant = "4"
[[models]]
kind = "multimodal"
model_id = "google/gemma-4-E4B-it"
[models.quantization]
in_situ_quant = "4"

Start with from-config:

Terminal window
mistralrs from-config -f models.toml

Each [[models]] entry is one loaded model. The request id is the entry’s model_id. default_model_id (if set) must match a model_id and is used when a request omits model or sends "default".

Mixed kind values coexist. Each model runs on its own engine.

The model field selects the target:

Terminal window
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-E4B-it",
"messages": [{"role": "user", "content": "Hello!"}]
}'

Omitting model or passing "default" selects the default_model_id (or fails if none is configured).

Terminal window
curl http://localhost:1234/v1/models

Each entry includes id, object, created, owned_by, plus optional status (loaded/unloaded/reloading), tools_available, mcp_tools_count, mcp_servers_connected.

Terminal window
curl -X POST http://localhost:1234/v1/models/unload \
-H "Content-Type: application/json" \
-d '{"model_id": "google/gemma-4-E4B-it"}'

Reload:

Terminal window
curl -X POST http://localhost:1234/v1/models/reload \
-H "Content-Type: application/json" \
-d '{"model_id": "google/gemma-4-E4B-it"}'

Rust:

use mistralrs::{IsqType, MultiModelBuilder, TextModelBuilder, MultimodalModelBuilder};
let model = MultiModelBuilder::new()
.add_model(TextModelBuilder::new("Qwen/Qwen3-4B").with_isq(IsqType::Q4K))
.add_model_with_alias(
"gemma-vision",
MultimodalModelBuilder::new("google/gemma-4-E4B-it").with_isq(IsqType::Q4K),
)
.with_default_model("Qwen/Qwen3-4B")
.build()
.await?;

Every request method on Model has a _with_model variant taking an optional id. None selects the default.

cpu in [models.device] must be consistent across every model entry.