Production checklist
Work through this list before a mistralrs serve deployment receives traffic from users or another service. Each item links to the page that owns the details.
Network and auth
- Bind to loopback unless the host network is private: mistralrs serve --host 127.0.0.1 --port 8080 -m <model>.
- Terminate TLS and validate credentials in a reverse proxy (nginx, Caddy, Traefik). mistral.rs has no built-in authentication; Authorization: Bearer ... headers from OpenAI clients are accepted but never validated by the server.
- Know the defaults you inherit: 50 MB request body limit, CORS allows any origin. Neither is CLI-configurable; embed mistralrs-server-core’s router builder for custom values (see embed in axum).
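The bind-address rule above can be turned into a small pre-flight check in a deploy script. This is a minimal sketch, not part of mistral.rs: the function name bind_is_acceptable and the network_is_private switch are hypothetical, and hostnames such as localhost are deliberately not resolved.

```python
import ipaddress


def bind_is_acceptable(host: str, network_is_private: bool = False) -> bool:
    """Return True if binding to `host` follows the checklist rule.

    Loopback is always fine. Anything else is accepted only when the operator
    has confirmed (network_is_private=True) that the host network is private.
    """
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (for example "localhost"); this sketch does not
        # resolve names, so it refuses rather than guesses.
        return False
    if addr.is_loopback:
        return True
    if addr.is_unspecified:
        # 0.0.0.0 and :: listen on every interface of the host.
        return network_is_private
    return network_is_private and addr.is_private


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "0.0.0.0", "10.0.0.5", "8.8.8.8"):
        print(candidate, bind_is_acceptable(candidate))
```

Refusing non-literal hostnames keeps the check deterministic; a stricter script could resolve the name and test every returned address.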
Reproducible startup
- Move the launch command into a TOML config and start with mistralrs from-config -f config.toml. See the TOML config reference.
- Pin versions: a container version tag (not *-latest) per the Docker guide, and be aware that model ids resolve to the main revision at download time.
- Persist the model cache across restarts (volume at HF_HOME).
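The version-pinning item lends itself to an automated check in CI. The sketch below flags container image references that are untagged or whose tag ends in latest; the function name and the image names in the usage are hypothetical, and the parsing is a simplification of the full image-reference grammar.

```python
def is_pinned_image(ref: str) -> bool:
    """Return True if a container image reference is pinned to a version.

    Assumptions of this sketch: a digest reference (name@sha256:...) counts as
    pinned, a missing tag means the implicit "latest", and any tag ending in
    "latest" (for example "cuda-latest") counts as floating.
    """
    if "@" in ref:
        return True
    # Only the last path segment can carry a tag; earlier segments may contain
    # a registry port such as "registry:5000".
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False
    tag = last_segment.split(":", 1)[1]
    return not tag.endswith("latest")


if __name__ == "__main__":
    # Hypothetical image names, used only to exercise the check.
    for image in ("example/server:cuda-latest", "example/server:1.2.3"):
        print(image, is_pinned_image(image))
```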
Resource sizing
- Run mistralrs doctor on the target host to confirm the expected accelerator and compiled features.
- Run mistralrs tune -m <model> for a starting quantization and memory plan (it estimates from configs; it does not load or benchmark the model).
- Set --max-seqs deliberately (default 32).
- Size the paged-attention KV pool explicitly instead of relying on the implicit 90% budget, with one of: --pa-context-len, --pa-memory-mb, or --pa-memory-fraction. See throughput tuning.
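As a back-of-the-envelope aid for the KV pool item above, the sketch below estimates a worst-case KV cache size. It assumes a standard transformer cache (one K and one V tensor per layer), an unquantized cache at 2 bytes per element, and every concurrent sequence reaching the full context length. The function name and the architecture numbers in the usage are hypothetical, and nothing here reflects how mistral.rs accounts for memory internally, so treat the result as a sanity check alongside mistralrs tune rather than a value to copy into a flag.

```python
def kv_pool_mib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    max_seqs: int,
    bytes_per_elem: int = 2,
) -> float:
    """Worst-case KV cache size in MiB for max_seqs full-length sequences."""
    # Each token stores one key and one value vector per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    total_bytes = bytes_per_token * context_len * max_seqs
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
    print(kv_pool_mib(32, 8, 128, context_len=4096, max_seqs=32))
```

With these example numbers each token costs 128 KiB, so 32 sequences of 4096 tokens come to 16384 MiB. Real traffic rarely fills every sequence, which is why an explicit, smaller pool is often a reasonable choice.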
Health, readiness, and metrics
- Liveness: GET /health returns 200 when the server is listening. It does not verify model load.
- Readiness: GET /v1/models includes a per-model status field (loaded, unloaded, reloading). Probe for the specific model id the caller needs, not just process liveness.
- Scrape GET /metrics: Prometheus text format with per-request counters, latency histograms, in-flight request gauges, and request-body histograms. Details live in observability and the HTTP API reference.
- Give startup probes a generous window; first-run model loading can take minutes.
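The readiness and startup-window items above can be combined into one polling helper. This is a minimal sketch under stated assumptions: the response body is assumed to follow the OpenAI list shape (a top-level data array whose entries carry id and status), which the checklist does not spell out, and the function names, the 600-second deadline, and the 5-second interval are illustrative choices.

```python
import json
import time
import urllib.request
from typing import Callable


def model_is_loaded(models_payload: dict, model_id: str) -> bool:
    """True only if the named model is present and its status is "loaded"."""
    # Assumed shape: {"data": [{"id": "...", "status": "loaded"}, ...]}
    for entry in models_payload.get("data", []):
        if entry.get("id") == model_id:
            return entry.get("status") == "loaded"
    return False


def fetch_models(base_url: str, timeout_s: float = 5.0) -> dict:
    """GET {base_url}/v1/models and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict],
    model_id: str,
    deadline_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the model reports loaded or the deadline passes."""
    start = time.monotonic()
    while True:
        try:
            if model_is_loaded(fetch(), model_id):
                return True
        except (OSError, ValueError):
            # Connection refused or timed out (OSError), or a body that is
            # not valid JSON (ValueError): keep waiting until the deadline.
            pass
        if time.monotonic() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A caller might run wait_until_ready(lambda: fetch_models("http://127.0.0.1:8080"), "<model>") before routing traffic. Passing fetch and sleep as parameters keeps the loop testable without a running server.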
Logging
- Default output is curated INFO startup logs, dependency warnings, and HTTP access logs for non-housekeeping requests. See observability for request ids, Prometheus metrics, and access-log controls.
- Use -v for debug, -vv for trace, or RUST_LOG for an explicit filter.
- Only use -l/--log <path> where request and response bodies can be stored safely; it logs payloads, not just metadata.
State across restarts
- Sessions are in-memory with a 30-minute idle TTL and 128-entry capacity; they do not survive restarts. Export with GET /v1/sessions/{id} before shutdown and re-import with PUT /v1/sessions/{id} if persistence is required. See sessions.
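The export and re-import step above might be scripted as follows. This sketch treats the exported session as opaque bytes and sends it back unchanged. The helper names are hypothetical, and the Content-Type: application/json header on the PUT is an assumption, since the checklist does not state the payload format; check the sessions page before relying on it.

```python
import urllib.request
from urllib.parse import quote


def session_url(base_url: str, session_id: str) -> str:
    """Build the /v1/sessions/{id} URL, percent-encoding the id."""
    return f"{base_url}/v1/sessions/{quote(session_id, safe='')}"


def export_request(base_url: str, session_id: str) -> urllib.request.Request:
    """Request that exports one session."""
    return urllib.request.Request(session_url(base_url, session_id), method="GET")


def import_request(base_url: str, session_id: str, body: bytes) -> urllib.request.Request:
    """Request that re-imports a previously exported session body."""
    return urllib.request.Request(
        session_url(base_url, session_id),
        data=body,
        method="PUT",
        # Assumption: the export is JSON; adjust if the server expects otherwise.
        headers={"Content-Type": "application/json"},
    )


def copy_session(src_base: str, dst_base: str, session_id: str, timeout_s: float = 30.0) -> int:
    """Export a session from one server and import it into another.

    Returns the HTTP status of the import; urllib raises HTTPError on 4xx/5xx.
    """
    with urllib.request.urlopen(export_request(src_base, session_id), timeout=timeout_s) as resp:
        body = resp.read()
    with urllib.request.urlopen(import_request(dst_base, session_id, body), timeout=timeout_s) as resp:
        return resp.status
```

For a restart of a single server, the exported body would be written to disk before shutdown and passed to import_request once the new process reports the model as loaded.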
Multi-model
- For serving several models from one process, use mistralrs from-config with [[models]] entries and decide the default model id. See multiple models.