
Production checklist

Work through this list before a mistralrs serve deployment receives traffic from users or another service. Each item links to the page that owns the details.

  • Bind to loopback unless the host network is private: mistralrs serve --host 127.0.0.1 --port 8080 -m <model>.
  • Terminate TLS and validate credentials in a reverse proxy (nginx, Caddy, Traefik). mistral.rs has no built-in authentication; the server accepts Authorization: Bearer ... headers from OpenAI clients but never validates them.
  • Know the defaults you inherit: a 50 MB request body limit and CORS that allows any origin. Neither is CLI-configurable; to set custom values, embed mistralrs-server-core’s router builder (see embed in axum).
  • Move the launch command into a TOML config and start with mistralrs from-config -f config.toml. See the TOML config reference.
  • Pin versions: use a container version tag (not *-latest) per the Docker guide, and be aware that model ids resolve to the main revision at download time.
  • Persist the model cache across restarts (volume at HF_HOME).
  • Run mistralrs doctor on the target host to confirm the expected accelerator and compiled features.

  • Run mistralrs tune -m <model> for a starting quantization and memory plan (it estimates from configs; it does not load or benchmark the model).

  • Set --max-seqs deliberately (default 32).

  • Size the paged-attention KV pool explicitly instead of relying on the implicit 90% budget, with one of:

    • --pa-context-len
    • --pa-memory-mb
    • --pa-memory-fraction

    See throughput tuning.
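
    As a rough aid for choosing an explicit pool size, the sketch below estimates a worst-case KV footprint from a model's attention shape. It uses the standard KV-cache arithmetic (one key and one value vector per layer, per KV head, per token) and is not taken from mistral.rs itself. The KvShape type, the function names, and the example model dimensions are all hypothetical; read the real values from the model's config. Whether --pa-memory-mb counts MB or MiB is also an assumption here, so treat the result as a starting point and leave headroom.

    ```python
    from dataclasses import dataclass


    @dataclass(frozen=True)
    class KvShape:
        """Attention shape of a hypothetical model; take real values from its config."""
        num_layers: int
        num_kv_heads: int
        head_dim: int
        bytes_per_element: int  # 2 for fp16/bf16 KV entries


    def kv_bytes_per_token(shape: KvShape) -> int:
        # Each layer stores one key and one value vector per KV head: factor 2.
        return (2 * shape.num_layers * shape.num_kv_heads
                * shape.head_dim * shape.bytes_per_element)


    def kv_pool_mib(shape: KvShape, context_len: int, max_seqs: int) -> int:
        """Worst-case pool size in MiB if every sequence fills its context."""
        total_bytes = kv_bytes_per_token(shape) * context_len * max_seqs
        return -(-total_bytes // (1024 * 1024))  # ceiling division


    # Hypothetical 8B-class model with grouped-query attention (assumed numbers).
    EXAMPLE = KvShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)

    if __name__ == "__main__":
        print(kv_bytes_per_token(EXAMPLE))                           # 131072
        print(kv_pool_mib(EXAMPLE, context_len=4096, max_seqs=32))   # 16384
    ```

    With these assumed dimensions, 32 concurrent sequences (the --max-seqs default) at 4096 tokens each need about 16 GiB in the worst case. Paged attention allocates blocks on demand, so this is an upper bound rather than a requirement; a smaller pool trades peak concurrency for memory.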

  • Liveness: GET /health returns 200 when the server is listening. It does not verify model load.
  • Readiness: GET /v1/models includes a per-model status field (loaded, unloaded, reloading). Probe for the specific model id the caller needs, not just process liveness.
  • Scrape GET /metrics: Prometheus text format with per-request counters, latency histograms, in-flight request gauges, and request-body histograms. Details live in observability and the HTTP API reference.
  • Give startup probes a generous window; first-run model loading can take minutes.
  • Default output is curated INFO startup logs, dependency warnings, and HTTP access logs for non-housekeeping requests. See observability for request ids, Prometheus metrics, and access-log controls.
  • Use -v for debug, -vv for trace, or RUST_LOG for an explicit filter.
  • Only use -l/--log <path> where request and response bodies can be stored safely; it logs payloads, not just metadata.
  • Sessions are in-memory with a 30-minute idle TTL and 128-entry capacity; they do not survive restarts. Export with GET /v1/sessions/{id} before shutdown and re-import with PUT /v1/sessions/{id} if persistence is required. See sessions.
  • For serving several models from one process, use mistralrs from-config with [[models]] entries and decide the default model id. See multiple models.
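
The readiness item above asks probes to check a specific model id rather than process liveness. A minimal sketch of such a check follows. It assumes GET /v1/models returns an OpenAI-style list object of the form {"data": [{"id": ..., "status": ...}]}; the exact response shape is not confirmed by this page, so verify it against the HTTP API reference. The function names model_is_ready and probe are hypothetical.

```python
import json
import urllib.request


def model_is_ready(payload, model_id: str) -> bool:
    """Return True only if model_id is listed with status 'loaded'."""
    # Assumed shape: {"data": [{"id": "...", "status": "loaded"}, ...]}
    if not isinstance(payload, dict):
        return False
    for entry in payload.get("data", []):
        if isinstance(entry, dict) and entry.get("id") == model_id:
            return entry.get("status") == "loaded"
    return False  # model id not listed at all


def probe(base_url: str, model_id: str, timeout: float = 5.0) -> bool:
    """Fetch /v1/models and check one model; any error counts as not ready."""
    try:
        url = f"{base_url}/v1/models"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_is_ready(json.load(resp), model_id)
    except (OSError, ValueError):
        # Connection failures, HTTP errors, and malformed JSON all land here.
        return False
```

A probe wrapper would call probe("http://127.0.0.1:8080", "<model>") and exit non-zero on False. Treating unloaded and reloading as not ready keeps traffic away from a model that would otherwise stall or fail its first requests.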