Run mistralrs in Docker

The published images ship the unified mistralrs binary as their entrypoint, so any CLI subcommand works directly: serve, run, bench, quantize. Running a container with no arguments prints the CLI help.

docker run --rm -p 1234:1234 -v hf-cache:/data -e HF_TOKEN=<token> \
  ghcr.io/ericlbuehler/mistral.rs:latest \
  serve -m Qwen/Qwen3-4B

:latest is the CPU image. For NVIDIA GPUs, choose a CUDA tag and add --gpus all:

docker run --rm --gpus all -p 1234:1234 -v hf-cache:/data \
  ghcr.io/ericlbuehler/mistral.rs:cuda128-sm89-latest \
  serve -m Qwen/Qwen3-4B

The host needs the NVIDIA Container Toolkit; see NVIDIA’s install guide. To pin a specific GPU: --gpus '"device=0"'.

Published tags

All images live at ghcr.io/ericlbuehler/mistral.rs (package page).

CPU: latest (alias of cpu-latest), cpu-latest, cpu-X.Y.Z.
CUDA: cuda128-sm{cc}-latest, cuda129-sm121-latest, cuda130-sm{cc}-latest, cuda131-sm{cc}-latest, cuda132-sm{cc}-latest, cuda133-sm90-latest and matching X.Y.Z version tags.
CUDA legacy aliases: cuda-sm{cc}-latest, cuda-sm{cc}-X.Y.Z point at the cuda131 image.

Choose the CUDA lane from the CUDA version shown by nvidia-smi:

Driver reports	Use
CUDA 13.3+ on Hopper / `sm90`	`cuda133-sm90`
CUDA 13.2+ on Ampere/Ada / `sm80`, `sm86`, `sm89`	`cuda132-sm{cc}`
CUDA 13.1+ on Blackwell / `sm100`, `sm120`, `sm121`	`cuda131-sm{cc}`
CUDA 13.0+	`cuda130-sm{cc}`
CUDA 12.9+ on GB10 / `sm121`	`cuda129-sm121`
CUDA 12.8+	`cuda128-sm{cc}`

cuTile is included only on lanes whose CUDA toolkit supports that SM.

CUDA compute capability variants (SM80+):

80 (A100)
86 (A-series workstation/RTX 30)
89 (RTX 40/L4)
90 (H100)
100 (B200)
120 (RTX 50)
121 (DGX Spark)

See hardware support for the full GPU mapping.

The CPU image and Grace CUDA images (90, 100, 121) are multi-arch (amd64 + arm64). Docker picks the right architecture automatically. The other CUDA tags are x86_64 only.

The *-latest tags publish on releases and on manual CI dispatch from master; version tags pin a release.

For production, pin a version or sha tag rather than *-latest. Model ids also float: -m Qwen/Qwen3-4B resolves to whatever revision is tagged main at download time. The CLI has no revision flag; to pin a revision, use the Rust SDK’s with_hf_revision.

Image contract

Entrypoint is the mistralrs binary; pass a subcommand and its flags as the container command.
mistralrs serve listens on port 1234 by default (the image’s EXPOSEd port). To change it, change the flag and the mapping together: serve -p 8080 with -p 8080:8080. There is no PORT environment variable.
HF_HOME=/data is set in the image: mount a volume at /data to persist downloaded weights (they land in /data/hub). HF authentication for gated models: -e HF_TOKEN=<token>.
Chat templates ship at /chat_templates for models that need one: --chat-template /chat_templates/<file>.json.

Building an image

From a repository checkout:

# CPU
docker build -t mistralrs:latest -f Dockerfile .

# CUDA (set the compute capability for your GPU)
docker build -t mistralrs:cuda -f Dockerfile.cuda-all \
  --build-arg CUDA_COMPUTE_CAP=89 .

Dockerfile.cuda-all accepts CUDA_COMPUTE_CAP, BASE_TAG, and WITH_FEATURES build args. The default base is CUDA 12.8.1 and default features are cuda,cudnn; CI builds add flash-attn, and release images add cutile on supported CUDA/SM pairs.
Dockerfile.cuda-13.0-ubi9 is a Red Hat UBI 9 variant for air-gapped and enterprise deployments.
The first CUDA build is slow because flash-attention compilation takes a while; later builds use the layer cache.

Production deployment notes

Persist the cache. Weights are large enough that re-downloading on every restart is wasteful. Mount a named volume or host path at /data.

Health check. /health returns 200 when the server is up. Add a Docker healthcheck:

HEALTHCHECK --interval=30s --timeout=5s --start-period=180s \
  CMD curl -fsS http://localhost:1234/health || exit 1

The generous --start-period matters: first-run model loading can take minutes.

Resource limits. Set --memory and --gpus on docker run to bound the container’s resources.

Video input. Install FFmpeg inside the image when serving video-capable models. See set up video input for the Docker snippet and runtime check.

Kubernetes

The pieces above translate directly:

Use a Deployment with a readiness probe hitting /health (or a model-aware check; see the production checklist).
Mount a PersistentVolumeClaim at /data for the Hugging Face cache.
Use the NVIDIA device plugin and a nvidia.com/gpu resource request for CUDA.
Use an initContainer to pre-download weights for fast pod startup.

There is no official Helm chart. Contributions welcome.