Run mistralrs in Docker
The published images ship the unified mistralrs binary as their entrypoint, so any CLI subcommand works directly: serve, run, bench, quantize. Running a container with no arguments prints the CLI help.

    docker run --rm -p 1234:1234 -v hf-cache:/data -e HF_TOKEN=<token> \
      ghcr.io/ericlbuehler/mistral.rs:latest \
      serve -m Qwen/Qwen3-4B

`:latest` is the CPU image. For NVIDIA GPUs, choose a CUDA tag and add `--gpus all`:

    docker run --rm --gpus all -p 1234:1234 -v hf-cache:/data \
      ghcr.io/ericlbuehler/mistral.rs:cuda128-sm89-latest \
      serve -m Qwen/Qwen3-4B

The host needs the NVIDIA Container Toolkit; see NVIDIA’s install guide. To pin a specific GPU: `--gpus '"device=0"'`.
Published tags
All images live at ghcr.io/ericlbuehler/mistral.rs (package page).

- CPU: `latest` (alias of `cpu-latest`), `cpu-latest`, `cpu-X.Y.Z`.
- CUDA: `cuda128-sm{cc}-latest`, `cuda129-sm121-latest`, `cuda130-sm{cc}-latest`, `cuda131-sm{cc}-latest`, `cuda132-sm{cc}-latest`, `cuda133-sm90-latest`, and matching `X.Y.Z` version tags.
- CUDA legacy aliases: `cuda-sm{cc}-latest` and `cuda-sm{cc}-X.Y.Z` point at the `cuda131` image.

Choose the CUDA lane from the CUDA version shown by nvidia-smi:
| Driver reports | Use |
|---|---|
| CUDA 13.3+ on Hopper / sm90 | `cuda133-sm90` |
| CUDA 13.2+ on Ampere/Ada / sm80, sm86, sm89 | `cuda132-sm{cc}` |
| CUDA 13.1+ on Blackwell / sm100, sm120, sm121 | `cuda131-sm{cc}` |
| CUDA 13.0+ | `cuda130-sm{cc}` |
| CUDA 12.9+ on GB10 / sm121 | `cuda129-sm121` |
| CUDA 12.8+ | `cuda128-sm{cc}` |

cuTile is included only on lanes whose CUDA toolkit supports that SM.
CUDA compute capability variants (SM80+):

- `80` (A100)
- `86` (A-series workstation/RTX 30)
- `89` (RTX 40/L4)
- `90` (H100)
- `100` (B200)
- `120` (RTX 50)
- `121` (DGX Spark)

See hardware support for the full GPU mapping.
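
Read top to bottom with the first matching row winning, the lane table reduces to a small lookup. The sketch below assumes that precedence and assumes the driver's CUDA version is given as `major.minor`. `pick_lane` is a hypothetical helper, not part of mistralrs, and it does not verify that the resulting tag is actually published for every compute capability.

```shell
# Hypothetical helper: map (driver CUDA version, compute capability) to an
# image lane, taking the first table row that matches.
# Usage: pick_lane <cuda major.minor> <cc>, e.g. pick_lane 12.8 89
pick_lane() (
  cuda="$1" cc="$2"
  # Encode the version as an integer: 13.1 -> 1301, 12.8 -> 1208.
  v=$(( ${cuda%%.*} * 100 + ${cuda#*.} ))

  # Group compute capabilities the way the table does.
  family=other
  case "$cc" in
    90)          family=hopper ;;
    80|86|89)    family=ampere_ada ;;
    100|120|121) family=blackwell ;;
  esac

  if   [ "$v" -ge 1303 ] && [ "$family" = hopper ];     then echo "cuda133-sm90"
  elif [ "$v" -ge 1302 ] && [ "$family" = ampere_ada ]; then echo "cuda132-sm$cc"
  elif [ "$v" -ge 1301 ] && [ "$family" = blackwell ];  then echo "cuda131-sm$cc"
  elif [ "$v" -ge 1300 ];                               then echo "cuda130-sm$cc"
  elif [ "$v" -ge 1209 ] && [ "$cc" = 121 ];            then echo "cuda129-sm121"
  elif [ "$v" -ge 1208 ];                               then echo "cuda128-sm$cc"
  else
    echo "no published lane for CUDA $cuda" >&2
    exit 1   # the function body is a subshell, so this only ends the helper
  fi
)
```

For example, `pick_lane 12.8 89` prints `cuda128-sm89`, the lane used in the GPU command at the top of this page. On recent drivers the compute capability can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`; drop the dot so that `8.9` becomes `89`.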
The CPU image and Grace CUDA images (90, 100, 121) are multi-arch (amd64 + arm64). Docker picks the right architecture automatically. The other CUDA tags are x86_64 only.
The *-latest tags publish on releases and on manual CI dispatch from master; version tags pin a release.
For production, pin a version or sha tag rather than *-latest. Model ids also float: -m Qwen/Qwen3-4B resolves to whatever revision is tagged main at download time. The CLI has no revision flag; to pin a revision, use the Rust SDK’s with_hf_revision.
Image contract

- Entrypoint is the `mistralrs` binary; pass a subcommand and its flags as the container command.
- `mistralrs serve` listens on port 1234 by default (the image’s `EXPOSE`d port). To change it, change the flag and the mapping together: `serve -p 8080` with `-p 8080:8080`. There is no `PORT` environment variable.
- `HF_HOME=/data` is set in the image: mount a volume at `/data` to persist downloaded weights (they land in `/data/hub`). HF authentication for gated models: `-e HF_TOKEN=<token>`.
- Chat templates ship at `/chat_templates` for models that need one: `--chat-template /chat_templates/<file>.json`.

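
The port and cache rules above can be combined in one wrapper. This is a minimal sketch: `serve_on_port` is a hypothetical helper name, `hf-cache` is an example named volume, and the model id is only an illustration.

```shell
# Hypothetical helper: run `mistralrs serve` on a non-default port.
# The same port number goes to both the publish mapping and the serve flag,
# because the image has no PORT environment variable.
serve_on_port() {
  port="$1"   # e.g. 8080
  model="$2"  # e.g. Qwen/Qwen3-4B
  docker run --rm \
    -p "${port}:${port}" \
    -v hf-cache:/data \
    ghcr.io/ericlbuehler/mistral.rs:latest \
    serve -p "$port" -m "$model"
}
```

Calling `serve_on_port 8080 Qwen/Qwen3-4B` publishes the server on host port 8080 and keeps downloaded weights in the `hf-cache` volume. For gated models, add `-e HF_TOKEN=<token>` to the `docker run` arguments.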
Building an image
From a repository checkout:

    # CPU
    docker build -t mistralrs:latest -f Dockerfile .

    # CUDA (set the compute capability for your GPU)
    docker build -t mistralrs:cuda -f Dockerfile.cuda-all \
      --build-arg CUDA_COMPUTE_CAP=89 .

- `Dockerfile.cuda-all` accepts `CUDA_COMPUTE_CAP`, `BASE_TAG`, and `WITH_FEATURES` build args. The default base is CUDA 12.8.1 and default features are `cuda,cudnn`; CI builds add `flash-attn`, and release images add `cutile` on supported CUDA/SM pairs.
- `Dockerfile.cuda-13.0-ubi9` is a Red Hat UBI 9 variant for air-gapped and enterprise deployments.
- The first CUDA build is slow because flash-attention compilation takes a while; later builds use the layer cache.

Production deployment notes
Persist the cache. Weights are large enough that re-downloading on every restart is wasteful. Mount a named volume or host path at /data.
Health check. /health returns 200 when the server is up. Add a Docker healthcheck:

    HEALTHCHECK --interval=30s --timeout=5s --start-period=180s \
      CMD curl -fsS http://localhost:1234/health || exit 1

The generous `--start-period` matters: first-run model loading can take minutes.
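
The same endpoint can gate a deploy script from the host side. The sketch below polls `/health` until it answers or a deadline passes; `wait_for_health` is a hypothetical helper, and the 5-second poll interval and 180-second default deadline are assumptions chosen to mirror the start period above.

```shell
# Hypothetical helper: block until the server's /health endpoint returns
# success, or give up after a deadline (in seconds).
wait_for_health() {
  url="${1:-http://localhost:1234/health}"
  deadline="${2:-180}"
  elapsed=0
  # curl -f turns HTTP errors into a non-zero exit, so the loop keeps polling
  # while the server is still loading or not yet listening.
  until curl -fsS "$url" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$deadline" ] && return 1
    sleep 5
    elapsed=$((elapsed + 5))
  done
}
```

A deploy script might start the container in the background and then run `wait_for_health || echo "server did not become healthy" >&2` before sending traffic.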
Resource limits. Set --memory and --gpus on docker run to bound the container’s resources.
Video input. Install FFmpeg inside the image when serving video-capable models. See set up video input for the Docker snippet and runtime check.
Kubernetes
The pieces above translate directly:

- Use a Deployment with a readiness probe hitting `/health` (or a model-aware check; see the production checklist).
- Mount a PersistentVolumeClaim at `/data` for the Hugging Face cache.
- Use the NVIDIA device plugin and an `nvidia.com/gpu` resource request for CUDA.
- Use an initContainer to pre-download weights for fast pod startup.

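
As a starting point, the first three items might look like the manifest below. It is a sketch under stated assumptions: the names `mistralrs` and `hf-cache` are placeholders, the PersistentVolumeClaim is assumed to exist already, the cluster is assumed to run the NVIDIA device plugin, and the image tag and model id are examples only. The initContainer is left out because the pre-download command depends on how you fetch weights.

```shell
# Write a minimal Deployment manifest to the current directory.
# All resource names here are placeholders, not project conventions.
cat > mistralrs-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistralrs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistralrs
  template:
    metadata:
      labels:
        app: mistralrs
    spec:
      containers:
        - name: mistralrs
          image: ghcr.io/ericlbuehler/mistral.rs:cuda128-sm89-latest
          # The image entrypoint is the mistralrs binary, so args carry
          # the subcommand and its flags.
          args: ["serve", "-m", "Qwen/Qwen3-4B"]
          ports:
            - containerPort: 1234
          readinessProbe:
            httpGet:
              path: /health
              port: 1234
            periodSeconds: 10
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: hf-cache
              mountPath: /data
      volumes:
        - name: hf-cache
          persistentVolumeClaim:
            claimName: hf-cache
EOF
```

Apply it with `kubectl apply -f mistralrs-deployment.yaml`. The `-latest` tag keeps the example short; for production, swap in a pinned version tag as recommended under Published tags.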
There is no official Helm chart. Contributions welcome.
See also

- Production checklist: operational concerns regardless of container layer.
- Serve flag reference: all `mistralrs serve` options.