Troubleshooting

Before debugging setup issues, run mistralrs doctor. It reports detected hardware, compiled accelerator features, and Hugging Face connectivity.

For unlisted issues, file an issue on GitHub with a reproducer.

mistralrs: command not found after install

Installer-managed binaries are exposed at ~/.local/bin/mistralrs, backed by ~/.mistralrs/mistralrs. Open a new shell or run source "$HOME/.mistralrs/env". Manual Cargo installs use ~/.cargo/bin/mistralrs and may require source "$HOME/.cargo/env".
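To see which of the two install locations is missing from PATH, a small check like the following can help. It is only a sketch: the helper name dirs_missing_from_path is made up for illustration, and the two candidate directories are the ones named above.

```python
import os
import shutil
from pathlib import Path


def dirs_missing_from_path(path_value, candidates):
    """Return the candidate directories that are not listed in path_value."""
    entries = {os.path.normpath(p) for p in path_value.split(os.pathsep) if p}
    return [d for d in candidates if os.path.normpath(d) not in entries]


if __name__ == "__main__":
    home = Path.home()
    # Install locations taken from the paragraph above.
    candidates = [str(home / ".local" / "bin"), str(home / ".cargo" / "bin")]
    found = shutil.which("mistralrs")
    if found:
        print(f"mistralrs resolves to {found}")
    else:
        missing = dirs_missing_from_path(os.environ.get("PATH", ""), candidates)
        print("mistralrs is not on PATH; directories missing from PATH:", missing)
```

If a directory is reported as missing, sourcing the matching env file in the current shell should add it.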

Build fails with flash-attn feature enabled

Flash attention requires a GPU with compute capability 8.0 or higher (see hardware support). On older GPUs, drop flash-attn and rebuild with one of these feature sets:

  • cuda nccl cudnn on Linux with NCCL installed.
  • cuda cudnn otherwise.

The token must start with hf_. mistralrs login validates this before saving the token.

Gated repository (Gemma, LLaMA, FLUX.1-dev, etc.)

Accept the license on the model’s Hugging Face page, then save a token with mistralrs login. The token is stored at ~/.cache/huggingface/token (or $HF_HOME/token).

Add --quant 4. If the model is still too large, try --quant 2 or split it across GPUs with -n "0:N1;1:N2;...". See quantize a model.

Verify accelerator features are compiled in with mistralrs doctor. If cuda is missing, the binary was built without GPU support.

For CUDA decode throughput, also check whether paged attention is active. FlashInfer paged decode (FlashInfer is a CUDA attention backend) and CUDA graphs are both enabled by default on compatible CUDA paged decode paths.

CUDA graphs apply to supported single-token decode steps only. They do not speed up prompt prefill. The first time a graph shape is seen, mistral.rs pays warmup and capture overhead; steady-state decode is the part that can improve.

If graph capture or replay fails, mistral.rs logs a warning and disables CUDA graphs for that loaded pipeline. Set MISTRALRS_CUDA_GRAPHS=0 to compare with the normal CUDA path. See CUDA graphs.

max_tokens is most likely too low. Check finish_reason:

  • length - token limit reached.
  • stop - generation ended naturally (EOS token) or a configured stop token/string matched.
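The two cases above can be told apart programmatically. The sketch below assumes an OpenAI-style chat completion body in which each entry of choices carries a finish_reason field; the function name explain_finish and the sample payload are illustrative only.

```python
import json


def explain_finish(response):
    """Map each choice's finish_reason to a short diagnosis."""
    notes = []
    for choice in response.get("choices", []):
        reason = choice.get("finish_reason")
        if reason == "length":
            notes.append("truncated: token limit reached, raise max_tokens")
        elif reason == "stop":
            notes.append("complete: EOS or a configured stop sequence matched")
        else:
            notes.append(f"other: finish_reason={reason!r}")
    return notes


# Hand-written sample body, not real server output.
sample = json.loads('{"choices": [{"index": 0, "finish_reason": "length"}]}')
print(explain_finish(sample))
```

A "truncated" result means raising max_tokens in the request is the first thing to try.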

Check the Server listening on http://... line in the server output to confirm host and port.

The default allows any origin. Custom CORS configuration is only available programmatically through MistralRsServerRouterBuilder.

The default body limit is 50 MB and is not configurable via the CLI; set it programmatically through MistralRsServerRouterBuilder.

The UI is on by default. Check that --no-ui was not passed at startup, and that no reverse proxy is rewriting /ui.

The session expired (30-minute idle TTL) or was evicted (128-session cap, LRU). Long-lived sessions need explicit export/import via /v1/sessions/{id}. See persist sessions.
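To make the two failure modes concrete, here is a toy model of an idle TTL combined with an LRU cap. It is not mistral.rs code: the class SessionCacheModel and its methods are hypothetical, and only the 30-minute and 128-session figures come from the text above. Whether the real server treats the TTL boundary as inclusive is not stated, so the strict comparison here is an assumption.

```python
from collections import OrderedDict

IDLE_TTL_SECONDS = 30 * 60  # 30-minute idle TTL, from the text above
MAX_SESSIONS = 128          # session cap, from the text above


class SessionCacheModel:
    """Toy model of idle-TTL expiry plus LRU eviction (not mistral.rs code)."""

    def __init__(self, ttl=IDLE_TTL_SECONDS, cap=MAX_SESSIONS):
        self.ttl = ttl
        self.cap = cap
        self._last_used = OrderedDict()  # session id -> last access time

    def touch(self, session_id, now):
        """Record a use of session_id at time now (in seconds)."""
        self._expire(now)
        self._last_used[session_id] = now
        self._last_used.move_to_end(session_id)
        while len(self._last_used) > self.cap:
            self._last_used.popitem(last=False)  # evict least recently used

    def alive(self, session_id, now):
        """Return True if session_id would still exist at time now."""
        self._expire(now)
        return session_id in self._last_used

    def _expire(self, now):
        # Drop every session that has been idle for longer than the TTL.
        stale = [s for s, t in self._last_used.items() if now - t > self.ttl]
        for s in stale:
            del self._last_used[s]


cache = SessionCacheModel(ttl=1800, cap=2)
cache.touch("a", now=0)
cache.touch("b", now=10)
cache.touch("c", now=20)  # cap exceeded: "a" is the least recently used
print(cache.alive("a", now=20), cache.alive("c", now=20))
print(cache.alive("c", now=20 + 1801))
```

In this model a session can vanish either because it sat idle too long or because enough newer sessions pushed it out, which is why long-lived sessions are better exported explicitly.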

from mistralrs import Runner fails with ImportError

The wrong wheel was installed. pip install mistralrs installs the CPU (Linux/Windows) or Metal (macOS) wheel. For NVIDIA GPUs, install a CUDA wheel from the release with --find-links, choosing the +cudaNNN.smNN variant that matches your driver and GPU. See Python SDK getting started.
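One way to see which wheel is actually installed is to read the package metadata rather than importing the module. The sketch below assumes the distribution is named mistralrs (as in the pip command above) and that a CUDA wheel exposes its +cuda... part as the local segment of the version string; both helper names are hypothetical.

```python
from importlib import metadata


def local_build_tag(version):
    """Return the part after '+' in a version string, or None if absent."""
    _, sep, local = version.partition("+")
    return local if sep else None


def installed_build_tag(dist_name="mistralrs"):
    """Look up the installed distribution and return (version, local tag)."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Nothing installed under this name in the current environment.
        return None, None
    return version, local_build_tag(version)


print(installed_build_tag())
```

A local tag of None on an NVIDIA machine would suggest the default wheel is installed instead of a CUDA one.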

ModelBuilder::build() requires a tokio runtime

The SDK requires a running tokio runtime. Use #[tokio::main] or create a runtime with tokio::runtime::Runtime::new().