CLI reference

This page documents what the binary actually exposes. For complete and current help, run mistralrs --help and mistralrs <subcommand> --help.

| Subcommand | Purpose |
| --- | --- |
| mistralrs run | Load a model and open an interactive chat (or one-shot with -i). |
| mistralrs serve | Load a model and expose an OpenAI-compatible HTTP server. |
| mistralrs bench | Benchmark a model. |
| mistralrs tune | Recommend a quantization and device-mapping configuration. |
| mistralrs quantize | Generate UQFF files from a model. |
| mistralrs from-config | Load and run from a TOML configuration file. |
| mistralrs login | Save a Hugging Face token to ~/.cache/huggingface/token. |
| mistralrs doctor | Report system, hardware, and build information. |
| mistralrs cache | List or delete Hugging Face cache entries. |
| mistralrs completions | Generate shell completions (bash, zsh, fish, elvish, powershell). |

| Flag | Default | Purpose |
| --- | --- | --- |
| --seed <int> | not set | Sampling seed. |
| -l, --log <path> | not set | Log all requests/responses to a file. |
| --token-source <source> | cache | Token source: literal:<token>, env:<var>, path:<file>, cache, or none. |
| -v, --verbose | 0 | Increase startup logging. Use -v for debug details and -vv for trace-level file/cache internals. RUST_LOG overrides this. |

The following flags apply to subcommands that load or inspect a model (serve, run, bench, tune).

| Flag | Default | Purpose |
| --- | --- | --- |
| -m, --model-id <id> | required | Hugging Face repo id or local path. |
| -t, --tokenizer <path> | not set | Local tokenizer.json. |
| -a, --arch <arch> | auto-detect | Model architecture. |
| --dtype <dtype> | auto | auto, f16, bf16, f32. |
| --cpu | off | Force CPU-only inference. |
| -n, --device-layers <list> | auto | Per-device layer counts. Format: ORD:NUM;... (e.g. 0:32;1:32). |
| --topology <path> | not set | Topology YAML for per-layer placement and quantization. |
| --hf-cache <path> | not set | Custom Hugging Face cache directory. |
| --max-seq-len <n> | 4096 | Max sequence length used for automatic device mapping. |
| --max-batch-size <n> | 1 | Max batch size used for automatic device mapping. |
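
The ORD:NUM;... value for --device-layers contains semicolons, which most shells treat as command separators, so the value has to be quoted. As a minimal sketch, the helper below builds that string from a list of layer counts. The name device_layers is hypothetical and not part of mistralrs, and the sketch assumes device ordinals are consecutive and start at 0, as in the 0:32;1:32 example.

```shell
# Hypothetical helper: join per-device layer counts into ORD:NUM;ORD:NUM...
# Assumes device ordinals are consecutive and start at 0.
device_layers() {
  out=""
  ord=0
  for n in "$@"; do
    # Prepend "<previous>;" only when something has already been emitted.
    out="${out:+$out;}${ord}:${n}"
    ord=$((ord + 1))
  done
  printf '%s\n' "$out"
}

device_layers 32 32   # prints 0:32;1:32
```

The result can then be passed in quotes, for example as -n "$(device_layers 32 32)" on a run or serve command line.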

Accepted by serve, run, and bench; tune rejects them.

| Flag | Default | Purpose |
| --- | --- | --- |
| --no-kv-cache | off | Disable KV cache. |
| --matformer-config-path <path> | not set | Path to a MatFormer slice config (CSV/JSON). |
| --matformer-slice-name <name> | not set | MatFormer slice to load. Requires --matformer-config-path. |

Accepted by serve and run; bench rejects them at startup because it measures plain model generation.

| Flag | Default | Purpose |
| --- | --- | --- |
| --max-seqs <n> | 32 | Max concurrent sequences. |
| --prefix-cache-n <n> | 16 | Number of prefix caches to hold (0 to disable). |
| -c, --chat-template <path> | not set | Custom chat template (.json or .jinja). |
| -j, --jinja-explicit <path> | not set | Explicit Jinja template override. |
| --mcp-config <path> | not set | MCP client configuration for outbound servers. Also reads MCP_CONFIG_PATH if unset. |

| Flag | Default | Purpose |
| --- | --- | --- |
| --format <fmt> | auto-detect | plain, gguf, or ggml. |
| -f, --quantized-file <path> | not set | Quantized filename(s) for GGUF/GGML. Semicolon-separated for multiple. |
| --tok-model-id <id> | not set | Tokenizer source for quantized formats. |
| --gqa <n> | 1 | GQA value for GGML. |

| Flag | Purpose |
| --- | --- |
| --quant <value> | Quantization front-door. Numeric (2, 3, 4, 5, 6, 8) and ISQ names (e.g. q4k, afq8, fp8, mxfp4) prefer a prebuilt UQFF from mistralrs-community/<model>-UQFF, then fall back to ISQ. auto is accepted by serve, run, and bench; tune rejects auto because it is the recommender. Conflicts with --isq and --from-uqff. |
| --isq <type> | Lower-level in-situ quantization knob (no UQFF lookup). Numeric (2, 3, 4, 5, 6, 8) or format name (q4k, afq4, q8_0, etc.). |
| --from-uqff <path> | Load a pre-quantized UQFF file. |
| --isq-organization <org> | default or moqe. |
| --imatrix <path> | imatrix file. |
| --calibration-file <path> | Calibration data for imatrix generation. Conflicts with --imatrix. |

| Flag | Purpose |
| --- | --- |
| --lora <ids> | LoRA adapter id(s), semicolon-separated. |
| --xlora <id> | X-LoRA adapter id. Requires --xlora-order. Conflicts with --lora. |
| --xlora-order <path> | X-LoRA ordering JSON. |
| --tgt-non-granular-index <n> | X-LoRA target non-granular index. |

Accepted by serve and run. bench rejects these flags at startup.

| Flag | Default | Purpose |
| --- | --- | --- |
| --agent (alias --agentic) | off | One-flag agent: equivalent to --enable-search --enable-code-execution with a per-session temp workdir. The agentic loop runs up to 256 tool rounds by default. |
| --enable-search | off | Enable the built-in web search tool. |
| --search-embedding-model <name> | not set | Reranker model. Only embedding-gemma is accepted. |
| --enable-code-execution | off | Enable Python code execution (compiled in by default). |
| --code-exec-python <path> | python on Windows, python3 elsewhere | Python interpreter for code execution. |
| --code-exec-timeout <secs> | 30 | Code execution timeout in seconds. |
| --code-exec-workdir <path> | per-session temp dir | Code execution working directory. |
| --agent-permission <mode> | auto | auto, ask, or deny. Controls whether agent actions run automatically, require approval, or are denied. See agent permissions. --code-exec-permission is accepted as an alias. |

OS-level isolation applied to the code-execution subprocess. See sandbox reference for the layering and threat model.

| Flag | Default | Purpose |
| --- | --- | --- |
| --sandbox <mode> | auto | auto, on, or off. auto enables on Linux/macOS, no-op on Windows. |
| --sb-max-memory-mb <mb> | 2048 | Per-session memory cap (RLIMIT_AS, plus cgroup memory.max when available). |
| --sb-max-cpu-secs <secs> | 300 | Per-session CPU time cap (RLIMIT_CPU). |
| --sb-max-procs <n> | 64 | Per-session process/thread cap (RLIMIT_NPROC, plus cgroup pids.max). |
| --sandbox-network <mode> | loopback | none, loopback, or full. none denies socket(2) outright. |

| Flag | Default | Purpose |
| --- | --- | --- |
| --paged-attn <mode> | auto | auto, on, or off. |
| --pa-context-len <n> | not set | Allocate KV cache for this context length. |
| --pa-memory-mb <mb> | not set | GPU memory in MB for KV cache. Conflicts with --pa-context-len. |
| --pa-memory-fraction <f> | not set | GPU memory utilization fraction (0.0 to 1.0). |
| --pa-block-size <n> | not set | Tokens per block. |
| --pa-cache-type <type> | auto | KV cache quantization type. |

| Flag | Purpose |
| --- | --- |
| --max-edge <px> | Max edge length for image resizing (aspect ratio preserved). |
| --max-num-images <n> | Max images per request. |
| --max-image-length <px> | Max image dimension for device mapping. |

Flags specific to mistralrs serve:

| Flag | Default | Purpose |
| --- | --- | --- |
| --host <ip> | 0.0.0.0 | Bind address. |
| -p, --port <port> | 1234 | TCP port. |
| --no-ui | off | Disable the built-in web UI (mounted at /ui by default). |
| --mcp-port <port> | not set | Enable MCP server on a separate port. |
| --max-tool-rounds <n> | not set | Cap on agentic tool loop rounds. |
| --tool-dispatch-url <url> | not set | External URL for tool execution. |

CORS allowed origins and the request body limit (default 50 MB) are not exposed as CLI flags. They can be configured programmatically through MistralRsServerRouterBuilder in mistralrs-server-core.
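
Because --host defaults to 0.0.0.0, the server listens on all network interfaces unless told otherwise. The following is a minimal sketch of a loopback-only launch without the web UI, using only the flags listed on this page. The model id is a placeholder and the port number is arbitrary.

```shell
MODEL_ID="your-org/your-model"  # placeholder: substitute a real repo id or local path
HOST="127.0.0.1"                # loopback only, instead of the 0.0.0.0 default
PORT=8080                       # arbitrary choice; the default is 1234

if command -v mistralrs >/dev/null 2>&1; then
  # Runs in the foreground until interrupted.
  mistralrs serve -m "$MODEL_ID" --host "$HOST" -p "$PORT" --no-ui
else
  echo "mistralrs not found on PATH" >&2
fi
```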

Flags specific to mistralrs run:

| Flag | Purpose |
| --- | --- |
| -i, --input <text> | Send a single prompt non-interactively and exit. |
| --image <path> | Attach an image (repeatable, requires -i). |
| --audio <path> | Attach audio (repeatable, requires -i). |
| --video <path> | Attach video (repeatable, requires -i). |
| --thinking [bool] | Control thinking mode for models that support it. |

Flags specific to mistralrs bench:

| Flag | Default | Purpose |
| --- | --- | --- |
| --prompt-len <n> | 512 | Prompt length per iteration. |
| --gen-len <n> | 128 | Generation length per iteration. |
| --iterations <n> | 3 | Number of measured runs to average. |
| --warmup <n> | 1 | Number of warmup runs (discarded). |

Flags specific to mistralrs tune:

| Flag | Default | Purpose |
| --- | --- | --- |
| --profile <p> | balanced | quality, balanced, or fast. |
| --json | off | Emit JSON instead of a human-readable table. |
| --emit-config <path> | not set | Write the recommended settings as TOML. |
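
Since tune can write its recommendation as TOML and from-config loads a TOML file, the two can plausibly be chained. This is a sketch under one stated assumption: that the file written by --emit-config is accepted by from-config without edits, which this page does not confirm. The model id and output path are placeholders.

```shell
MODEL_ID="your-org/your-model"  # placeholder: substitute a real repo id or local path
CONFIG="tuned.toml"             # hypothetical output path

if command -v mistralrs >/dev/null 2>&1; then
  # Write the recommendation as TOML, then (assumption) load it unchanged.
  mistralrs tune -m "$MODEL_ID" --profile fast --emit-config "$CONFIG" &&
    mistralrs from-config -f "$CONFIG"
else
  echo "mistralrs not found on PATH" >&2
fi
```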

Flags specific to mistralrs quantize:

| Flag | Purpose |
| --- | --- |
| --isq <types> | ISQ levels to produce. Repeatable or comma-separated. |
| -o, --output <path> | Output file (single ISQ) or directory (multiple). |
| --no-readme | Skip README generation. |
| --uqff-base-model <id> | Base model id for the README. |
| --uqff-repo-id <id> | Hugging Face repo id for the README. |
| --isq-organization <org> | default or moqe. |
| --imatrix <path> / --calibration-file <path> | Quantization enhancement options. |

Flags specific to mistralrs from-config:

| Flag | Purpose |
| --- | --- |
| -f, --file <path> | TOML configuration file (required). |

Flags specific to mistralrs login:

| Flag | Purpose |
| --- | --- |
| --token <token> | Provide the token non-interactively. Must start with hf_. |

Without --token, the command prompts interactively. The token is saved to ~/.cache/huggingface/token (or $HF_HOME/token if set).
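
The token location rule above can be expressed in a few lines of shell. This is a small sketch, not something mistralrs ships; hf_token_path is a hypothetical helper name, and an empty HF_HOME is treated the same as an unset one.

```shell
# Hypothetical helper: print where the saved token is expected to live.
# Uses $HF_HOME/token when HF_HOME is set, else ~/.cache/huggingface/token.
hf_token_path() {
  printf '%s\n' "${HF_HOME:-$HOME/.cache/huggingface}/token"
}

if [ -f "$(hf_token_path)" ]; then
  echo "token file present: $(hf_token_path)"
else
  echo "no token file at: $(hf_token_path)"
fi
```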

| Subcommand | Purpose |
| --- | --- |
| mistralrs cache list | List cached model entries. |
| mistralrs cache delete -m <id> | Remove a cache entry. |

Run mistralrs doctor after installation or when GPU acceleration, build features, or Hugging Face connectivity look wrong.

| Flag | Purpose |
| --- | --- |
| --json | Emit JSON instead of human-readable output. |

Commonly used environment variables:

| Variable | Purpose |
| --- | --- |
| RUST_LOG | Override the tracing log filter (e.g. mistralrs_core=debug,tower_http=info). |
| HF_HOME | Hugging Face cache root. |
| HF_TOKEN | Override the cached token at runtime. |
| HF_HUB_OFFLINE | HF_HUB_OFFLINE=1 disables all network calls to the Hugging Face Hub; files are loaded from the local cache only. |
| MCP_CONFIG_PATH | MCP config path (alternative to --mcp-config). |
| MISTRALRS_SANDBOX | auto/on/off. Overrides the sandbox only when the resolved mode is auto; on and off win. |
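
These variables can be set for a single invocation by prefixing the command, which leaves the rest of the shell session untouched. A minimal sketch of an offline one-shot prompt with debug logging follows. The model id is a placeholder and would need to be in the local cache already, since HF_HUB_OFFLINE=1 prevents any download.

```shell
MODEL_ID="your-org/your-model"  # placeholder: must already be in the local cache

if command -v mistralrs >/dev/null 2>&1; then
  # The VAR=value prefixes apply to this one command only.
  HF_HUB_OFFLINE=1 RUST_LOG="mistralrs_core=debug,tower_http=info" \
    mistralrs run -m "$MODEL_ID" -i "Hello"
else
  echo "mistralrs not found on PATH" >&2
fi
```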

See environment variables for the full list.