TOML configuration
mistralrs from-config -f <path> reads a TOML file. The top-level command field selects serve or run. Every key maps to a CLI flag of the same subcommand; the mapping is listed per table below. For per-flag semantics, see the generated CLI reference.
Minimal example
```toml
command = "serve"

[server]
host = "0.0.0.0"
port = 1234

[[models]]
model_id = "Qwen/Qwen3-4B"

[models.quantization]
quant = "4"
```

mistralrs from-config -f this.toml runs the server.
Top-level fields
| Field | Type | Applies to | Purpose |
|---|---|---|---|
| command | string | both | "serve" or "run". |
| default_model_id | string | serve | Model id treated as the default. Must match one of the [[models]] entries. |
| thinking | bool | run | Force thinking mode on or off for models that support it (alias: enable_thinking). Omit to defer to the chat template default. Maps to --thinking on mistralrs run. |
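As a rough sketch of how the top-level fields combine, the config below selects run and forces thinking mode on. The model id is reused from the minimal example; whether thinking has any effect depends on the model.

```toml
# Illustrative only: top-level fields for the run subcommand.
command = "run"
thinking = true   # omit this key to defer to the chat template default

[[models]]
model_id = "Qwen/Qwen3-4B"
```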
[global] section
| Field | CLI flag | Default | Purpose |
|---|---|---|---|
| seed | --seed | not set | Sampling seed. |
| log | -l, --log | not set | Log file for requests/responses. |
| token_source | --token-source | cache | Token source string (literal:<token>, env:<var>, path:<file>, cache, none). |
-v/--verbose has no TOML equivalent; use RUST_LOG instead.
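A [global] table using these keys might look like the following. The seed value, the log path, and the environment variable name are assumptions chosen for illustration; only the key names and the env:<var> form come from the table above.

```toml
[global]
seed = 42                       # assumed integer; any fixed value works as an example
log = "requests.log"            # hypothetical path for the request/response log
token_source = "env:HF_TOKEN"   # env:<var> form; HF_TOKEN is an assumed variable name
```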
[runtime] section
| Field | CLI flag | Default | Purpose |
|---|---|---|---|
| max_seqs | --max-seqs | 32 | Max concurrent sequences. |
| no_kv_cache | --no-kv-cache | false | Disable KV cache entirely. |
| prefix_cache_n | --prefix-cache-n | 16 | Prefix caches retained (0 to disable). |
| chat_template | -c, --chat-template | not set | Custom chat template file (.json or .jinja), applied to every model. Per-model chat_template in [[models]] overrides it. |
| jinja_explicit | -j, --jinja-explicit | not set | Explicit Jinja template override. Per-model jinja_explicit also exists. |
| matformer_config_path | --matformer-config-path | not set | MatFormer (nested-submodel) slice config (CSV/JSON). |
| matformer_slice_name | --matformer-slice-name | not set | MatFormer slice to load. Requires matformer_config_path. |
| mtp_model | --mtp-model | not set | MTP (multi-token prediction) assistant model id or path. |
| mtp_n_predict | --mtp-n-predict | not set | MTP draft tokens proposed per target step. |
| mcp_config | --mcp-config | not set | MCP (Model Context Protocol) client configuration JSON for outbound servers. Also reads MCP_CONFIG_PATH if unset. |
| agent | --agent (alias --agentic) | false | Shortcut for enable_search = true + enable_code_execution = true + enable_shell = true. |
| enable_search | --enable-search | false | Enable the built-in web search tool. |
| search_embedding_model | --search-embedding-model | not set | Search reranker; embedding-gemma is the only accepted value. Requires enable_search (or agent). |
| enable_code_execution | --enable-code-execution | false | Enable Python code execution. |
| code_exec_python | --code-exec-python | python on Windows, python3 elsewhere | Python interpreter. Requires enable_code_execution (or agent). |
| code_exec_timeout | --code-exec-timeout | 60 | Per-call timeout in seconds. Requires enable_code_execution (or agent). |
| code_exec_workdir | --code-exec-workdir | per-session temp dir | Code execution working directory. Requires enable_code_execution (or agent). |
| enable_shell | --enable-shell | false | Enable the built-in shell tool for Responses tools[*].type="shell". |
| shell_path | --shell-path | /bin/sh on Unix, cmd on Windows | Shell executable. Requires enable_shell (or agent). |
| shell_timeout | --shell-timeout | 600 | Per-call shell timeout in seconds. Requires enable_shell (or agent). |
| shell_workdir | --shell-workdir | per-session temp dir | Root directory for per-session shell working directories. Requires enable_shell (or agent). |
| skills_dir | --skills-dir | system temp dir | Directory for uploaded OpenAI-compatible Skills. Requires enable_shell (or agent). |
| agent_permission | --agent-permission | auto | auto, ask, or deny: whether model-requested agent actions run automatically, require approval, or are denied. code_exec_permission / --code-exec-permission are accepted as aliases. |
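To illustrate how the dependent keys fit together, here is a minimal [runtime] sketch that enables code execution only and tightens its timeout. The numeric values are arbitrary examples, and the value types (integers, booleans, strings) are assumed from the defaults listed above.

```toml
[runtime]
max_seqs = 16
prefix_cache_n = 0             # 0 disables prefix caching

# code_exec_* keys require enable_code_execution = true (or agent = true).
enable_code_execution = true
code_exec_timeout = 30         # seconds per call
agent_permission = "ask"       # require approval for model-requested actions
```

Setting agent = true instead would also turn on the search and shell tools, so the narrower enable_code_execution flag is the more conservative choice when only Python execution is needed.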
[server] section (serve only)
| Field | CLI flag | Default | Purpose |
|---|---|---|---|
| host | --host | 0.0.0.0 | Bind address. |
| port | -p, --port | 1234 | TCP port. |
| no_ui | --no-ui | false | Disable the built-in web UI (mounted at /ui by default). |
| mcp_port | --mcp-port | not set | Also expose the loaded model as an MCP server on this port (JSON-RPC 2.0 at POST /mcp). See serve over MCP. |
| max_tool_rounds | --max-tool-rounds | not set | Default cap on agentic tool loop rounds. Per-request values from the HTTP API override it; the safety cap is 256 when unset. |
| tool_dispatch_url | --tool-dispatch-url | not set | URL to POST tool calls to for server-side execution. Only configurable server-side, never per-request. |
| disable_access_log | --disable-access-log | false | Disable info-level HTTP access logs. |
| access_log_format | --access-log-format | text | Access-log format: text or json. |
| access_log_health | --access-log-health | false | Include health, metrics, docs, and UI requests in HTTP access logs. |
| disable_request_id_header | --disable-request-id-header | false | Stop echoing x-request-id on responses. |
| disable_metrics | --disable-metrics | false | Disable Prometheus HTTP metrics and recorder installation. |
The MCP client configuration (mcp_config) lives under [runtime], not [server]: it applies to run as well as serve.
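A [server] table for a locally bound, headless deployment could be sketched as follows. The address, port, and round cap are illustrative values, not recommendations; the key names are taken from the table above.

```toml
[server]
host = "127.0.0.1"           # bind to loopback only (example value)
port = 8080                  # example port
no_ui = true                 # do not mount the built-in web UI
access_log_format = "json"   # text or json
access_log_health = true     # also log health, metrics, docs, and UI requests
max_tool_rounds = 8          # default cap; per-request values override it
```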
[paged_attn] section
| Field | CLI flag | Default | Purpose |
|---|---|---|---|
| mode | --paged-attn | auto | auto (on for CUDA, off for Metal/CPU), on, or off. |
| context_len | --pa-context-len | not set | Allocate KV cache for this context length. |
| memory_mb | --pa-memory-mb | not set | KV cache budget in MB. Conflicts with context_len. |
| memory_fraction | --pa-memory-fraction | not set | KV cache budget as fraction of VRAM (0.0 to 1.0). Conflicts with context_len and memory_mb. |
| block_size | --pa-block-size | not set | Tokens per block. |
| cache_type | --pa-cache-type | auto | KV cache quantization type. |
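Because context_len, memory_mb, and memory_fraction conflict with one another, a config should pick exactly one sizing key. The sketch below uses memory_fraction; the 0.8 value is an arbitrary example, and writing mode as a quoted string is an assumption.

```toml
[paged_attn]
mode = "on"
memory_fraction = 0.8   # fraction of VRAM; do not also set context_len or memory_mb
```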
[sandbox] section
OS-level isolation for the code-execution subprocess. Mechanics and threat model: sandbox reference.

| Field | CLI flag | Default | Purpose |
|---|---|---|---|
| mode | --sandbox | auto | auto (on for Linux/macOS, no-op elsewhere), on (missing isolation is a hard error), or off. |
| profile | --sandbox-profile | profile-dependent | developer for agent/code/shell tools, otherwise restricted. |
| max_memory_mb | --sb-max-memory-mb | 2048 | Per-session memory cap in MiB. |
| max_cpu_secs | --sb-max-cpu-secs | 600 | Per-session CPU time cap in seconds. When rlimits apply, this is raised before execution to at least the enabled code or shell timeout. |
| max_procs | --sb-max-procs | 64 | Per-session process/thread cap. |
| network | --sandbox-network | profile-dependent | none, loopback, or full. Defaults to full for developer, loopback for restricted. |
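A stricter-than-default sandbox might be written like this. All numbers are illustrative, and the string spellings of mode, profile, and network are assumed to match the values named in the table.

```toml
[sandbox]
mode = "on"              # fail hard if isolation is unavailable
profile = "restricted"
max_memory_mb = 1024     # MiB per session
max_cpu_secs = 120       # may be raised to at least the enabled code or shell timeout
max_procs = 32
network = "none"         # none, loopback, or full
```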
[[models]] array
Each entry defines one loaded model.

| Field | Type | Required | Purpose |
|---|---|---|---|
| kind | enum | no | Defaults to auto. Set to text, multimodal, diffusion, speech, or embedding only to force a loader. |
| model_id | string | yes | Hugging Face id or local path. |
| tokenizer | path | no | Local tokenizer.json. |
| arch | enum | no | Architecture override (text models). |
| dtype | enum | no | auto, f16, bf16, f32. |
| chat_template | path | no | Chat template override for this model. |
| jinja_explicit | path | no | Jinja override for this model. |
| matformer_config_path | path | no | MatFormer slice config (CSV/JSON). |
| matformer_slice_name | string | no | MatFormer slice to load. |
Each [[models]] entry can carry nested sections whose field shapes mirror the corresponding CLI flags:
| Section | Purpose |
|---|---|
| [models.format] | Weight format selection (e.g. GGUF file/repo). |
| [models.adapter] | LoRA/X-LoRA adapter configuration. |
| [models.quantization] | Quantization: quant (front-door, same as --quant), isq (explicit ISQ, same as --isq), from_uqff, isq_organization, imatrix. |
| [models.device] | Device placement: cpu, device_layers, topology, hf_cache, max_seq_len, max_batch_size. cpu must be consistent across every entry. |
| [models.multimodal] | Multimodal load-time caps (image/video/audio limits). |
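As a sketch of how nested sections attach to a single entry, each [models.*] table below belongs to the most recent [[models]] header above it. The integer types for max_seq_len and max_batch_size are assumptions; check the generated CLI reference for the exact field shapes.

```toml
[[models]]
kind = "text"                # force the text loader; the default is auto
model_id = "Qwen/Qwen3-4B"
dtype = "bf16"

[models.quantization]
quant = "4"                  # front door; do not combine with isq or from_uqff

[models.device]
max_seq_len = 4096           # assumed integer, illustrative value
max_batch_size = 1           # assumed integer, illustrative value
```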
Multi-model example
```toml
command = "serve"
default_model_id = "Qwen/Qwen3-4B"

[server]
host = "0.0.0.0"
port = 1234

[runtime]
enable_search = true
search_embedding_model = "embedding-gemma"

[[models]]
model_id = "Qwen/Qwen3-4B"

[models.quantization]
quant = "4"

[[models]]
model_id = "google/gemma-4-E4B-it"

[models.quantization]
quant = "4"
```

Validation
Invalid configs abort startup with a message identifying the problem. The checks are:
- At least one entry in [[models]].
- default_model_id matches a model_id in [[models]].
- cpu is consistent across all models when set.
- search_embedding_model requires enable_search = true (or agent = true).
- code_exec_python, code_exec_timeout, and code_exec_workdir each require enable_code_execution = true (or agent = true).
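For contrast, the following config would be expected to fail the search_embedding_model check described above, since neither enable_search nor agent is set. The exact wording of the resulting error is not reproduced here.

```toml
command = "serve"

[runtime]
# enable_search and agent are both left at their default (false),
# so this key has nothing to attach to and validation should reject it.
search_embedding_model = "embedding-gemma"

[[models]]
model_id = "Qwen/Qwen3-4B"
```

Adding enable_search = true (or agent = true) under [runtime] would satisfy the rule.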
CLI usage notes
Flag interactions that hold on the command line and as TOML keys:
- quant (CLI --quant, TOML key quant) is the front door: it tries a prebuilt UQFF (Universal Quantized File Format) first and falls back to ISQ (in-situ quantization). It conflicts with isq (--isq, the explicit ISQ level) and from_uqff (--from-uqff).
- mistralrs tune rejects quant = "auto" (--quant auto) because tune is the recommender.
- --calibration-file conflicts with --imatrix.
- --xlora conflicts with --lora.
- --xlora-order and --tgt-non-granular-index require --xlora; --xlora alone is accepted.
- --matformer-slice-name requires --matformer-config-path.
- mistralrs run: --image, --video, and --audio require -i/--input.
- mistralrs bench: --prompt-len and --depth accept comma-separated values for sweeps.
  - Each --prompt-len value produces a prefill measurement at that prompt length.
  - Each --depth value produces a decode measurement that prefills depth tokens and then generates --gen-len tokens.
  - --depth must be greater than 0 when --gen-len is greater than 0.
Server behavior notes
- CORS and body limit. Not exposed as CLI flags or TOML keys. Defaults: any origin; methods GET, POST, PUT, DELETE; allowed headers Content-Type, Authorization, x-api-key, anthropic-version, anthropic-beta, x-request-id; exposed headers x-request-id; 50 MB request body limit. Configure programmatically through MistralRsServerRouterBuilder in mistralrs-server-core.
- Authentication. mistral.rs does not implement authentication. Put a reverse proxy (nginx, Caddy, Traefik) in front for auth and TLS. OpenAI-protocol clients always send Authorization: Bearer ... because the OpenAI SDK requires an API key; mistral.rs does not validate the header.
- Logging and metrics. Access logs are written to normal server stdout/stderr by default, with request ids and route/status/latency metadata. GET /metrics exposes Prometheus HTTP metrics by default. See observability.
- Payload logging. -v enables debug detail and -vv trace-level file/cache internals; RUST_LOG module filters (e.g. RUST_LOG=mistralrs_core=debug,tower_http=info) override both. -l <path> logs all requests and responses to a file.