`mistralrs-cli` can run entirely from a single TOML configuration file. The config file can define multiple models and mirrors the CLI options.
```bash
mistralrs from-config --file path/to/config.toml
```
A minimal `config.toml`:

```toml
command = "serve"

[server]
port = 1234
ui = true

[runtime]
max_seqs = 32

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

[models.quantization]
in_situ_quant = "q4k"
```
| Option | Commands | Description |
|--------|----------|-------------|
| `command` | all | Required. Either `"serve"` or `"run"` |
| `enable_thinking` | `run` | Enable thinking mode (default: `false`) |
| `default_model_id` | `serve` | Default model ID for API requests (must match a `model_id` in `[[models]]`) |
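As a sketch, a serve config that pins the default model might look like this (the Qwen model ID is just a placeholder):

```toml
# Top-level keys sit at the root of config.toml.
command = "serve"
default_model_id = "Qwen/Qwen3-4B"  # must match a model_id in [[models]]

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
```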
Options in the `[global]` table apply to the entire run.
| Option | Default | Description |
|--------|---------|-------------|
| `seed` | none | Random seed for reproducibility |
| `log` | none | Log all requests/responses to this file path |
| `token_source` | `"cache"` | HuggingFace auth: `"cache"`, `"none"`, `"literal:<token>"`, `"env:<var>"`, `"path:<file>"` |
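A sketch of a `[global]` block; the log path and the `HF_TOKEN` variable name are placeholders, not documented defaults:

```toml
[global]
seed = 42                      # reproducible sampling
log = "requests.log"           # placeholder path for request/response logging
token_source = "env:HF_TOKEN"  # read the HuggingFace token from an (assumed) env var
```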
The `[server]` table configures the HTTP server.
| Option | Default | Description |
|--------|---------|-------------|
| `port` | 1234 | HTTP server port |
| `host` | `"0.0.0.0"` | Bind address |
| `ui` | `false` | Serve built-in web UI at `/ui` |
| `mcp_port` | none | MCP protocol server port (enables MCP if set) |
| `mcp_config` | none | MCP client configuration file path |
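For example, a `[server]` block that also enables MCP might look like this sketch (the MCP port and config path are placeholder values):

```toml
[server]
host = "0.0.0.0"
port = 1234
ui = true                        # web UI served at /ui
mcp_port = 4321                  # placeholder port; setting it enables the MCP server
mcp_config = "mcp-clients.json"  # placeholder path to an MCP client config file
```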
The `[runtime]` table sets runtime inference options.
| Option | Default | Description |
|--------|---------|-------------|
| `max_seqs` | 32 | Maximum concurrent sequences |
| `no_kv_cache` | `false` | Disable KV cache entirely |
| `prefix_cache_n` | 16 | Number of prefix caches to hold (0 to disable) |
| `chat_template` | none | Custom chat template file (`.json` or `.jinja`) |
| `jinja_explicit` | none | Explicit JINJA template override |
| `enable_search` | `false` | Enable web search |
| `search_embedding_model` | none | Embedding model for search (e.g., `"embedding-gemma"`) |
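A `[runtime]` sketch combining several of these options; the chat template path is a placeholder:

```toml
[runtime]
max_seqs = 64                          # allow more concurrent sequences than the default 32
prefix_cache_n = 0                     # disable prefix caching
chat_template = "chat_template.jinja"  # placeholder path to a custom template
enable_search = true
search_embedding_model = "embedding-gemma"
```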
The `[paged_attn]` table configures PagedAttention.
| Option | Default | Description |
|--------|---------|-------------|
| `mode` | `"auto"` | `"auto"` (enabled on CUDA, disabled on Metal), `"on"`, or `"off"` |
| `context_len` | none | Allocate KV cache for this context length |
| `memory_mb` | none | GPU memory to allocate, in MB (conflicts with `context_len`) |
| `memory_fraction` | none | GPU memory utilization, 0.0-1.0 (conflicts with the options above) |
| `block_size` | 32 | Tokens per block |
| `cache_type` | `"auto"` | KV cache type |
Note: `context_len`, `memory_mb`, and `memory_fraction` are mutually exclusive. If none of them is specified, PagedAttention defaults to using 90% of available VRAM.
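Since the three sizing options are mutually exclusive, set at most one of them. A sketch using an arbitrary 80% memory fraction:

```toml
[paged_attn]
mode = "on"
memory_fraction = 0.8  # use 80% of GPU memory; omit context_len and memory_mb
block_size = 32
```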
Define one or more models. Each `[[models]]` entry creates a new model.
| Option | Required | Description |
|--------|----------|-------------|
| `kind` | yes | Model type: `"auto"`, `"text"`, `"vision"`, `"diffusion"`, `"speech"`, `"embedding"` |
| `model_id` | yes | HuggingFace model ID or local path |
| `tokenizer` | no | Path to a local `tokenizer.json` |
| `arch` | no | Model architecture (auto-detected if not specified) |
| `dtype` | no (default `"auto"`) | Data type: `"auto"`, `"f16"`, `"bf16"`, `"f32"` |
| `chat_template` | no | Per-model chat template override |
| `jinja_explicit` | no | Per-model JINJA template override |
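A sketch of a single `[[models]]` entry that overrides the data type and chat template; the template path is a placeholder:

```toml
[[models]]
kind = "text"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
dtype = "bf16"
chat_template = "llama3.jinja"  # placeholder per-model template override
```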
The `[models.format]` table selects the model format for quantized (GGUF/GGML) checkpoints.

| Option | Default | Description |
|--------|---------|-------------|
| `format` | `"auto"` | `"plain"` (safetensors), `"gguf"`, or `"ggml"` |
| `quantized_file` | none | Quantized filename(s) for GGUF/GGML (semicolon-separated) |
| `tok_model_id` | none | Model ID for the tokenizer when using a quantized format |
| `gqa` | 1 | GQA value for GGML models |
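These options attach to a `[[models]]` entry. A sketch with hypothetical multi-part GGUF filenames to show the semicolon-separated form:

```toml
[[models]]
kind = "text"
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"

[models.format]
format = "gguf"
# Multiple GGUF files are listed semicolon-separated (hypothetical filenames):
quantized_file = "model-00001-of-00002.gguf;model-00002-of-00002.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
```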
Adapter options (LoRA / X-LoRA):

| Option | Description |
|--------|-------------|
| `lora` | LoRA adapter ID(s), semicolon-separated |
| `xlora` | X-LoRA adapter ID (conflicts with `lora`) |
| `xlora_order` | X-LoRA ordering JSON file (requires `xlora`) |
| `tgt_non_granular_index` | Target non-granular index for X-LoRA |
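A LoRA sketch. The `[models.adapters]` table name is an assumption (only the option names above are documented here), and the adapter ID is a placeholder:

```toml
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Table name is assumed; replace with the tool's actual adapter table if it differs.
[models.adapters]
lora = "my-org/llama3.2-lora-adapter"  # placeholder adapter ID
```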
The `[models.quantization]` table controls quantization.

| Option | Description |
|--------|-------------|
| `in_situ_quant` | ISQ level: `"4"`, `"8"`, `"q4_0"`, `"q4k"`, `"q6k"`, etc. |
| `from_uqff` | UQFF file(s) to load (semicolon-separated) |
| `isq_organization` | ISQ strategy: `"default"` or `"moqe"` |
| `imatrix` | imatrix file for enhanced quantization |
| `calibration_file` | Calibration file for imatrix generation |
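A `[models.quantization]` sketch combining ISQ with an imatrix file; the imatrix filename is a placeholder:

```toml
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"

[models.quantization]
in_situ_quant = "q4k"
imatrix = "imatrix.dat"       # placeholder imatrix file for enhanced quantization
isq_organization = "default"
```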
The `[models.device]` table controls device placement and mapping.

| Option | Default | Description |
|--------|---------|-------------|
| `cpu` | `false` | Force CPU-only (must be consistent across all models) |
| `device_layers` | auto | Manual layer mapping; each entry is `"ORD:NUM"` (device ordinal and number of layers), e.g. `["0:10", "1:20"]` |
| `topology` | none | Topology YAML file |
| `hf_cache` | none | Custom HuggingFace cache directory |
| `max_seq_len` | 4096 | Max sequence length for automatic device mapping |
| `max_batch_size` | 1 | Max batch size for automatic device mapping |
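A `[models.device]` sketch; the topology and cache paths are placeholders, and the mapping budgets are arbitrary example values:

```toml
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.1-70B-Instruct"

[models.device]
topology = "topology.yml"    # placeholder topology YAML
hf_cache = "/data/hf-cache"  # placeholder HuggingFace cache directory
max_seq_len = 8192           # budget used for automatic device mapping
max_batch_size = 2
```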
The `[models.vision]` table sets vision model options.

| Option | Description |
|--------|-------------|
| `max_edge` | Maximum edge length for image resizing |
| `max_num_images` | Maximum images per request |
| `max_image_length` | Maximum image dimension used for device mapping |
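A `[models.vision]` sketch; the `max_edge` value is an arbitrary example:

```toml
[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"

[models.vision]
max_edge = 1024       # example value: maximum edge length for image resizing
max_num_images = 4
```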
A complete multi-model server configuration:

```toml
command = "serve"

[global]
seed = 42

[server]
host = "0.0.0.0"
port = 1234
ui = true

[runtime]
max_seqs = 32
enable_search = true
search_embedding_model = "embedding-gemma"

[paged_attn]
mode = "auto"

[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
dtype = "auto"

[models.quantization]
in_situ_quant = "q4k"

[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"

[models.vision]
max_num_images = 4

[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"
```
Interactive mode:

```toml
command = "run"
enable_thinking = true

[runtime]
max_seqs = 16

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
```
Serving a GGUF model:

```toml
command = "serve"

[server]
port = 1234

[[models]]
kind = "text"
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"

[models.format]
format = "gguf"
quantized_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
```
Splitting a large model across two GPUs with manual layer mapping and in-situ quantization:

```toml
command = "serve"

[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.1-70B-Instruct"

[models.device]
device_layers = ["0:40", "1:40"]

[models.quantization]
in_situ_quant = "q4k"
```
Constraints:

- `cpu` must be consistent across all models if specified
- `default_model_id` (serve only) must match a `model_id` in `[[models]]`
- `search_embedding_model` requires `enable_search = true`