mistralrs-cli can run entirely from a single TOML configuration file. This config supports multiple models and mirrors the CLI options.
```bash
mistralrs from-config --file path/to/config.toml
```
command = "serve"
[server]
port = 1234
ui = true
[runtime]
max_seqs = 32
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
[models.quantization]
in_situ_quant = "q4k"
Top-level options (set at the root of the config file):

| Option | Commands | Description |
|---|---|---|
| `command` | all | Required. Either `"serve"` or `"run"` |
| `enable_thinking` | `run` | Enable thinking mode (default: `false`) |
| `default_model_id` | `serve` | Default model ID for API requests (must match a `model_id` in `[[models]]`) |
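For instance, a sketch of a `serve` config that pins the default model used for API requests when several models are loaded (model IDs reused from the examples below):

```toml
command = "serve"
default_model_id = "Qwen/Qwen3-4B"   # must match one of the [[models]] entries

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"
```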
Global options (`[global]`) that apply to the entire run.

| Option | Default | Description |
|---|---|---|
| `seed` | none | Random seed for reproducibility |
| `log` | none | Log all requests/responses to this file path |
| `token_source` | `"cache"` | HuggingFace auth: `"cache"`, `"none"`, `"literal:<token>"`, `"env:<var>"`, `"path:<file>"` |
HTTP server configuration (`[server]`).

| Option | Default | Description |
|---|---|---|
| `port` | 1234 | HTTP server port |
| `host` | `"0.0.0.0"` | Bind address |
| `ui` | `false` | Serve the built-in web UI at `/ui` |
| `mcp_port` | none | MCP protocol server port (enables MCP if set) |
| `mcp_config` | none | MCP client configuration file path |
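For example, a sketch of a `[server]` block that also enables the MCP server (the MCP port and client config path are arbitrary placeholders):

```toml
[server]
host = "0.0.0.0"
port = 1234
ui = true
mcp_port = 4321                  # arbitrary example port; setting it enables MCP
mcp_config = "mcp-clients.json"  # placeholder path to an MCP client configuration file
```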
Runtime inference options (`[runtime]`).

| Option | Default | Description |
|---|---|---|
| `max_seqs` | 32 | Maximum concurrent sequences |
| `no_kv_cache` | `false` | Disable the KV cache entirely |
| `prefix_cache_n` | 16 | Number of prefix caches to hold (0 to disable) |
| `chat_template` | none | Custom chat template file (`.json` or `.jinja`) |
| `jinja_explicit` | none | Explicit JINJA template override |
| `enable_search` | `false` | Enable web search |
| `search_embedding_model` | none | Embedding model for search (e.g., `"embedding-gemma"`) |
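As an illustration, a `[runtime]` block that tunes concurrency and caching and overrides the chat template (the template path is a placeholder):

```toml
[runtime]
max_seqs = 64                              # allow more concurrent sequences
prefix_cache_n = 8                         # hold fewer prefix caches; 0 disables prefix caching
chat_template = "templates/custom.jinja"   # placeholder path to a custom chat template
```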
PagedAttention configuration (`[paged_attn]`).

| Option | Default | Description |
|---|---|---|
| `mode` | `"auto"` | `"auto"` (CUDA on, Metal off), `"on"`, or `"off"` |
| `context_len` | none | Allocate KV cache for this context length |
| `memory_mb` | none | GPU memory to allocate, in MB (conflicts with `context_len`) |
| `memory_fraction` | none | GPU memory utilization, 0.0-1.0 (conflicts with the options above) |
| `block_size` | 32 | Tokens per block |
| `cache_type` | `"auto"` | KV cache type |

Note: `context_len`, `memory_mb`, and `memory_fraction` are mutually exclusive. If none of them is specified, the allocation defaults to 90% of available VRAM.
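For instance, a sketch that forces PagedAttention on and caps it at 80% of GPU memory (remember to set only one of the three sizing options):

```toml
[paged_attn]
mode = "on"
memory_fraction = 0.8   # use at most 80% of GPU memory; omit context_len and memory_mb
block_size = 32
```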
Define one or more models. Each `[[models]]` entry loads an additional model.

| Option | Required / Default | Description |
|---|---|---|
| `kind` | yes | Model type: `"auto"`, `"text"`, `"vision"`, `"diffusion"`, `"speech"`, `"embedding"` |
| `model_id` | yes | HuggingFace model ID or local path |
| `tokenizer` | no | Path to a local `tokenizer.json` |
| `arch` | no | Model architecture (auto-detected if not specified) |
| `dtype` | `"auto"` | Data type: `"auto"`, `"f16"`, `"bf16"`, `"f32"` |
| `chat_template` | no | Per-model chat template override |
| `jinja_explicit` | no | Per-model JINJA template override |
Model format options (`[models.format]`) for loading quantized GGUF/GGML files.

| Option | Default | Description |
|---|---|---|
| `format` | auto | `"plain"` (safetensors), `"gguf"`, or `"ggml"` |
| `quantized_file` | none | Quantized filename(s) for GGUF/GGML (semicolon-separated) |
| `tok_model_id` | none | Model ID for the tokenizer when using a quantized format |
| `gqa` | 1 | GQA value for GGML models |
LoRA and X-LoRA adapter options.

| Option | Description |
|---|---|
| `lora` | LoRA adapter ID(s), semicolon-separated |
| `xlora` | X-LoRA adapter ID (conflicts with `lora`) |
| `xlora_order` | X-LoRA ordering JSON file (requires `xlora`) |
| `tgt_non_granular_index` | Target non-granular index for X-LoRA |
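A sketch of a model with a LoRA adapter attached. The `[models.adapter]` table name and the adapter ID are assumptions for illustration only, not confirmed by the option list above:

```toml
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Assumed table name; the adapter options may live under a different key.
[models.adapter]
lora = "some-org/llama-3.2-3b-lora"   # hypothetical adapter ID; semicolons separate multiple IDs
```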
Quantization options (`[models.quantization]`).

| Option | Description |
|---|---|
| `in_situ_quant` | ISQ level: `"4"`, `"8"`, `"q4_0"`, `"q4k"`, `"q6k"`, etc. |
| `from_uqff` | UQFF file(s) to load (semicolon-separated). Shards are auto-discovered: specifying the first shard (e.g., `q4k-0.uqff`) automatically finds `q4k-1.uqff`, etc. |
| `isq_organization` | ISQ strategy: `"default"` or `"moqe"` |
| `imatrix` | imatrix file for enhanced quantization |
| `calibration_file` | Calibration file for imatrix generation |
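For example, a sketch that loads a pre-quantized UQFF file instead of quantizing in place (the file name is a placeholder; only the first shard needs to be listed):

```toml
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

[models.quantization]
from_uqff = "q4k-0.uqff"   # placeholder; remaining shards (q4k-1.uqff, ...) are auto-discovered
```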
Device and memory mapping options (`[models.device]`).

| Option | Default | Description |
|---|---|---|
| `cpu` | `false` | Force CPU-only execution (must be consistent across all models) |
| `device_layers` | auto | Layer mapping, e.g., `["0:10", "1:20"]` (format: `ORD:NUM;...`) |
| `topology` | none | Topology YAML file |
| `hf_cache` | none | Custom HuggingFace cache directory |
| `max_seq_len` | 4096 | Max sequence length for automatic device mapping |
| `max_batch_size` | 1 | Max batch size for automatic device mapping |
Vision model options (`[models.vision]`).

| Option | Description |
|---|---|
| `max_edge` | Maximum edge length for image resizing |
| `max_num_images` | Maximum images per request |
| `max_image_length` | Maximum image dimension for device mapping |
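A sketch of a vision model with image limits set; the numeric values here are illustrative, not defaults:

```toml
[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"

[models.vision]
max_edge = 1024          # illustrative: cap the edge length used when resizing images
max_num_images = 4       # at most 4 images per request
max_image_length = 1024  # illustrative: image dimension bound used for device mapping
```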
command = "serve"
[global]
seed = 42
[server]
host = "0.0.0.0"
port = 1234
ui = true
[runtime]
max_seqs = 32
enable_search = true
search_embedding_model = "embedding-gemma"
[paged_attn]
mode = "auto"
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
dtype = "auto"
[models.quantization]
in_situ_quant = "q4k"
[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"
[models.vision]
max_num_images = 4
[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"
command = "run"
enable_thinking = true
[runtime]
max_seqs = 16
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
command = "serve"
[server]
port = 1234
[[models]]
kind = "text"
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
[models.format]
format = "gguf"
quantized_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
command = "serve"
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.1-70B-Instruct"
[models.device]
device_layers = ["0:40", "1:40"]
[models.quantization]
in_situ_quant = "q4k"
Constraints:

- `cpu` must be consistent across all models if specified.
- `default_model_id` (serve only) must match a `model_id` in `[[models]]`.
- `search_embedding_model` requires `enable_search = true`.