mistralrs-cli TOML Config

mistralrs-cli can run entirely from a single TOML configuration file. This config supports multiple models and mirrors the CLI options.

Usage

```bash
mistralrs from-config --file path/to/config.toml
```

Quick Example

command = "serve"

[server]
port = 1234
ui = true

[runtime]
max_seqs = 32

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

[models.quantization]
in_situ_quant = "q4k"

Complete Reference

Top-Level Options

| Option | Commands | Description |
|--------|----------|-------------|
| command | all | Required. Either "serve" or "run" |
| enable_thinking | run | Enable thinking mode (default: false) |
| default_model_id | serve | Default model ID for API requests (must match a model_id in [[models]]) |
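
For example, a minimal serve config that pins the default model for API requests might look like this (the model ID is illustrative):

```toml
command = "serve"

# Must match the model_id of one of the [[models]] entries below
default_model_id = "Qwen/Qwen3-4B"

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
```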

[global] Section

Global options that apply to the entire run.

| Option | Default | Description |
|--------|---------|-------------|
| seed | none | Random seed for reproducibility |
| log | none | Log all requests/responses to this file path |
| token_source | "cache" | HuggingFace auth: "cache", "none", "literal:<token>", "env:<var>", "path:<file>" |
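
A sketch of a [global] block using these options; the log path and environment variable name are placeholders:

```toml
[global]
seed = 42
log = "requests.log"          # write all requests/responses to this file
token_source = "env:HF_TOKEN" # read the HuggingFace token from an env var
```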

[server] Section (serve only)

HTTP server configuration.

| Option | Default | Description |
|--------|---------|-------------|
| port | 1234 | HTTP server port |
| host | "0.0.0.0" | Bind address |
| ui | false | Serve built-in web UI at /ui |
| mcp_port | none | MCP protocol server port (enables MCP if set) |
| mcp_config | none | MCP client configuration file path |
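
For instance, to expose the web UI and also enable the MCP server on a separate port (the port numbers and config file name are illustrative):

```toml
[server]
host = "0.0.0.0"
port = 1234
ui = true               # web UI served at /ui
mcp_port = 4321         # setting a port enables the MCP protocol server
mcp_config = "mcp.json" # optional MCP client configuration file
```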

[runtime] Section

Runtime inference options.

| Option | Default | Description |
|--------|---------|-------------|
| max_seqs | 32 | Maximum concurrent sequences |
| no_kv_cache | false | Disable KV cache entirely |
| prefix_cache_n | 16 | Number of prefix caches to hold (0 to disable) |
| chat_template | none | Custom chat template file (.json or .jinja) |
| jinja_explicit | none | Explicit JINJA template override |
| enable_search | false | Enable web search |
| search_embedding_model | none | Embedding model for search (e.g., "embedding-gemma") |
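
A [runtime] block combining several of these options might look like the following; the template path is a placeholder:

```toml
[runtime]
max_seqs = 64
prefix_cache_n = 16
chat_template = "templates/custom.jinja"   # custom chat template override
enable_search = true
search_embedding_model = "embedding-gemma" # required embedding model for search
```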

[paged_attn] Section

PagedAttention configuration.

| Option | Default | Description |
|--------|---------|-------------|
| mode | "auto" | "auto" (CUDA on, Metal off), "on", or "off" |
| context_len | none | Allocate KV cache for this context length |
| memory_mb | none | GPU memory to allocate in MB (conflicts with context_len) |
| memory_fraction | none | GPU memory utilization 0.0-1.0 (conflicts with above) |
| block_size | 32 | Tokens per block |
| cache_type | "auto" | KV cache type |

Note: context_len, memory_mb, and memory_fraction are mutually exclusive. If none is specified, allocation defaults to 90% of available VRAM.
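
As a sketch, pick exactly one of the three sizing options; for example, a fixed memory budget (the value is illustrative):

```toml
[paged_attn]
mode = "on"
memory_mb = 8192   # fixed KV-cache budget; omit context_len and memory_fraction
block_size = 32
```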

[[models]] Section

Define one or more models. Each [[models]] entry loads an additional model.

Top-Level Model Options

| Option | Required | Description |
|--------|----------|-------------|
| kind | yes | Model type: "auto", "text", "vision", "diffusion", "speech", "embedding" |
| model_id | yes | HuggingFace model ID or local path |
| tokenizer | no | Path to local tokenizer.json |
| arch | no | Model architecture (auto-detected if not specified) |
| dtype | no | Data type: "auto", "f16", "bf16", "f32" (default: "auto") |
| chat_template | no | Per-model chat template override |
| jinja_explicit | no | Per-model JINJA template override |
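
A single [[models]] entry overriding the tokenizer and data type could look like this; the tokenizer path is a placeholder:

```toml
[[models]]
kind = "text"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = "local/tokenizer.json"  # use a local tokenizer.json instead of the hub copy
dtype = "bf16"
```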

[models.format] - Model Format

| Option | Default | Description |
|--------|---------|-------------|
| format | auto | "plain" (safetensors), "gguf", or "ggml" |
| quantized_file | none | Quantized filename(s) for GGUF/GGML (semicolon-separated) |
| tok_model_id | none | Model ID for tokenizer when using quantized format |
| gqa | 1 | GQA value for GGML models |
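
For a GGUF model split across multiple files, the semicolon-separated form described above would look roughly like this (repository and file names are placeholders):

```toml
[[models]]
kind = "text"
model_id = "some-org/some-model-GGUF"                  # placeholder GGUF repo

[models.format]
format = "gguf"
quantized_file = "model-00001.gguf;model-00002.gguf"   # semicolon-separated shards
tok_model_id = "some-org/some-model"                   # placeholder tokenizer source
```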

[models.adapter] - LoRA/X-LoRA

| Option | Description |
|--------|-------------|
| lora | LoRA adapter ID(s), semicolon-separated |
| xlora | X-LoRA adapter ID (conflicts with lora) |
| xlora_order | X-LoRA ordering JSON file (requires xlora) |
| tgt_non_granular_index | Target non-granular index for X-LoRA |
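
A sketch of attaching LoRA adapters to a base model; the adapter IDs are placeholders:

```toml
[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"

[models.adapter]
# Two adapters, semicolon-separated; use xlora instead for X-LoRA (the two conflict)
lora = "my-org/adapter-a;my-org/adapter-b"
```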

[models.quantization] - ISQ/UQFF

| Option | Description |
|--------|-------------|
| in_situ_quant | ISQ level: "4", "8", "q4_0", "q4k", "q6k", etc. |
| from_uqff | UQFF file(s) to load (semicolon-separated) |
| isq_organization | ISQ strategy: "default" or "moqe" |
| imatrix | imatrix file for enhanced quantization |
| calibration_file | Calibration file for imatrix generation |
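
To load pre-quantized UQFF weights instead of quantizing in place, something like the following should work; the file name is a placeholder:

```toml
[models.quantization]
from_uqff = "model-q4k.uqff"   # placeholder UQFF file; semicolon-separate multiple files
```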

[models.device] - Device Mapping

| Option | Default | Description |
|--------|---------|-------------|
| cpu | false | Force CPU-only (must be consistent across all models) |
| device_layers | auto | Layer mapping, e.g., ["0:10", "1:20"] (format: "ORD:NUM") |
| topology | none | Topology YAML file |
| hf_cache | none | Custom HuggingFace cache directory |
| max_seq_len | 4096 | Max sequence length for auto device mapping |
| max_batch_size | 1 | Max batch size for auto device mapping |
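
A sketch of tuning automatic device mapping and pointing at a custom cache directory; the path and limits are illustrative:

```toml
[models.device]
hf_cache = "/data/hf-cache"   # placeholder custom HuggingFace cache directory
max_seq_len = 8192            # sequence length assumed by auto device mapping
max_batch_size = 4
```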

[models.vision] - Vision Options

| Option | Description |
|--------|-------------|
| max_edge | Maximum edge length for image resizing |
| max_num_images | Maximum images per request |
| max_image_length | Maximum image dimension for device mapping |
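
A vision model entry limiting image size and count might look like this; the limits are illustrative:

```toml
[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"

[models.vision]
max_edge = 1024      # maximum edge length used when resizing images
max_num_images = 2   # maximum images per request
```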

Full Examples

Multi-Model Server with UI

command = "serve"

[global]
seed = 42

[server]
host = "0.0.0.0"
port = 1234
ui = true

[runtime]
max_seqs = 32
enable_search = true
search_embedding_model = "embedding-gemma"

[paged_attn]
mode = "auto"

[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
dtype = "auto"

[models.quantization]
in_situ_quant = "q4k"

[[models]]
kind = "vision"
model_id = "Qwen/Qwen2-VL-2B-Instruct"

[models.vision]
max_num_images = 4

[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"

Interactive Mode with Thinking

command = "run"
enable_thinking = true

[runtime]
max_seqs = 16

[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"

GGUF Model

command = "serve"

[server]
port = 1234

[[models]]
kind = "text"
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"

[models.format]
format = "gguf"
quantized_file = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
tok_model_id = "mistralai/Mistral-7B-Instruct-v0.1"

Device Layer Mapping

command = "serve"

[[models]]
kind = "auto"
model_id = "meta-llama/Llama-3.1-70B-Instruct"

[models.device]
device_layers = ["0:40", "1:40"]

[models.quantization]
in_situ_quant = "q4k"

Notes

- cpu must be consistent across all models if specified
- default_model_id (serve only) must match a model_id in [[models]]
- search_embedding_model requires enable_search = true