CLI TOML configuration

`mistralrs from-config -f <path>` reads its configuration from a TOML file. The top level is tagged by the `command` field, which selects either `serve` or `run`.

```toml
command = "serve"

[server]
host = "0.0.0.0"
port = 1234

[[models]]
kind = "text"
model_id = "Qwen/Qwen3-4B"

[models.quantization]
in_situ_quant = "4"
```

With the config above saved as `this.toml`, `mistralrs from-config -f this.toml` starts the server.
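For comparison, a config for the other mode might look like the sketch below. It assumes that `run` needs only a `[[models]]` entry and no `[server]` table, which this page does not state explicitly; the model id is reused from the example above.

```toml
# Sketch only: assumes "run" accepts the same [[models]] shape as "serve".
command = "run"

[[models]]
kind = "text"
model_id = "Qwen/Qwen3-4B"
```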

| Field | Type | Required | Purpose |
| --- | --- | --- | --- |
| `command` | string | yes | `"serve"` or `"run"`. |
| `default_model_id` | string | no (serve only) | Model id treated as the default. Must match one of the `[[models]]` entries. |

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| `seed` | int | not set | Sampling seed. |
| `log` | path | not set | Log file for requests/responses. |
| `token_source` | string | `cache` | Token source string (`literal:<token>`, `env:<var>`, `path:<file>`, `cache`, `none`). |
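As an illustration, the optional fields above (`seed`, `log`, `token_source`) could be set as in the sketch below. The table name `[global]` is an assumption, since the section heading is not preserved on this page; the seed value, log path, and environment variable name are placeholders.

```toml
# Assumption: these fields live in a [global] table; check the schema of your version.
[global]
seed = 42                      # placeholder sampling seed
log = "requests.log"           # placeholder log file path
token_source = "env:HF_TOKEN"  # env:<var> form; HF_TOKEN is a placeholder variable name
```

The other documented forms follow the same pattern, for example `"path:<file>"` to read the token from a file or `"none"` to disable it.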

Most CLI runtime flags map to fields in the `[runtime]` table. Notable ones:

| Field | Default | Purpose |
| --- | --- | --- |
| `agent` | `false` | Shortcut for `enable_search = true` + `enable_code_execution = true`. Equivalent to `--agent`/`--agentic` on the CLI. |
| `enable_search` | `false` | Enable web search tool. |
| `search_embedding_model` | not set | Search embedding model (`embedding-gemma`). Requires `enable_search = true` (or `agent = true`). |
| `enable_code_execution` | `false` | Enable Python code execution. |
| `code_exec_python` | `python` on Windows, `python3` elsewhere | Python interpreter for code execution. Requires `enable_code_execution = true` (or `agent = true`). |
| `code_exec_workdir` | per-session temp dir | Code execution working directory. Requires `enable_code_execution = true` (or `agent = true`). |
| `code_exec_timeout` | `30` | Code execution timeout (seconds). Requires `enable_code_execution = true` (or `agent = true`). |
| `code_exec_permission` | `auto` | `auto`, `ask`, or `deny`. Requires `enable_code_execution = true` (or `agent = true`). |
| `max_seqs` | `32` | Max concurrent sequences. |
| `prefix_cache_n` | `16` | Prefix caches retained. |
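A `[runtime]` block that turns on both tools through the `agent` shortcut and tightens code execution might look like the sketch below. The table name follows the full example later on this page; the specific values are illustrative, not recommendations.

```toml
[runtime]
agent = true                   # shortcut for enable_search + enable_code_execution
code_exec_permission = "ask"   # one of: auto, ask, deny
code_exec_timeout = 60         # seconds; the documented default is 30
max_seqs = 16                  # documented default is 32
prefix_cache_n = 8             # documented default is 16
```

Because `agent = true` satisfies the "requires" condition on the `code_exec_*` fields, `enable_code_execution` does not need to be set separately here.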

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| `host` | string | `0.0.0.0` | Bind address. |
| `port` | u16 | `1234` | TCP port. |
| `no_ui` | bool | `false` | Disable the built-in web UI (mounted at `/ui` by default). |
| `mcp_port` | u16 | not set | Enable MCP server on this port. |
| `mcp_config` | path | not set | MCP client configuration (outbound). |
| `max_tool_rounds` | int | not set | Cap on tool loop rounds. |
| `tool_dispatch_url` | string | not set | External URL for tool execution. |
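As a sketch of how these server fields combine, the block below binds to the loopback interface, disables the web UI, and enables the MCP server. The port numbers and the round cap are placeholder values chosen for illustration.

```toml
[server]
host = "127.0.0.1"     # loopback only, instead of the default 0.0.0.0
port = 8080            # documented default is 1234
no_ui = true           # do not mount the built-in web UI at /ui
mcp_port = 4321        # placeholder port for the MCP server
max_tool_rounds = 8    # placeholder cap on tool loop rounds
```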

| Field | Default | Purpose |
| --- | --- | --- |
| `mode` | `auto` | `auto`, `on`, or `off`. |
| `context_len` | not set | KV cache context length. |
| `memory_mb` | not set | KV cache budget in MB. |
| `memory_fraction` | not set | KV cache budget as fraction of VRAM. |
| `block_size` | not set | Tokens per block. |
| `cache_type` | `auto` | KV cache quantization type. |
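The KV cache fields above could be combined as in the sketch below. The table name `[paged_attn]` is an assumption, since the section heading is not preserved on this page, and the numeric values are placeholders. Only one sizing field is set here, on the assumption that `context_len`, `memory_mb`, and `memory_fraction` are alternative ways to size the cache.

```toml
# Assumption: the table is named [paged_attn]; check the schema of your version.
[paged_attn]
mode = "on"            # one of: auto, on, off
context_len = 8192     # placeholder KV cache context length
block_size = 32        # placeholder tokens per block
```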

Each `[[models]]` entry defines one loaded model.

| Field | Type | Required | Purpose |
| --- | --- | --- | --- |
| `kind` | enum | yes | `auto`, `text`, `multimodal`, `diffusion`, `speech`, or `embedding`. |
| `model_id` | string | yes | Hugging Face id or local path. |
| `tokenizer` | path | no | Local `tokenizer.json`. |
| `arch` | enum | no | Architecture override (text models). |
| `dtype` | enum | no | `auto`, `f16`, `bf16`, `f32`. |
| `chat_template` | path | no | Chat template override. |
| `jinja_explicit` | path | no | Inline Jinja override. |
| `matformer_config_path` | path | no | Path to a MatFormer slice config (CSV/JSON). |
| `matformer_slice_name` | string | no | MatFormer slice to load. |

Each entry can also carry nested sections: `[models.format]`, `[models.adapter]`, `[models.quantization]`, `[models.device]`, and `[models.multimodal]`. Their fields mirror the corresponding CLI flags. If `cpu` is set in `[models.device]`, it must have the same value for every entry.
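A single entry that uses a few of the optional fields together with one nested section might look like the sketch below. The tokenizer path is a placeholder, and only fields named on this page are used.

```toml
[[models]]
kind = "text"
model_id = "Qwen/Qwen3-4B"
dtype = "bf16"                          # one of: auto, f16, bf16, f32
tokenizer = "/path/to/tokenizer.json"   # placeholder local path

# Nested sections attach to the most recent [[models]] entry.
[models.quantization]
in_situ_quant = "4"
```

In TOML, a `[models.<section>]` header applies to the last `[[models]]` element declared above it, so each model's nested sections should directly follow that model's entry.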

```toml
command = "serve"
default_model_id = "Qwen/Qwen3-4B"

[server]
host = "0.0.0.0"
port = 1234

[runtime]
enable_search = true
search_embedding_model = "embedding-gemma"

[[models]]
kind = "text"
model_id = "Qwen/Qwen3-4B"

[models.quantization]
in_situ_quant = "4"

[[models]]
kind = "multimodal"
model_id = "google/gemma-4-E4B-it"

[models.quantization]
in_situ_quant = "4"
```

Configs are validated at startup. An invalid config aborts startup with a message identifying the problem. The checks include:

  • At least one entry in [[models]].
  • default_model_id matches a model_id in [[models]].
  • cpu is consistent across all models when set.
  • search_embedding_model requires enable_search = true (or agent = true).
  • code_exec_python, code_exec_timeout, code_exec_workdir, and code_exec_permission each require enable_code_execution = true (or agent = true).
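To make the checks concrete, the sketch below shows a config that would be expected to fail validation under the rules listed above. The model id `example/other-model` is a hypothetical placeholder, and the error text is omitted because this page does not specify it.

```toml
command = "serve"
default_model_id = "example/other-model"    # fails: matches no model_id in [[models]]

[runtime]
search_embedding_model = "embedding-gemma"  # fails: neither enable_search nor agent is true
code_exec_timeout = 10                      # fails: neither enable_code_execution nor agent is true

[[models]]
kind = "text"
model_id = "Qwen/Qwen3-4B"
```

Setting `default_model_id = "Qwen/Qwen3-4B"` and adding `agent = true` under `[runtime]` would address all three problems.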