mistralrs serve

Start HTTP/MCP server and (optionally) the UI at /ui

mistralrs serve [OPTIONS] [COMMAND]
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` |  | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` |  | Path to local tokenizer.json file |
| `-a, --arch <ARCH>` |  | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | auto | Model data type |
| `--format <FORMAT>` |  | Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: plain, gguf, ggml. |
| `-f, --quantized-file <QUANTIZED_FILE>` |  | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` |  | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | 1 | GQA value for GGML models |
| `--lora <LORA>` |  | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` |  | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` |  | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` |  | Target non-granular index for X-LoRA |
| `--quant <QUANT>` |  | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob. |
| `--isq <IN_SITU_QUANT>` |  | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1") |
| `--from-uqff <FROM_UQFF>` |  | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations. |
| `--isq-organization <ISQ_ORGANIZATION>` |  | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` |  | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` |  | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` |  | Device layer mapping (format: ORD:NUM;…, e.g., "0:10;1:20"). Omit for automatic device mapping. |
| `--topology <TOPOLOGY>` |  | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` |  | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | auto | PagedAttention mode. auto: enabled on CUDA, disabled on Metal/CPU (default); on: force enable (fails if unsupported); off: force disable. Possible values: auto, on, off. |
| `--pa-context-len <CONTEXT_LEN>` |  | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM. |
| `--pa-memory-mb <MEMORY_MB>` |  | GPU memory to allocate in MB (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` |  | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` |  | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | auto | KV cache quantization type |
| `--max-edge <MAX_EDGE>` |  | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` |  | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` |  | Maximum image dimension for device mapping |
| `-p, --port <PORT>` | 1234 | HTTP server port |
| `--host <HOST>` | 0.0.0.0 | Bind address |
| `--no-ui` | false | Disable the built-in web UI (served at /ui by default) |
| `--mcp-port <MCP_PORT>` |  | Also expose the loaded model as an MCP server on this port (JSON-RPC 2.0 at POST /mcp) |
| `--max-tool-rounds <MAX_TOOL_ROUNDS>` |  | Default maximum tool-call rounds for the agentic loop. Per-request values from the HTTP API override this. Safety cap: 256 if unset. |
| `--tool-dispatch-url <TOOL_DISPATCH_URL>` |  | URL to POST tool calls to for server-side execution. For security, this is only configurable server-side (not per-request via HTTP API). |
| `--max-seqs <MAX_SEQS>` | 32 | Maximum concurrent sequences |
| `--no-kv-cache` | false | Disable KV cache entirely |
| `--prefix-cache-n <PREFIX_CACHE_N>` | 16 | Number of prefix caches to hold (0 to disable) |
| `-c, --chat-template <CHAT_TEMPLATE>` |  | Custom chat template file (.json or .jinja) |
| `-j, --jinja-explicit <JINJA_EXPLICIT>` |  | Explicit JINJA template override |
| `--matformer-config-path <MATFORMER_CONFIG_PATH>` |  | Path to a MatFormer config (CSV/JSON describing available slices). See model card. |
| `--matformer-slice-name <MATFORMER_SLICE_NAME>` |  | MatFormer slice to load (must match a slice name in the config file) |
| `--mtp-model <MTP_MODEL>` |  | MTP assistant model ID or path |
| `--mtp-n-predict <MTP_N_PREDICT>` |  | Number of MTP draft tokens to propose per target step |
| `--mcp-config <MCP_CONFIG>` |  | Path to an MCP client configuration JSON. Also reads MCP_CONFIG_PATH if unset. |
| `--agent` | false | Build a local agent: enables web search and Python code execution, runs the agentic tool loop with a per-session temp workdir. Equivalent to passing --enable-search --enable-code-execution together. |
| `--enable-search` | false | Enable web search (requires embedding model) |
| `--search-embedding-model <SEARCH_EMBEDDING_MODEL>` |  | Search embedding model to use. Requires --enable-search or --agent. Possible values: embedding-gemma. |
| `--enable-code-execution` | false | Enable Python code execution tool (WARNING: allows arbitrary code execution) |
| `--code-exec-python <CODE_EXEC_PYTHON>` |  | Python interpreter path for code execution. Requires code execution to be on (via --enable-code-execution or --agent). Defaults to python3. |
| `--code-exec-timeout <CODE_EXEC_TIMEOUT>` |  | Code execution timeout in seconds (default: 30). Requires code execution to be on. |
| `--code-exec-workdir <CODE_EXEC_WORKDIR>` |  | Working directory for code execution. Defaults to a temp dir; use "." for cwd. Requires code execution to be on. |
| `--agent-permission <PERMISSION>` | auto | Agent action permission mode. Possible values: auto, ask, deny. |
| `--sandbox <MODE>` | auto | Sandbox mode. Possible values: auto, on, off. |
| `--sb-max-memory-mb <MEMORY_MB>` |  | Per-session memory cap in MiB (default: 2048) |
| `--sb-max-cpu-secs <CPU_SECS>` |  | Per-session CPU time cap in seconds (default: 300) |
| `--sb-max-procs <PROCS>` |  | Per-session process/thread cap (default: 64) |
| `--sandbox-network <NETWORK>` | loopback | Network access permitted to the sandboxed session. Possible values: none, loopback, full. |
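
As a rough illustration of how the server-level options combine, the sketch below shows two invocations: one that binds to loopback instead of the default 0.0.0.0 and exposes an MCP endpoint on a second port, and one that turns on the agent with a stricter sandbox. The model ID `org/model-name`, the port numbers, and the limits are placeholders chosen for illustration, not recommendations. The `run` helper is a hypothetical dry-run wrapper that only prints each command; drop it to launch the server.

```shell
# Placeholder model ID (an assumption for illustration); substitute a real one.
MODEL_ID="org/model-name"

# Dry-run helper: prints the command line instead of executing it.
run() { printf '%s\n' "$*"; }

# HTTP API on 127.0.0.1:8080, MCP (JSON-RPC 2.0 at POST /mcp) on port 8081,
# built-in web UI disabled, at most 8 concurrent sequences.
run mistralrs serve \
  --model-id "$MODEL_ID" \
  --host 127.0.0.1 \
  --port 8080 \
  --mcp-port 8081 \
  --no-ui \
  --max-seqs 8

# Agent mode (web search plus Python code execution) with the sandbox forced on,
# no network access for the sandboxed session, and a 10-second execution timeout.
run mistralrs serve \
  --model-id "$MODEL_ID" \
  --agent \
  --search-embedding-model embedding-gemma \
  --agent-permission ask \
  --sandbox on \
  --sandbox-network none \
  --code-exec-timeout 10
```

Because `--host` defaults to 0.0.0.0, binding to 127.0.0.1 is worth considering whenever code execution is enabled and the server does not need to be reachable from other machines.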

Auto-detect model type (recommended)

mistralrs serve auto [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` |  | Path to local tokenizer.json file |
| `-a, --arch <ARCH>` |  | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | auto | Model data type |
| `--format <FORMAT>` |  | Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: plain, gguf, ggml. |
| `-f, --quantized-file <QUANTIZED_FILE>` |  | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` |  | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | 1 | GQA value for GGML models |
| `--lora <LORA>` |  | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` |  | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` |  | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` |  | Target non-granular index for X-LoRA |
| `--quant <QUANT>` |  | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob. |
| `--isq <IN_SITU_QUANT>` |  | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1") |
| `--from-uqff <FROM_UQFF>` |  | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations. |
| `--isq-organization <ISQ_ORGANIZATION>` |  | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` |  | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` |  | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` |  | Device layer mapping (format: ORD:NUM;…, e.g., "0:10;1:20"). Omit for automatic device mapping. |
| `--topology <TOPOLOGY>` |  | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` |  | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | auto | PagedAttention mode. auto: enabled on CUDA, disabled on Metal/CPU (default); on: force enable (fails if unsupported); off: force disable. Possible values: auto, on, off. |
| `--pa-context-len <CONTEXT_LEN>` |  | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM. |
| `--pa-memory-mb <MEMORY_MB>` |  | GPU memory to allocate in MB (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` |  | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` |  | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | auto | KV cache quantization type |
| `--max-edge <MAX_EDGE>` |  | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` |  | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` |  | Maximum image dimension for device mapping |
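
The three quantization options are easy to confuse, so here is a minimal sketch of each used on its own with the `auto` subcommand. The model ID is a placeholder, the `run` helper is a hypothetical dry-run wrapper that only prints the command, and the sketch assumes you would normally pick one of the three flags instead of combining them.

```shell
# Placeholder model ID (an assumption for illustration); substitute a real one.
MODEL_ID="org/model-name"

# Dry-run helper: prints the command line instead of executing it.
run() { printf '%s\n' "$*"; }

# --quant: level 4, preferring a prebuilt UQFF and falling back to ISQ.
run mistralrs serve auto --model-id "$MODEL_ID" --quant 4

# --isq: request in-situ quantization explicitly, here by ISQ type name.
run mistralrs serve auto --model-id "$MODEL_ID" --isq q4_0

# --from-uqff: load a specific UQFF file; naming the first shard is enough,
# since later shards (q4k-1.uqff, ...) are discovered automatically.
run mistralrs serve auto --model-id "$MODEL_ID" --from-uqff q4k-0.uqff
```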

Text generation model with explicit configuration

mistralrs serve text [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` |  | Path to local tokenizer.json file |
| `-a, --arch <ARCH>` |  | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | auto | Model data type |
| `--format <FORMAT>` |  | Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: plain, gguf, ggml. |
| `-f, --quantized-file <QUANTIZED_FILE>` |  | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` |  | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | 1 | GQA value for GGML models |
| `--lora <LORA>` |  | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` |  | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` |  | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` |  | Target non-granular index for X-LoRA |
| `--quant <QUANT>` |  | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob. |
| `--isq <IN_SITU_QUANT>` |  | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1") |
| `--from-uqff <FROM_UQFF>` |  | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations. |
| `--isq-organization <ISQ_ORGANIZATION>` |  | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` |  | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` |  | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` |  | Device layer mapping (format: ORD:NUM;…, e.g., "0:10;1:20"). Omit for automatic device mapping. |
| `--topology <TOPOLOGY>` |  | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` |  | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | auto | PagedAttention mode. auto: enabled on CUDA, disabled on Metal/CPU (default); on: force enable (fails if unsupported); off: force disable. Possible values: auto, on, off. |
| `--pa-context-len <CONTEXT_LEN>` |  | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM. |
| `--pa-memory-mb <MEMORY_MB>` |  | GPU memory to allocate in MB (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` |  | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` |  | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | auto | KV cache quantization type |
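
Several options take semicolon-separated lists, and an unquoted `;` ends a command in most shells, so those values need quoting. The sketch below loads a GGUF model split across two files with the `text` subcommand. The repository names and file names are hypothetical placeholders, and `run` is a dry-run helper that only prints the arguments as the program would receive them.

```shell
# Hypothetical repositories and file names (assumptions for illustration).
GGUF_REPO="org/model-name-GGUF"
TOK_REPO="org/model-name"

# Dry-run helper: prints the command line instead of executing it.
run() { printf '%s\n' "$*"; }

# The quotes around the file list keep the shell from treating ";" as a
# command separator, so both filenames reach --quantized-file as one value.
run mistralrs serve text \
  --format gguf \
  --model-id "$GGUF_REPO" \
  --quantized-file "model-part1.gguf;model-part2.gguf" \
  --tok-model-id "$TOK_REPO"
```

The same quoting applies to `--lora`, `--from-uqff`, and `--device-layers` whenever more than one entry is given.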

Multimodal model

mistralrs serve multimodal [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` |  | Path to local tokenizer.json file |
| `-a, --arch <ARCH>` |  | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | auto | Model data type |
| `--format <FORMAT>` |  | Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: plain, gguf, ggml. |
| `-f, --quantized-file <QUANTIZED_FILE>` |  | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` |  | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | 1 | GQA value for GGML models |
| `--lora <LORA>` |  | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` |  | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` |  | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` |  | Target non-granular index for X-LoRA |
| `--quant <QUANT>` |  | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob. |
| `--isq <IN_SITU_QUANT>` |  | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1") |
| `--from-uqff <FROM_UQFF>` |  | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations. |
| `--isq-organization <ISQ_ORGANIZATION>` |  | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` |  | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` |  | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` |  | Device layer mapping (format: ORD:NUM;…, e.g., "0:10;1:20"). Omit for automatic device mapping. |
| `--topology <TOPOLOGY>` |  | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` |  | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | auto | PagedAttention mode. auto: enabled on CUDA, disabled on Metal/CPU (default); on: force enable (fails if unsupported); off: force disable. Possible values: auto, on, off. |
| `--pa-context-len <CONTEXT_LEN>` |  | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM. |
| `--pa-memory-mb <MEMORY_MB>` |  | GPU memory to allocate in MB (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` |  | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` |  | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | auto | KV cache quantization type |
| `--max-edge <MAX_EDGE>` |  | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` |  | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` |  | Maximum image dimension for device mapping |
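
For a multimodal model, the image-related limits can be set at launch. The following is a minimal sketch: the model ID and the specific limits (1024 and 4) are illustrative placeholders, not suggested values, and `run` is a hypothetical dry-run helper that only prints the command.

```shell
# Placeholder model ID (an assumption for illustration); substitute a real one.
MODEL_ID="org/multimodal-model"

# Dry-run helper: prints the command line instead of executing it.
run() { printf '%s\n' "$*"; }

# Resize images so the longest edge is at most 1024 (aspect ratio preserved)
# and accept at most 4 images per request.
run mistralrs serve multimodal \
  --model-id "$MODEL_ID" \
  --max-edge 1024 \
  --max-num-images 4
```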

Image generation model (diffusion)

mistralrs serve diffusion [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` |  | Path to local tokenizer.json file |
| `-a, --arch <ARCH>` |  | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | auto | Model data type |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` |  | Device layer mapping (format: ORD:NUM;…, e.g., "0:10;1:20"). Omit for automatic device mapping. |
| `--topology <TOPOLOGY>` |  | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` |  | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
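
The device-mapping options apply here as well. As a sketch, the two invocations below contrast an explicit layer mapping with CPU-only execution. The model ID and cache path are hypothetical, the mapping string is the one from the option description, and `run` is a dry-run helper that only prints the command.

```shell
# Placeholder model ID and cache directory (assumptions for illustration).
MODEL_ID="org/diffusion-model"
HF_CACHE_DIR="/tmp/hf-cache"

# Dry-run helper: prints the command line instead of executing it.
run() { printf '%s\n' "$*"; }

# Explicit mapping in ORD:NUM form; the value is quoted because it contains ";".
run mistralrs serve diffusion \
  --model-id "$MODEL_ID" \
  --device-layers "0:10;1:20" \
  --hf-cache "$HF_CACHE_DIR"

# Alternatively, skip device mapping and force CPU-only execution.
run mistralrs serve diffusion --model-id "$MODEL_ID" --cpu
```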

Speech synthesis model

mistralrs serve speech [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` |  | Path to local tokenizer.json file |
| `-a, --arch <ARCH>` |  | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | auto | Model data type |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` |  | Device layer mapping (format: ORD:NUM;…, e.g., "0:10;1:20"). Omit for automatic device mapping. |
| `--topology <TOPOLOGY>` |  | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` |  | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |

Embedding model

mistralrs serve embedding [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` |  | Path to local tokenizer.json file |
| `-a, --arch <ARCH>` |  | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | auto | Model data type |
| `--format <FORMAT>` |  | Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: plain, gguf, ggml. |
| `-f, --quantized-file <QUANTIZED_FILE>` |  | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` |  | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | 1 | GQA value for GGML models |
| `--quant <QUANT>` |  | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob. |
| `--isq <IN_SITU_QUANT>` |  | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1") |
| `--from-uqff <FROM_UQFF>` |  | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations. |
| `--isq-organization <ISQ_ORGANIZATION>` |  | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` |  | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` |  | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` |  | Device layer mapping (format: ORD:NUM;…, e.g., "0:10;1:20"). Omit for automatic device mapping. |
| `--topology <TOPOLOGY>` |  | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` |  | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | auto | PagedAttention mode. auto: enabled on CUDA, disabled on Metal/CPU (default); on: force enable (fails if unsupported); off: force disable. Possible values: auto, on, off. |
| `--pa-context-len <CONTEXT_LEN>` |  | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM. |
| `--pa-memory-mb <MEMORY_MB>` |  | GPU memory to allocate in MB (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` |  | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` |  | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | auto | KV cache quantization type |
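
The three PagedAttention sizing options are described as alternatives to one another, so a launch would typically set at most one of them. The sketch below shows each choice separately for the `embedding` subcommand. The model ID and the numeric values are illustrative placeholders, and `run` is a hypothetical dry-run helper that only prints the command.

```shell
# Placeholder model ID (an assumption for illustration); substitute a real one.
MODEL_ID="org/embedding-model"

# Dry-run helper: prints the command line instead of executing it.
run() { printf '%s\n' "$*"; }

# Option A: size the KV cache for a given context length.
run mistralrs serve embedding --model-id "$MODEL_ID" --paged-attn on --pa-context-len 8192

# Option B: give the KV cache a fixed budget in MB.
run mistralrs serve embedding --model-id "$MODEL_ID" --paged-attn on --pa-memory-mb 4096

# Option C: give the KV cache a fraction of GPU memory (0.0-1.0).
run mistralrs serve embedding --model-id "$MODEL_ID" --paged-attn on --pa-memory-fraction 0.8
```

Note that `--paged-attn on` fails if PagedAttention is unsupported on the device; leaving the mode at `auto` avoids that at the cost of it being disabled on Metal and CPU.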