mistralrs run

Run a model in interactive mode, or in one-shot mode with -i.

mistralrs run [OPTIONS] [COMMAND]

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local `tokenizer.json` file |
| `-a, --arch <ARCH>` | | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--format <FORMAT>` | | Model format: `plain` (safetensors), `gguf`, or `ggml`. Auto-detected if not specified |
| `-f, --quantized-file <QUANTIZED_FILE>` | | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` | | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | `1` | GQA value for GGML models |
| `--lora <LORA>` | | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` | | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` | | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` | | Target non-granular index for X-LoRA |
| `--quant <QUANT>` | | Quantization front door. Numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. `auto` is for `serve`, `run`, and `bench`; `tune` rejects it because `tune` is the recommender. Use `--isq` for the explicit knob |
| `--isq <IN_SITU_QUANT>` | | In-situ quantization level (e.g., `4`, `8`, `q4_0`, `q4_1`) |
| `--from-uqff <FROM_UQFF>` | | UQFF file(s) to load from. Accepts numeric shorthands (`2`, `3`, `4`, `5`, `6`, `8`) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds `q8_0-0.uqff` or `afq8-0.uqff`). Also accepts ISQ type names (e.g., `q4k`, `afq8`). Shards are auto-discovered: specifying the first shard (e.g., `q4k-0.uqff`) automatically finds `q4k-1.uqff`, etc. Use semicolons to separate different quantizations |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: `default` or `moqe` |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: `ORD:NUM;…`, e.g., `0:10;1:20`). Omit for automatic device mapping |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | `auto` | PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU; `on`: force enable (fails if unsupported); `off`: force disable |
| `--pa-context-len <CONTEXT_LEN>` | | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM |
| `--pa-memory-mb <MEMORY_MB>` | | GPU memory to allocate in MBs (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` | | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` | | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | `auto` | KV cache quantization type |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
| `--max-seqs <MAX_SEQS>` | `32` | Maximum concurrent sequences |
| `--no-kv-cache` | `false` | Disable KV cache entirely |
| `--prefix-cache-n <PREFIX_CACHE_N>` | `16` | Number of prefix caches to hold (0 to disable) |
| `-c, --chat-template <CHAT_TEMPLATE>` | | Custom chat template file (`.json` or `.jinja`) |
| `-j, --jinja-explicit <JINJA_EXPLICIT>` | | Explicit Jinja template override |
| `--matformer-config-path <MATFORMER_CONFIG_PATH>` | | Path to a MatFormer config (CSV/JSON describing available slices). See the model card |
| `--matformer-slice-name <MATFORMER_SLICE_NAME>` | | MatFormer slice to load (must match a slice name in the config file) |
| `--mtp-model <MTP_MODEL>` | | MTP assistant model ID or path |
| `--mtp-n-predict <MTP_N_PREDICT>` | | Number of MTP draft tokens to propose per target step |
| `--mcp-config <MCP_CONFIG>` | | Path to an MCP client configuration JSON. Also reads `MCP_CONFIG_PATH` if unset |
| `--agent` | `false` | Build a local agent: enables web search and Python code execution, and runs the agentic tool loop with a per-session temp workdir. Equivalent to passing `--enable-search --enable-code-execution` together |
| `--enable-search` | `false` | Enable web search (requires embedding model) |
| `--search-embedding-model <SEARCH_EMBEDDING_MODEL>` | | Search embedding model to use. Requires `--enable-search` or `--agent`. Possible values: `embedding-gemma` |
| `--enable-code-execution` | `false` | Enable Python code execution tool (WARNING: allows arbitrary code execution) |
| `--code-exec-python <CODE_EXEC_PYTHON>` | | Python interpreter path for code execution. Requires code execution to be on (via `--enable-code-execution` or `--agent`). Defaults to `python3` |
| `--code-exec-timeout <CODE_EXEC_TIMEOUT>` | | Code execution timeout in seconds (default: 30). Requires code execution to be on |
| `--code-exec-workdir <CODE_EXEC_WORKDIR>` | | Working directory for code execution. Defaults to a temp dir; use `.` for cwd. Requires code execution to be on |
| `--agent-permission <PERMISSION>` | `auto` | Agent action permission mode. Possible values: `auto`, `ask`, `deny` |
| `--sandbox <MODE>` | `auto` | Sandbox mode. Possible values: `auto`, `on`, `off` |
| `--sb-max-memory-mb <MEMORY_MB>` | | Per-session memory cap in MiB (default: 2048) |
| `--sb-max-cpu-secs <CPU_SECS>` | | Per-session CPU time cap in seconds (default: 300) |
| `--sb-max-procs <PROCS>` | | Per-session process/thread cap (default: 64) |
| `--sandbox-network <NETWORK>` | `loopback` | Network access permitted to the sandboxed session. Possible values: `none`, `loopback`, `full` |
| `--thinking <THINKING>` | | Control thinking mode for models that support it. Use `--thinking` or `--thinking true` to force on, `--thinking false` to force off. Omit to defer to the chat template default. Possible values: `true`, `false` |
| `-i, --input <INPUT>` | | One-shot text prompt. When provided, sends a single request and exits instead of entering interactive mode. Combine with `--image`, `--video`, or `--audio` for multimodal requests |
| `--image <IMAGE>` | | Image URL(s) or file path(s) to include in the request (requires `-i`). Can be specified multiple times: `--image img1.jpg --image img2.png` |
| `--video <VIDEO>` | | Video URL(s) or file path(s) to include in the request (requires `-i`). Can be specified multiple times: `--video vid1.mp4 --video vid2.webm` |
| `--audio <AUDIO>` | | Audio URL(s) or file path(s) to include in the request (requires `-i`). Can be specified multiple times: `--audio audio1.wav --audio audio2.mp3` |
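
As a rough illustration of the one-shot options listed above, the following sketch builds a `mistralrs run` command that sends a single prompt with one image attached. The model ID and image path are placeholders (assumptions, not values from this reference), and the command is executed only if `mistralrs` is found on `PATH`.

```shell
# Placeholder values (assumptions): substitute a multimodal model and an image you have.
MODEL_ID="${MODEL_ID:-your-org/your-multimodal-model}"
IMAGE="${IMAGE:-photo.jpg}"

# -i switches to one-shot mode; --image requires -i and may be repeated.
set -- mistralrs run -m "$MODEL_ID" --image "$IMAGE" -i "Describe this image."

# Print the command, then run it only when the CLI is installed.
echo "$*"
if command -v mistralrs >/dev/null 2>&1; then
  "$@"
fi
```

Dropping `-i` (and with it `--image`) from the argument list would start the interactive mode instead.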

Auto-detect model type (recommended)

mistralrs run auto [OPTIONS] --model-id <MODEL_ID>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local `tokenizer.json` file |
| `-a, --arch <ARCH>` | | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--format <FORMAT>` | | Model format: `plain` (safetensors), `gguf`, or `ggml`. Auto-detected if not specified |
| `-f, --quantized-file <QUANTIZED_FILE>` | | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` | | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | `1` | GQA value for GGML models |
| `--lora <LORA>` | | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` | | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` | | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` | | Target non-granular index for X-LoRA |
| `--quant <QUANT>` | | Quantization front door. Numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. `auto` is for `serve`, `run`, and `bench`; `tune` rejects it because `tune` is the recommender. Use `--isq` for the explicit knob |
| `--isq <IN_SITU_QUANT>` | | In-situ quantization level (e.g., `4`, `8`, `q4_0`, `q4_1`) |
| `--from-uqff <FROM_UQFF>` | | UQFF file(s) to load from. Accepts numeric shorthands (`2`, `3`, `4`, `5`, `6`, `8`) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds `q8_0-0.uqff` or `afq8-0.uqff`). Also accepts ISQ type names (e.g., `q4k`, `afq8`). Shards are auto-discovered: specifying the first shard (e.g., `q4k-0.uqff`) automatically finds `q4k-1.uqff`, etc. Use semicolons to separate different quantizations |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: `default` or `moqe` |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: `ORD:NUM;…`, e.g., `0:10;1:20`). Omit for automatic device mapping |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | `auto` | PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU; `on`: force enable (fails if unsupported); `off`: force disable |
| `--pa-context-len <CONTEXT_LEN>` | | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM |
| `--pa-memory-mb <MEMORY_MB>` | | GPU memory to allocate in MBs (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` | | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` | | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | `auto` | KV cache quantization type |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
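
A minimal sketch of the `auto` subcommand with the `--quant` front door described above, assuming subcommand options are passed after `auto` as in the usage line. The model ID is a placeholder (an assumption), and the command runs only when `mistralrs` is on `PATH`.

```shell
# Placeholder model ID (assumption): substitute a model you have access to.
MODEL_ID="${MODEL_ID:-your-org/your-model}"

# --quant 4 asks for a 4-bit level; per the option description it prefers a
# prebuilt UQFF and otherwise falls back to ISQ.
set -- mistralrs run auto --model-id "$MODEL_ID" --quant 4

# Print the command, then run it only when the CLI is installed.
echo "$*"
if command -v mistralrs >/dev/null 2>&1; then
  "$@"
fi
```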

Text generation model with explicit configuration

mistralrs run text [OPTIONS] --model-id <MODEL_ID>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local `tokenizer.json` file |
| `-a, --arch <ARCH>` | | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--format <FORMAT>` | | Model format: `plain` (safetensors), `gguf`, or `ggml`. Auto-detected if not specified |
| `-f, --quantized-file <QUANTIZED_FILE>` | | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` | | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | `1` | GQA value for GGML models |
| `--lora <LORA>` | | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` | | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` | | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` | | Target non-granular index for X-LoRA |
| `--quant <QUANT>` | | Quantization front door. Numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. `auto` is for `serve`, `run`, and `bench`; `tune` rejects it because `tune` is the recommender. Use `--isq` for the explicit knob |
| `--isq <IN_SITU_QUANT>` | | In-situ quantization level (e.g., `4`, `8`, `q4_0`, `q4_1`) |
| `--from-uqff <FROM_UQFF>` | | UQFF file(s) to load from. Accepts numeric shorthands (`2`, `3`, `4`, `5`, `6`, `8`) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds `q8_0-0.uqff` or `afq8-0.uqff`). Also accepts ISQ type names (e.g., `q4k`, `afq8`). Shards are auto-discovered: specifying the first shard (e.g., `q4k-0.uqff`) automatically finds `q4k-1.uqff`, etc. Use semicolons to separate different quantizations |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: `default` or `moqe` |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: `ORD:NUM;…`, e.g., `0:10;1:20`). Omit for automatic device mapping |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | `auto` | PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU; `on`: force enable (fails if unsupported); `off`: force disable |
| `--pa-context-len <CONTEXT_LEN>` | | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM |
| `--pa-memory-mb <MEMORY_MB>` | | GPU memory to allocate in MBs (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` | | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` | | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | `auto` | KV cache quantization type |
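
The GGUF-related options above could be combined roughly as follows. This is a sketch under stated assumptions: the repository, file name, and tokenizer model ID are hypothetical placeholders, and the command runs only when `mistralrs` is on `PATH`.

```shell
# Hypothetical placeholders (assumptions): a GGUF repository, one file in it,
# and the model whose tokenizer should be used.
GGUF_REPO="${GGUF_REPO:-your-org/your-model-GGUF}"
GGUF_FILE="${GGUF_FILE:-your-model.gguf}"
TOK_MODEL_ID="${TOK_MODEL_ID:-your-org/your-model}"

# --format gguf selects the format explicitly; -f names the quantized file;
# --tok-model-id points at the tokenizer source for the quantized model.
set -- mistralrs run text --model-id "$GGUF_REPO" --format gguf \
  --quantized-file "$GGUF_FILE" --tok-model-id "$TOK_MODEL_ID"

# Print the command, then run it only when the CLI is installed.
echo "$*"
if command -v mistralrs >/dev/null 2>&1; then
  "$@"
fi
```

If several quantized files are needed, the option description says to join them with semicolons; in that case the value must be quoted so the shell does not treat `;` as a command separator.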

Multimodal model

mistralrs run multimodal [OPTIONS] --model-id <MODEL_ID>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local `tokenizer.json` file |
| `-a, --arch <ARCH>` | | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--format <FORMAT>` | | Model format: `plain` (safetensors), `gguf`, or `ggml`. Auto-detected if not specified |
| `-f, --quantized-file <QUANTIZED_FILE>` | | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` | | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | `1` | GQA value for GGML models |
| `--lora <LORA>` | | LoRA adapter model ID(s), semicolon-separated for multiple |
| `--xlora <XLORA>` | | X-LoRA adapter model ID |
| `--xlora-order <XLORA_ORDER>` | | X-LoRA ordering JSON file |
| `--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>` | | Target non-granular index for X-LoRA |
| `--quant <QUANT>` | | Quantization front door. Numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. `auto` is for `serve`, `run`, and `bench`; `tune` rejects it because `tune` is the recommender. Use `--isq` for the explicit knob |
| `--isq <IN_SITU_QUANT>` | | In-situ quantization level (e.g., `4`, `8`, `q4_0`, `q4_1`) |
| `--from-uqff <FROM_UQFF>` | | UQFF file(s) to load from. Accepts numeric shorthands (`2`, `3`, `4`, `5`, `6`, `8`) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds `q8_0-0.uqff` or `afq8-0.uqff`). Also accepts ISQ type names (e.g., `q4k`, `afq8`). Shards are auto-discovered: specifying the first shard (e.g., `q4k-0.uqff`) automatically finds `q4k-1.uqff`, etc. Use semicolons to separate different quantizations |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: `default` or `moqe` |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: `ORD:NUM;…`, e.g., `0:10;1:20`). Omit for automatic device mapping |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | `auto` | PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU; `on`: force enable (fails if unsupported); `off`: force disable |
| `--pa-context-len <CONTEXT_LEN>` | | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM |
| `--pa-memory-mb <MEMORY_MB>` | | GPU memory to allocate in MBs (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` | | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` | | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | `auto` | KV cache quantization type |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
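
One way the image-related limits above might be used, as a sketch only: the model ID and the numeric values are illustrative assumptions, not recommended settings, and the command runs only when `mistralrs` is on `PATH`.

```shell
# Placeholder model ID (assumption): substitute a multimodal model you have.
MODEL_ID="${MODEL_ID:-your-org/your-multimodal-model}"

# --max-edge caps the longer image edge when resizing (aspect ratio preserved);
# --max-num-images caps how many images one request may carry.
# 1024 and 2 are illustrative values only.
set -- mistralrs run multimodal --model-id "$MODEL_ID" \
  --max-edge 1024 --max-num-images 2

# Print the command, then run it only when the CLI is installed.
echo "$*"
if command -v mistralrs >/dev/null 2>&1; then
  "$@"
fi
```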

Image generation model (diffusion)

mistralrs run diffusion [OPTIONS] --model-id <MODEL_ID>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local `tokenizer.json` file |
| `-a, --arch <ARCH>` | | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: `ORD:NUM;…`, e.g., `0:10;1:20`). Omit for automatic device mapping |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |

Speech synthesis model

mistralrs run speech [OPTIONS] --model-id <MODEL_ID>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local `tokenizer.json` file |
| `-a, --arch <ARCH>` | | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: `ORD:NUM;…`, e.g., `0:10;1:20`). Omit for automatic device mapping |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |

Embedding model

mistralrs run embedding [OPTIONS] --model-id <MODEL_ID>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local `tokenizer.json` file |
| `-a, --arch <ARCH>` | | Model architecture (auto-detected if not specified) |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--format <FORMAT>` | | Model format: `plain` (safetensors), `gguf`, or `ggml`. Auto-detected if not specified |
| `-f, --quantized-file <QUANTIZED_FILE>` | | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) |
| `--tok-model-id <TOK_MODEL_ID>` | | Model ID for tokenizer when using quantized format |
| `--gqa <GQA>` | `1` | GQA value for GGML models |
| `--quant <QUANT>` | | Quantization front door. Numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) and ISQ names prefer a prebuilt UQFF from `mistralrs-community/<model>-UQFF`, then fall back to ISQ. `auto` is for `serve`, `run`, and `bench`; `tune` rejects it because `tune` is the recommender. Use `--isq` for the explicit knob |
| `--isq <IN_SITU_QUANT>` | | In-situ quantization level (e.g., `4`, `8`, `q4_0`, `q4_1`) |
| `--from-uqff <FROM_UQFF>` | | UQFF file(s) to load from. Accepts numeric shorthands (`2`, `3`, `4`, `5`, `6`, `8`) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds `q8_0-0.uqff` or `afq8-0.uqff`). Also accepts ISQ type names (e.g., `q4k`, `afq8`). Shards are auto-discovered: specifying the first shard (e.g., `q4k-0.uqff`) automatically finds `q4k-1.uqff`, etc. Use semicolons to separate different quantizations |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: `default` or `moqe` |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: `ORD:NUM;…`, e.g., `0:10;1:20`). Omit for automatic device mapping |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `--paged-attn <MODE>` | `auto` | PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU; `on`: force enable (fails if unsupported); `off`: force disable |
| `--pa-context-len <CONTEXT_LEN>` | | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM |
| `--pa-memory-mb <MEMORY_MB>` | | GPU memory to allocate in MBs (alternative to context-len) |
| `--pa-memory-fraction <MEMORY_FRACTION>` | | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) |
| `--pa-block-size <BLOCK_SIZE>` | | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <CACHE_TYPE>` | `auto` | KV cache quantization type |
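
The manual device-layer mapping listed above uses semicolons, which a shell would otherwise read as command separators, so the value needs quoting. The sketch below reuses the `0:10;1:20` example from the option description; the model ID is a placeholder (an assumption), and the command runs only when `mistralrs` is on `PATH`.

```shell
# Placeholder model ID (assumption): substitute an embedding model you have.
MODEL_ID="${MODEL_ID:-your-org/your-embedding-model}"

# Quote the ORD:NUM;... value so the shell passes it through as one argument.
set -- mistralrs run embedding --model-id "$MODEL_ID" --device-layers "0:10;1:20"

# Print the command, then run it only when the CLI is installed.
echo "$*"
if command -v mistralrs >/dev/null 2>&1; then
  "$@"
fi
```

Omitting `--device-layers` leaves the mapping automatic, as the option description notes.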