mistralrs run

Run a model in interactive mode, or in one-shot mode with `-i`

mistralrs run [OPTIONS] [COMMAND]

Option	Default	Description
`-m, --model-id <MODEL_ID>`		HuggingFace model ID or local path to model directory
`-t, --tokenizer <TOKENIZER>`		Path to local tokenizer.json file
`-a, --arch <ARCH>`		Model architecture (auto-detected if not specified)
`--dtype <DTYPE>`	`auto`	Model data type
`--format <FORMAT>`		Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: `plain`, `gguf`, `ggml`.
`-f, --quantized-file <QUANTIZED_FILE>`		Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple)
`--tok-model-id <TOK_MODEL_ID>`		Model ID for tokenizer when using quantized format
`--gqa <GQA>`	`1`	GQA value for GGML models
`--enable-lora`	`false`	Enable dynamic LoRA without preloading an adapter. Supports text models. Qwen3.5/3.6 MoE requires automatic model selection; vision-tower adapters are unsupported
`--lora <ALIAS=SOURCE|JSON>`		Preload a language-model LoRA adapter as ALIAS=SOURCE. Remote adapters use revision main. May be repeated. Qwen3.5/3.6 MoE conditional-generation models require auto model selection; vision-tower adapters are unsupported
`--lora-max-adapters <LORA_MAX_ADAPTERS>`	`16`	Maximum loaded LoRA aliases and, independently, resident adapter generations
`--lora-max-rank <LORA_MAX_RANK>`	`256`	Maximum rank accepted for a LoRA adapter
`--lora-max-bytes <BYTES>`	`8589934592`	Maximum memory used by loaded adapters
`--legacy-lora <SOURCE>`		Legacy LoRA adapter source for a raw GGUF or GGML model
`--legacy-lora-order <LEGACY_LORA_ORDER>`		Ordering JSON file for a legacy raw GGUF or GGML LoRA adapter
`--xlora <XLORA>`		X-LoRA adapter model ID
`--xlora-order <XLORA_ORDER>`		X-LoRA ordering JSON file
`--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>`		Target non-granular index for X-LoRA
`--quant <QUANT>`		Quantization front-door: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.). This prefers prebuilt UQFF from `mistralrs-community/<model>-UQFF`, so use `--isq` if you do not want to switch to a prebuilt UQFF
`--isq <IN_SITU_QUANT>`		In-situ quantization: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.) and quantizes the selected model in-place (in-situ)
`--from-uqff <FROM_UQFF>`		UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations
`--isq-organization <ISQ_ORGANIZATION>`		ISQ organization strategy: default or moqe
`--imatrix <IMATRIX>`		imatrix file for enhanced quantization
`--calibration-file <CALIBRATION_FILE>`		Calibration file for imatrix generation
`--cpu`	`false`	Force CPU-only execution
`-n, --device-layers <DEVICE_LAYERS>`		Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`). Omit for automatic device mapping
`--topology <TOPOLOGY>`		Topology YAML file for device mapping
`--hf-cache <HF_CACHE>`		Custom HuggingFace cache directory
`--max-seq-len <MAX_SEQ_LEN>`	`4096`	Max sequence length for automatic device mapping
`--max-batch-size <MAX_BATCH_SIZE>`	`1`	Max batch size for automatic device mapping
`--paged-attn <MODE>`	`auto`	PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU. `on`: force enable (fails if unsupported). `off`: force disable. Possible values: `auto`, `on`, `off`.
`--pa-context-len <CONTEXT_LEN>`		Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM
`--pa-memory-mb <MEMORY_MB>`		GPU memory to allocate in MBs (alternative to context-len)
`--pa-memory-fraction <MEMORY_FRACTION>`		GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb)
`--pa-block-size <BLOCK_SIZE>`		Tokens per block (default: 32 on CUDA)
`--pa-cache-type <CACHE_TYPE>`	`auto`	KV cache quantization type
`--max-edge <MAX_EDGE>`		Maximum edge length for image resizing (aspect ratio preserved)
`--max-num-images <MAX_NUM_IMAGES>`		Maximum number of images per request
`--max-image-length <MAX_IMAGE_LENGTH>`		Maximum image dimension for device mapping
`--max-seqs <MAX_SEQS>`	`32`	Maximum concurrent sequences
`--no-kv-cache`	`false`	Disable KV cache entirely
`--prefix-cache-n <PREFIX_CACHE_N>`	`16`	Number of prefix caches to hold (0 to disable)
`-c, --chat-template <CHAT_TEMPLATE>`		Custom chat template file (.json or .jinja)
`-j, --jinja-explicit <JINJA_EXPLICIT>`		Explicit JINJA template override
`--matformer-config-path <MATFORMER_CONFIG_PATH>`		Path to a MatFormer config (CSV/JSON describing available slices). See model card
`--matformer-slice-name <MATFORMER_SLICE_NAME>`		MatFormer slice to load (must match a slice name in the config file)
`--mtp-model <MTP_MODEL>`		MTP assistant model id or path
`--mtp-n-predict <MTP_N_PREDICT>`		Number of MTP draft tokens to propose per target step
`--mcp-config <MCP_CONFIG>`		Path to an MCP client configuration JSON. Also reads `MCP_CONFIG_PATH` if unset
`--agent`	`false`	Build a local agent: enables web search, Python code execution, and shell execution, and runs the agentic tool loop with a per-session temp workdir. Equivalent to passing `--enable-search --enable-code-execution --enable-shell` together
`--enable-search`	`false`	Enable web search (requires embedding model)
`--search-embedding-model <SEARCH_EMBEDDING_MODEL>`		Search embedding model to use. Requires `--enable-search` or `--agent`. Possible values: `embedding-gemma`.
`--enable-code-execution`	`false`	Enable Python code execution tool (WARNING: allows arbitrary code execution)
`--enable-shell`	`false`	Enable shell execution tool (WARNING: allows arbitrary command execution)
`--code-exec-python <CODE_EXEC_PYTHON>`		Python interpreter path for code execution. Requires code execution to be on (via `--enable-code-execution` or `--agent`). Defaults to `python3`
`--code-exec-timeout <CODE_EXEC_TIMEOUT>`		Code execution timeout in seconds (default: 60). Requires code execution to be on
`--code-exec-workdir <CODE_EXEC_WORKDIR>`		Working directory for code execution. Defaults to a temp dir; use `.` for cwd. Requires code execution to be on
`--shell-path <SHELL_PATH>`		Shell executable path. Requires shell execution to be on. Defaults to /bin/sh
`--shell-timeout <SHELL_TIMEOUT>`		Shell execution timeout in seconds (default: 600). Requires shell execution to be on
`--shell-workdir <SHELL_WORKDIR>`		Root directory for per-session shell working directories. Defaults to temp dirs
`--skills-dir <SKILLS_DIR>`		Directory for uploaded OpenAI-compatible Skills. Defaults to the system temp directory
`--agent-permission <PERMISSION>`	`auto`	Agent action permission mode. Possible values: `auto`, `ask`, `deny`.
`--sandbox <MODE>`	`auto`	Sandbox mode. Possible values: `auto`, `on`, `off`.
`--sandbox-profile <PROFILE>`		Sandbox policy profile. Possible values: `restricted`, `developer`.
`--sb-max-memory-mb <MEMORY_MB>`		Per-session memory cap in MiB (default: 2048)
`--sb-max-cpu-secs <CPU_SECS>`		Per-session CPU time cap in seconds (default: 600). Raised to at least enabled code/shell timeouts
`--sb-max-procs <PROCS>`		Per-session process/thread cap (default: 64)
`--sandbox-network <NETWORK>`		Network access permitted to the sandboxed session. Possible values: `none`, `loopback`, `full`.
`--thinking <THINKING>`		Control thinking mode for models that support it. Use `--thinking` or `--thinking true` to force on, `--thinking false` to force off. Omit to defer to the chat template default. Possible values: `true`, `false`.
`-i, --input <INPUT>`		One-shot text prompt. When provided, sends a single request and exits instead of entering interactive mode. Combine with `--image`, `--video`, or `--audio` for multimodal requests
`--image <IMAGE>`		Image URL(s) or file path(s) to include in the request (requires `-i`). Can be specified multiple times: `--image img1.jpg --image img2.png`
`--video <VIDEO>`		Video URL(s) or file path(s) to include in the request (requires `-i`). Can be specified multiple times: `--video vid1.mp4 --video vid2.webm`
`--audio <AUDIO>`		Audio URL(s) or file path(s) to include in the request (requires `-i`). Can be specified multiple times: `--audio audio1.wav --audio audio2.mp3`
`--adapter <ADAPTER>`		LoRA adapter alias to use for requests. Omit to run the base model
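
As a rough illustration of one-shot mode, the sketch below assembles a `mistralrs run` command line from a model ID, a prompt, and any number of image paths, repeating `--image` once per file as the table describes. The helper name `one_shot_cmd`, the model ID, and the file names are placeholders invented for this example. The helper only prints the command so it can be reviewed first; it does not run it, and a prompt containing double quotes would need extra escaping.

```shell
#!/bin/sh
# Hypothetical helper: print a one-shot `mistralrs run` command line.
#   $1 = model ID or local path, $2 = prompt, remaining args = image paths
one_shot_cmd() {
    model=$1
    prompt=$2
    shift 2
    # -i switches from interactive mode to a single request.
    printf 'mistralrs run -m %s -i "%s"' "$model" "$prompt"
    # --image may be given multiple times, once per image.
    for img in "$@"; do
        printf ' --image %s' "$img"
    done
    printf '\n'
}

# Placeholder model ID and image files; substitute real ones.
one_shot_cmd "your-org/your-multimodal-model" "Compare these two images." img1.jpg img2.png
```

Copying the printed line into a terminal (or piping it to `sh`) would send the request and exit, assuming `mistralrs` is installed and the model supports image input.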

mistralrs run auto

Auto-detect model type (recommended)

mistralrs run auto [OPTIONS] --model-id <MODEL_ID>

Option	Default	Description
`-m, --model-id <MODEL_ID>`	required	HuggingFace model ID or local path to model directory
`-t, --tokenizer <TOKENIZER>`		Path to local tokenizer.json file
`-a, --arch <ARCH>`		Model architecture (auto-detected if not specified)
`--dtype <DTYPE>`	`auto`	Model data type
`--format <FORMAT>`		Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: `plain`, `gguf`, `ggml`.
`-f, --quantized-file <QUANTIZED_FILE>`		Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple)
`--tok-model-id <TOK_MODEL_ID>`		Model ID for tokenizer when using quantized format
`--gqa <GQA>`	`1`	GQA value for GGML models
`--enable-lora`	`false`	Enable dynamic LoRA without preloading an adapter. Supports text models. Qwen3.5/3.6 MoE requires automatic model selection; vision-tower adapters are unsupported
`--lora <ALIAS=SOURCE|JSON>`		Preload a language-model LoRA adapter as ALIAS=SOURCE. Remote adapters use revision main. May be repeated. Qwen3.5/3.6 MoE conditional-generation models require auto model selection; vision-tower adapters are unsupported
`--lora-max-adapters <LORA_MAX_ADAPTERS>`	`16`	Maximum loaded LoRA aliases and, independently, resident adapter generations
`--lora-max-rank <LORA_MAX_RANK>`	`256`	Maximum rank accepted for a LoRA adapter
`--lora-max-bytes <BYTES>`	`8589934592`	Maximum memory used by loaded adapters
`--legacy-lora <SOURCE>`		Legacy LoRA adapter source for a raw GGUF or GGML model
`--legacy-lora-order <LEGACY_LORA_ORDER>`		Ordering JSON file for a legacy raw GGUF or GGML LoRA adapter
`--xlora <XLORA>`		X-LoRA adapter model ID
`--xlora-order <XLORA_ORDER>`		X-LoRA ordering JSON file
`--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>`		Target non-granular index for X-LoRA
`--quant <QUANT>`		Quantization front-door: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.). This prefers prebuilt UQFF from `mistralrs-community/<model>-UQFF`, so use `--isq` if you do not want to switch to a prebuilt UQFF
`--isq <IN_SITU_QUANT>`		In-situ quantization: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.) and quantizes the selected model in-place (in-situ)
`--from-uqff <FROM_UQFF>`		UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations
`--isq-organization <ISQ_ORGANIZATION>`		ISQ organization strategy: default or moqe
`--imatrix <IMATRIX>`		imatrix file for enhanced quantization
`--calibration-file <CALIBRATION_FILE>`		Calibration file for imatrix generation
`--cpu`	`false`	Force CPU-only execution
`-n, --device-layers <DEVICE_LAYERS>`		Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`). Omit for automatic device mapping
`--topology <TOPOLOGY>`		Topology YAML file for device mapping
`--hf-cache <HF_CACHE>`		Custom HuggingFace cache directory
`--max-seq-len <MAX_SEQ_LEN>`	`4096`	Max sequence length for automatic device mapping
`--max-batch-size <MAX_BATCH_SIZE>`	`1`	Max batch size for automatic device mapping
`--paged-attn <MODE>`	`auto`	PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU. `on`: force enable (fails if unsupported). `off`: force disable. Possible values: `auto`, `on`, `off`.
`--pa-context-len <CONTEXT_LEN>`		Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM
`--pa-memory-mb <MEMORY_MB>`		GPU memory to allocate in MBs (alternative to context-len)
`--pa-memory-fraction <MEMORY_FRACTION>`		GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb)
`--pa-block-size <BLOCK_SIZE>`		Tokens per block (default: 32 on CUDA)
`--pa-cache-type <CACHE_TYPE>`	`auto`	KV cache quantization type
`--max-edge <MAX_EDGE>`		Maximum edge length for image resizing (aspect ratio preserved)
`--max-num-images <MAX_NUM_IMAGES>`		Maximum number of images per request
`--max-image-length <MAX_IMAGE_LENGTH>`		Maximum image dimension for device mapping
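
The `--device-layers` value uses `;` as its separator, which an unquoted shell would treat as the end of the command. The sketch below builds the mapping string and prints a quoted command line. It assumes that ORD is a zero-based device ordinal and NUM is the number of layers placed on that device, which is one reading of the `ORD:NUM` format; the helper name `device_layers` and the model ID are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: join per-device layer counts into ORD:NUM;ORD:NUM form.
# Arguments are the layer counts for device ordinals 0, 1, 2, ...
device_layers() {
    ord=0
    spec=""
    for num in "$@"; do
        # Append ';' only when spec already holds an entry.
        spec="${spec}${spec:+;}${ord}:${num}"
        ord=$((ord + 1))
    done
    printf '%s\n' "$spec"
}

LAYERS=$(device_layers 10 20)   # yields 0:10;1:20

# Keep the value quoted: an unquoted ';' would split the command in two.
echo mistralrs run auto --model-id your-org/your-model --device-layers "\"$LAYERS\""
```

The same quoting concern applies to the other semicolon-separated values in the table, such as `--quantized-file` and `--from-uqff`.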

mistralrs run text

Text generation model with explicit configuration

mistralrs run text [OPTIONS] --model-id <MODEL_ID>

Option	Default	Description
`-m, --model-id <MODEL_ID>`	required	HuggingFace model ID or local path to model directory
`-t, --tokenizer <TOKENIZER>`		Path to local tokenizer.json file
`-a, --arch <ARCH>`		Model architecture (auto-detected if not specified)
`--dtype <DTYPE>`	`auto`	Model data type
`--format <FORMAT>`		Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: `plain`, `gguf`, `ggml`.
`-f, --quantized-file <QUANTIZED_FILE>`		Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple)
`--tok-model-id <TOK_MODEL_ID>`		Model ID for tokenizer when using quantized format
`--gqa <GQA>`	`1`	GQA value for GGML models
`--enable-lora`	`false`	Enable dynamic LoRA without preloading an adapter. Supports text models. Qwen3.5/3.6 MoE requires automatic model selection; vision-tower adapters are unsupported
`--lora <ALIAS=SOURCE|JSON>`		Preload a language-model LoRA adapter as ALIAS=SOURCE. Remote adapters use revision main. May be repeated. Qwen3.5/3.6 MoE conditional-generation models require auto model selection; vision-tower adapters are unsupported
`--lora-max-adapters <LORA_MAX_ADAPTERS>`	`16`	Maximum loaded LoRA aliases and, independently, resident adapter generations
`--lora-max-rank <LORA_MAX_RANK>`	`256`	Maximum rank accepted for a LoRA adapter
`--lora-max-bytes <BYTES>`	`8589934592`	Maximum memory used by loaded adapters
`--legacy-lora <SOURCE>`		Legacy LoRA adapter source for a raw GGUF or GGML model
`--legacy-lora-order <LEGACY_LORA_ORDER>`		Ordering JSON file for a legacy raw GGUF or GGML LoRA adapter
`--xlora <XLORA>`		X-LoRA adapter model ID
`--xlora-order <XLORA_ORDER>`		X-LoRA ordering JSON file
`--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX>`		Target non-granular index for X-LoRA
`--quant <QUANT>`		Quantization front-door: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.). This prefers prebuilt UQFF from `mistralrs-community/<model>-UQFF`, so use `--isq` if you do not want to switch to a prebuilt UQFF
`--isq <IN_SITU_QUANT>`		In-situ quantization: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.) and quantizes the selected model in-place (in-situ)
`--from-uqff <FROM_UQFF>`		UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations
`--isq-organization <ISQ_ORGANIZATION>`		ISQ organization strategy: default or moqe
`--imatrix <IMATRIX>`		imatrix file for enhanced quantization
`--calibration-file <CALIBRATION_FILE>`		Calibration file for imatrix generation
`--cpu`	`false`	Force CPU-only execution
`-n, --device-layers <DEVICE_LAYERS>`		Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`). Omit for automatic device mapping
`--topology <TOPOLOGY>`		Topology YAML file for device mapping
`--hf-cache <HF_CACHE>`		Custom HuggingFace cache directory
`--max-seq-len <MAX_SEQ_LEN>`	`4096`	Max sequence length for automatic device mapping
`--max-batch-size <MAX_BATCH_SIZE>`	`1`	Max batch size for automatic device mapping
`--paged-attn <MODE>`	`auto`	PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU. `on`: force enable (fails if unsupported). `off`: force disable. Possible values: `auto`, `on`, `off`.
`--pa-context-len <CONTEXT_LEN>`		Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM
`--pa-memory-mb <MEMORY_MB>`		GPU memory to allocate in MBs (alternative to context-len)
`--pa-memory-fraction <MEMORY_FRACTION>`		GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb)
`--pa-block-size <BLOCK_SIZE>`		Tokens per block (default: 32 on CUDA)
`--pa-cache-type <CACHE_TYPE>`	`auto`	KV cache quantization type
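
The `--lora-max-bytes` default of `8589934592` is easier to read as 8 GiB (8 × 1024³ bytes). The sketch below converts a GiB figure to bytes and prints a command line that preloads one adapter with a smaller 4 GiB budget. The helper name `gib_to_bytes`, the alias `my-adapter`, the local adapter path, and the model ID are all placeholders, and whether a relative local path is accepted as SOURCE is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: convert whole GiB to bytes for --lora-max-bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8                  # prints 8589934592, the documented default
BUDGET=$(gib_to_bytes 4)        # 4294967296

# Alias, adapter source, and model ID are placeholders.
echo mistralrs run text --model-id your-org/your-model \
    --lora my-adapter=./adapters/my-adapter \
    --lora-max-bytes "$BUDGET"
```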

mistralrs run multimodal

Multimodal model

mistralrs run multimodal [OPTIONS] --model-id <MODEL_ID>

Option	Default	Description
`-m, --model-id <MODEL_ID>`	required	HuggingFace model ID or local path to model directory
`-t, --tokenizer <TOKENIZER>`		Path to local tokenizer.json file
`-a, --arch <ARCH>`		Model architecture (auto-detected if not specified)
`--dtype <DTYPE>`	`auto`	Model data type
`--format <FORMAT>`		Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: `plain`, `gguf`, `ggml`.
`-f, --quantized-file <QUANTIZED_FILE>`		Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple)
`--tok-model-id <TOK_MODEL_ID>`		Model ID for tokenizer when using quantized format
`--gqa <GQA>`	`1`	GQA value for GGML models
`--quant <QUANT>`		Quantization front-door: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.). This prefers prebuilt UQFF from `mistralrs-community/<model>-UQFF`, so use `--isq` if you do not want to switch to a prebuilt UQFF
`--isq <IN_SITU_QUANT>`		In-situ quantization: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.) and quantizes the selected model in-place (in-situ)
`--from-uqff <FROM_UQFF>`		UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations
`--isq-organization <ISQ_ORGANIZATION>`		ISQ organization strategy: default or moqe
`--imatrix <IMATRIX>`		imatrix file for enhanced quantization
`--calibration-file <CALIBRATION_FILE>`		Calibration file for imatrix generation
`--cpu`	`false`	Force CPU-only execution
`-n, --device-layers <DEVICE_LAYERS>`		Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`). Omit for automatic device mapping
`--topology <TOPOLOGY>`		Topology YAML file for device mapping
`--hf-cache <HF_CACHE>`		Custom HuggingFace cache directory
`--max-seq-len <MAX_SEQ_LEN>`	`4096`	Max sequence length for automatic device mapping
`--max-batch-size <MAX_BATCH_SIZE>`	`1`	Max batch size for automatic device mapping
`--paged-attn <MODE>`	`auto`	PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU. `on`: force enable (fails if unsupported). `off`: force disable. Possible values: `auto`, `on`, `off`.
`--pa-context-len <CONTEXT_LEN>`		Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM
`--pa-memory-mb <MEMORY_MB>`		GPU memory to allocate in MBs (alternative to context-len)
`--pa-memory-fraction <MEMORY_FRACTION>`		GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb)
`--pa-block-size <BLOCK_SIZE>`		Tokens per block (default: 32 on CUDA)
`--pa-cache-type <CACHE_TYPE>`	`auto`	KV cache quantization type
`--max-edge <MAX_EDGE>`		Maximum edge length for image resizing (aspect ratio preserved)
`--max-num-images <MAX_NUM_IMAGES>`		Maximum number of images per request
`--max-image-length <MAX_IMAGE_LENGTH>`		Maximum image dimension for device mapping
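
The table describes `--pa-context-len`, `--pa-memory-mb`, and `--pa-memory-fraction` as alternatives for sizing the PagedAttention KV cache. One way to keep a launch script from passing more than one of them is a small selector like the sketch below. The helper name `pa_sizing_flag`, its mode keywords, the model ID, and the value `0.8` are illustrative choices for this example, not part of the CLI.

```shell
#!/bin/sh
# Hypothetical helper: print exactly one KV-cache sizing flag.
#   $1 = context | mb | fraction, $2 = value for that flag
pa_sizing_flag() {
    case $1 in
        context)  printf '%s %s\n' --pa-context-len "$2" ;;
        mb)       printf '%s %s\n' --pa-memory-mb "$2" ;;
        fraction) printf '%s %s\n' --pa-memory-fraction "$2" ;;
        *)        echo "unknown sizing mode: $1" >&2; return 1 ;;
    esac
}

# Force PagedAttention on and size the cache as a fraction of GPU memory.
# The substitution is left unquoted on purpose so it splits into flag and value.
echo mistralrs run multimodal --model-id your-org/your-model \
    --paged-attn on $(pa_sizing_flag fraction 0.8)
```

Because `--paged-attn on` fails where PagedAttention is unsupported, leaving the mode at `auto` is the safer choice on Metal or CPU.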

mistralrs run diffusion

Image generation model (diffusion)

mistralrs run diffusion [OPTIONS] --model-id <MODEL_ID>

Option	Default	Description
`-m, --model-id <MODEL_ID>`	required	HuggingFace model ID or local path to model directory
`-t, --tokenizer <TOKENIZER>`		Path to local tokenizer.json file
`-a, --arch <ARCH>`		Model architecture (auto-detected if not specified)
`--dtype <DTYPE>`	`auto`	Model data type
`--cpu`	`false`	Force CPU-only execution
`-n, --device-layers <DEVICE_LAYERS>`		Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`). Omit for automatic device mapping
`--topology <TOPOLOGY>`		Topology YAML file for device mapping
`--hf-cache <HF_CACHE>`		Custom HuggingFace cache directory
`--max-seq-len <MAX_SEQ_LEN>`	`4096`	Max sequence length for automatic device mapping
`--max-batch-size <MAX_BATCH_SIZE>`	`1`	Max batch size for automatic device mapping

mistralrs run speech

Speech synthesis model

mistralrs run speech [OPTIONS] --model-id <MODEL_ID>

Option	Default	Description
`-m, --model-id <MODEL_ID>`	required	HuggingFace model ID or local path to model directory
`-t, --tokenizer <TOKENIZER>`		Path to local tokenizer.json file
`-a, --arch <ARCH>`		Model architecture (auto-detected if not specified)
`--dtype <DTYPE>`	`auto`	Model data type
`--cpu`	`false`	Force CPU-only execution
`-n, --device-layers <DEVICE_LAYERS>`		Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`). Omit for automatic device mapping
`--topology <TOPOLOGY>`		Topology YAML file for device mapping
`--hf-cache <HF_CACHE>`		Custom HuggingFace cache directory
`--max-seq-len <MAX_SEQ_LEN>`	`4096`	Max sequence length for automatic device mapping
`--max-batch-size <MAX_BATCH_SIZE>`	`1`	Max batch size for automatic device mapping

mistralrs run embedding

Embedding model

mistralrs run embedding [OPTIONS] --model-id <MODEL_ID>

Option	Default	Description
`-m, --model-id <MODEL_ID>`	required	HuggingFace model ID or local path to model directory
`-t, --tokenizer <TOKENIZER>`		Path to local tokenizer.json file
`-a, --arch <ARCH>`		Model architecture (auto-detected if not specified)
`--dtype <DTYPE>`	`auto`	Model data type
`--format <FORMAT>`		Model format: plain (safetensors), gguf, or ggml. Auto-detected if not specified. Possible values: `plain`, `gguf`, `ggml`.
`-f, --quantized-file <QUANTIZED_FILE>`		Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple)
`--tok-model-id <TOK_MODEL_ID>`		Model ID for tokenizer when using quantized format
`--gqa <GQA>`	`1`	GQA value for GGML models
`--quant <QUANT>`		Quantization front-door: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.). This prefers prebuilt UQFF from `mistralrs-community/<model>-UQFF`, so use `--isq` if you do not want to switch to a prebuilt UQFF
`--isq <IN_SITU_QUANT>`		In-situ quantization: accepts numeric levels (`2`, `3`, `4`, `5`, `6`, `8`) or raw quant names (`q4k`, `q8_0`, etc.) and quantizes the selected model in-place (in-situ)
`--from-uqff <FROM_UQFF>`		UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., `--from-uqff 8` finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations
`--isq-organization <ISQ_ORGANIZATION>`		ISQ organization strategy: default or moqe
`--imatrix <IMATRIX>`		imatrix file for enhanced quantization
`--calibration-file <CALIBRATION_FILE>`		Calibration file for imatrix generation
`--cpu`	`false`	Force CPU-only execution
`-n, --device-layers <DEVICE_LAYERS>`		Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`). Omit for automatic device mapping
`--topology <TOPOLOGY>`		Topology YAML file for device mapping
`--hf-cache <HF_CACHE>`		Custom HuggingFace cache directory
`--max-seq-len <MAX_SEQ_LEN>`	`4096`	Max sequence length for automatic device mapping
`--max-batch-size <MAX_BATCH_SIZE>`	`1`	Max batch size for automatic device mapping
`--paged-attn <MODE>`	`auto`	PagedAttention mode. `auto` (default): enabled on CUDA, disabled on Metal/CPU. `on`: force enable (fails if unsupported). `off`: force disable. Possible values: `auto`, `on`, `off`.
`--pa-context-len <CONTEXT_LEN>`		Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM
`--pa-memory-mb <MEMORY_MB>`		GPU memory to allocate in MBs (alternative to context-len)
`--pa-memory-fraction <MEMORY_FRACTION>`		GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb)
`--pa-block-size <BLOCK_SIZE>`		Tokens per block (default: 32 on CUDA)
`--pa-cache-type <CACHE_TYPE>`	`auto`	KV cache quantization type