
mistralrs quantize

Generate UQFF quantized model file

mistralrs quantize [OPTIONS] [COMMAND]
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | auto | Model data type |
| `--isq <IN_SITU_QUANT>` | | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated `--isq` flags (e.g., `--isq q4k,q8_0` or `--isq q4k --isq q8_0`) |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`) |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type) |
| `--no-readme` | false | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
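
The `--isq` option accepts several levels either as one comma-separated value or as repeated flags. The sketch below illustrates that the two forms carry the same information by normalizing both into one list. It is a hypothetical helper for scripts that wrap the CLI, not part of mistralrs; the function names `normalize_isq` and `isq_args` are assumptions made for illustration.

```python
from typing import Iterable, List


def normalize_isq(values: Iterable[str]) -> List[str]:
    """Flatten repeated and comma-separated --isq values into one ordered list."""
    levels: List[str] = []
    for value in values:
        for level in value.split(","):
            level = level.strip()
            if level:  # ignore empty pieces such as a trailing comma
                levels.append(level)
    return levels


def isq_args(values: Iterable[str], repeated: bool = False) -> List[str]:
    """Render ISQ levels as CLI arguments in either documented form."""
    levels = normalize_isq(values)
    if repeated:
        args: List[str] = []
        for level in levels:
            args += ["--isq", level]  # one --isq flag per level
        return args
    return ["--isq", ",".join(levels)]  # single comma-separated value


if __name__ == "__main__":
    print(isq_args(["q4k,q8_0"], repeated=True))
    print(isq_args(["q4k", "q8_0"]))
```

Either rendering can be appended to a `mistralrs quantize` argument list; which one to use is a matter of taste.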

Auto-detect model type (recommended)

mistralrs quantize auto [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | auto | Model data type |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated `--isq` flags (e.g., `--isq q4k,q8_0` or `--isq q4k --isq q8_0`) |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`) |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: `-o model/model-q4k.uqff` or `-o output/` |
| `--no-readme` | false | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
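
The three required options of `quantize auto` and the two `--output` modes can be sketched as a small command builder. This is an illustration only: `build_quantize_auto` and `output_mode` are hypothetical helpers, `path/to/model` is a placeholder, and two rules are assumptions rather than documented behavior (that a `.uqff` suffix selects file mode, and that file mode should be rejected when more than one ISQ level is given).

```python
import shlex
from pathlib import PurePosixPath
from typing import List, Sequence


def output_mode(output: str) -> str:
    """Classify an --output value (assumed rule: a .uqff suffix means a single file)."""
    return "file" if PurePosixPath(output).suffix == ".uqff" else "directory"


def build_quantize_auto(model_id: str, isq: Sequence[str], output: str) -> List[str]:
    """Assemble argv for `mistralrs quantize auto` from its three required options."""
    if not isq:
        raise ValueError("at least one ISQ level is required")
    if output_mode(output) == "file" and len(isq) != 1:
        # Assumption: one .uqff file corresponds to exactly one ISQ level.
        raise ValueError("use a directory for --output when passing several ISQ levels")
    return [
        "mistralrs", "quantize", "auto",
        "--model-id", model_id,
        "--isq", ",".join(isq),
        "--output", output,
    ]


if __name__ == "__main__":
    # Placeholder model path; prints the command instead of running it.
    argv = build_quantize_auto("path/to/model", ["q4k", "q8_0"], "output/")
    print(shlex.join(argv))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without mistralrs installed; passing the list to `subprocess.run` would execute it.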

Text generation model with explicit architecture

mistralrs quantize text [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | auto | Model data type |
| `-a, --arch <ARCH>` | | Model architecture (required for text models) |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated `--isq` flags (e.g., `--isq q4k,q8_0` or `--isq q4k --isq q8_0`) |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`) |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: `-o model/model-q4k.uqff` or `-o output/` |
| `--no-readme` | false | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |

Multimodal model

mistralrs quantize multimodal [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | auto | Model data type |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated `--isq` flags (e.g., `--isq q4k,q8_0` or `--isq q4k --isq q8_0`) |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`) |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: `-o model/model-q4k.uqff` or `-o output/` |
| `--no-readme` | false | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
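
The `--device-layers` value uses the compact `ORD:NUM;…` format. The sketch below parses such a string so it can be checked before being passed to the CLI. It assumes that `ORD` is a device ordinal and `NUM` a layer count for that device, which is a reading of the option description rather than something this page states; `parse_device_layers` is a hypothetical helper, not part of mistralrs.

```python
from typing import Dict


def parse_device_layers(spec: str) -> Dict[int, int]:
    """Parse 'ORD:NUM;ORD:NUM' into {ordinal: count} (assumed meaning: device -> layers)."""
    mapping: Dict[int, int] = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing ';'
        ordinal_text, sep, count_text = part.partition(":")
        if not sep:
            raise ValueError(f"expected ORD:NUM, got {part!r}")
        ordinal, count = int(ordinal_text), int(count_text)
        if ordinal < 0 or count < 0:
            raise ValueError(f"negative value in {part!r}")
        if ordinal in mapping:
            raise ValueError(f"device ordinal {ordinal} listed twice")
        mapping[ordinal] = count
    return mapping


if __name__ == "__main__":
    print(parse_device_layers("0:10;1:20"))
```

Validating the string up front gives a clearer error than a failed model load would, at the cost of duplicating whatever checks the CLI performs itself.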

Embedding model

mistralrs quantize embedding [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>
| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | auto | Model data type |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated `--isq` flags (e.g., `--isq q4k,q8_0` or `--isq q4k --isq q8_0`) |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | false | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;… e.g., `0:10;1:20`) |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | 1 | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: `-o model/model-q4k.uqff` or `-o output/` |
| `--no-readme` | false | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |