Quantization tradeoffs
In-situ quantization
--isq quantizes weights as the model loads. The full-precision weights are never all resident in memory at once: the engine reads each weight, produces its quantized form, and discards the source before moving to the next.
First-run load is slower than pre-quantized formats (GGUF, UQFF), which have no conversion work to do.
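The read, quantize, discard loop can be sketched as follows. This is a minimal illustration, not the engine's implementation: `read_weights` is a hypothetical stand-in for reading tensors from disk, and `quantize_absmax` is a toy per-tensor quantizer rather than any of the formats listed on this page.

```python
from typing import Dict, Iterator, List, Tuple

def quantize_absmax(weights: List[float], bits: int = 4) -> Tuple[float, List[int]]:
    """Toy symmetric quantizer: one scale per tensor, integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    amax = max((abs(w) for w in weights), default=0.0)
    scale = amax / qmax if amax > 0 else 1.0
    return scale, [max(-qmax, min(qmax, round(w / scale))) for w in weights]

def read_weights() -> Iterator[Tuple[str, List[float]]]:
    """Hypothetical loader: yields one full-precision tensor at a time."""
    yield "layer0.weight", [0.5, -1.0, 0.25, 0.75]
    yield "layer1.weight", [2.0, -0.5, 1.5, -2.0]
    yield "layer2.weight", [0.1, 0.2, -0.3, 0.4]

def load_with_isq(bits: int = 4) -> Tuple[Dict[str, Tuple[float, List[int]]], int]:
    quantized = {}
    peak_fp_elems = 0  # largest number of full-precision values live at once
    for name, tensor in read_weights():
        peak_fp_elems = max(peak_fp_elems, len(tensor))
        quantized[name] = quantize_absmax(tensor, bits)
        del tensor  # drop the full-precision source before reading the next one
    return quantized, peak_fp_elems

model, peak = load_with_isq()
total = sum(len(q) for _, q in model.values())
print(f"quantized {total} weights; at most {peak} full-precision values were live at once")
```

The peak counter only ever sees one source tensor, which is the property that keeps load-time memory well below the full-precision model size.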
Numeric shorthand resolution
--isq N maps to a specific format based on the active device:
| N | Metal | CUDA / CPU |
|---|---|---|
| 2 | AFQ2 | Q2K |
| 3 | AFQ3 | Q3K |
| 4 | AFQ4 | Q4K |
| 5 | Q5K | Q5K |
| 6 | AFQ6 | Q6K |
| 8 | AFQ8 | Q8_0 |
Explicit format names (q4k, afq8, etc.) bypass the device check. Incompatible combinations (e.g., FP8 on GPUs below compute capability 8.9) are rejected at load time.
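The table and the bypass rule can be expressed as a small lookup. This is a sketch under assumptions: the device labels ("metal", "cuda", "cpu") and the helper name `resolve_isq` are illustrative, and upper-casing explicit names is a guess at normalization, not documented engine behavior.

```python
# Mirrors the shorthand table above. Keys are N; values are (Metal, CUDA / CPU).
ISQ_SHORTHAND = {
    2: ("AFQ2", "Q2K"),
    3: ("AFQ3", "Q3K"),
    4: ("AFQ4", "Q4K"),
    5: ("Q5K", "Q5K"),   # no 5-bit AFQ format is listed, so Metal also gets Q5K
    6: ("AFQ6", "Q6K"),
    8: ("AFQ8", "Q8_0"),
}

def resolve_isq(value: str, device: str) -> str:
    """Resolve an --isq argument to a format name (hypothetical helper)."""
    if value.isdigit():
        n = int(value)
        if n not in ISQ_SHORTHAND:
            raise ValueError(f"no shorthand for {n}-bit quantization")
        metal_format, other_format = ISQ_SHORTHAND[n]
        return metal_format if device == "metal" else other_format
    # Explicit names skip the device lookup entirely (normalization is assumed).
    return value.upper()

print(resolve_isq("4", "metal"), resolve_isq("4", "cuda"), resolve_isq("afq8", "cuda"))
```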
Format families
- Q*K (Q2K, Q3K, Q4K, Q5K, Q6K): GGML-compatible block quantization. Broadly applicable.
- AFQ (AFQ2, AFQ3, AFQ4, AFQ6, AFQ8): optimized for Apple Silicon. Also usable on CUDA.
- Legacy GGML (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1): supported for GGUF compatibility.
- FP8 (F8E4M3): native FP8 matmul on compute capability 8.9+.
- MXFP4: 4-bit microscaling; native fast path on Blackwell.
- HQQ (HQQ4, HQQ8): alternative 4- and 8-bit schemes.
The numeric shorthand picks a format the active device supports; the explicit names override that.
Organization: default vs moqe
--isq-organization selects which layers get quantized:
- default: every linear layer the pipeline exposes for quantization.
- moqe: only MoE expert layers; the shared (non-expert) trunk stays at native precision.
moqe is useful on mixture-of-experts models where the experts dominate parameter count. On non-MoE models it has no meaningful effect.
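The difference between the two organizations comes down to a filter over layer names. The sketch below assumes hypothetical layer names and a hypothetical ".experts." naming convention; real names vary by architecture, and the engine's actual selection logic may differ.

```python
from typing import List

# Hypothetical layer names for one block of a mixture-of-experts model.
MOE_LAYERS = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.experts.0.up_proj",
    "layers.0.mlp.experts.0.down_proj",
    "layers.0.mlp.experts.1.up_proj",
    "layers.0.mlp.experts.1.down_proj",
]

# A dense (non-MoE) model has no expert layers at all.
DENSE_LAYERS = ["layers.0.self_attn.q_proj", "layers.0.mlp.up_proj"]

def layers_to_quantize(layers: List[str], organization: str) -> List[str]:
    """Pick the layers that would be quantized under each organization."""
    if organization == "default":
        return list(layers)
    if organization == "moqe":
        # Assumed convention: expert layers contain ".experts." in their name.
        return [name for name in layers if ".experts." in name]
    raise ValueError(f"unknown organization: {organization}")

print(len(layers_to_quantize(MOE_LAYERS, "default")))  # all six layers
print(len(layers_to_quantize(MOE_LAYERS, "moqe")))     # the four expert layers
print(layers_to_quantize(DENSE_LAYERS, "moqe"))        # nothing to select
```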
imatrix
An importance matrix is a per-weight scaling factor derived from running the unquantized model on calibration data and measuring each weight’s contribution to output activations. The quantizer uses it to allocate precision to higher-impact weights.
Two flags:
- --imatrix <path>: load an existing imatrix file.
- --calibration-file <path>: generate an imatrix from calibration text at load time.

The two flags are mutually exclusive. An --imatrix file can be reused across runs; --calibration-file regenerates the imatrix on every load. The imatrix affects the Q*K and HQQ formats; AFQ and legacy GGML formats are unaffected.
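To illustrate how importance values can steer a quantizer, the sketch below searches for a block scale that minimizes importance-weighted squared error. It is a toy scale search, not the engine's Q*K or HQQ quantizer; the weights, importance values, and function names are all made up for the example.

```python
from typing import List, Sequence

def dequantized(w: Sequence[float], scale: float, qmax: int) -> List[float]:
    """Round each weight to the nearest level, clamp, and map back to floats."""
    return [scale * max(-qmax, min(qmax, round(x / scale))) for x in w]

def weighted_error(w: Sequence[float], w_hat: Sequence[float],
                   importance: Sequence[float]) -> float:
    """Squared reconstruction error, weighted per element by importance."""
    return sum(imp * (a - b) ** 2 for a, b, imp in zip(w, w_hat, importance))

def best_scale(w: Sequence[float], importance: Sequence[float],
               qmax: int = 7, steps: int = 64) -> float:
    """Grid-search scales from 0.5x to 1.5x of the absmax scale."""
    amax = max(abs(x) for x in w)
    if amax == 0.0:
        return 1.0
    base = amax / qmax
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(w, dequantized(w, s, qmax), importance))

# Hypothetical block: the large weight matters little, the small ones matter a lot.
weights = [0.11, -0.07, 0.93, 0.02]
importance = [10.0, 10.0, 0.1, 10.0]
uniform = [1.0] * len(weights)

scale_plain = best_scale(weights, uniform)        # no imatrix
scale_imatrix = best_scale(weights, importance)   # imatrix-guided
err_plain = weighted_error(weights, dequantized(weights, scale_plain, 7), importance)
err_imatrix = weighted_error(weights, dequantized(weights, scale_imatrix, 7), importance)
print(f"weighted error without imatrix: {err_plain:.6f}, with imatrix: {err_imatrix:.6f}")
```

Because both searches cover the same candidate scales, the importance-guided choice can never score worse on the weighted error than the unguided one.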
Interaction with paged attention and flash attention
ISQ applies to weights. The KV cache is a separate budget: paged attention manages its memory independently, and --pa-cache-type quantizes the cache itself.
Flash attention operates on activations, not weights, and composes with any ISQ format.
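A back-of-the-envelope calculation shows why the two budgets are independent. The model dimensions below are hypothetical, and the weight estimate uses nominal bits per weight, ignoring the per-block scale overhead that real formats carry.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Nominal weight memory; depends on the ISQ format, not on context length."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int) -> int:
    """KV cache memory: one key and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical 8B-parameter model: 32 layers, 8 KV heads, head dimension 128.
N_PARAMS = 8_000_000_000
cache_fp16 = kv_cache_bytes(32, 8, 128, n_tokens=8192, bytes_per_elem=2)
cache_8bit = kv_cache_bytes(32, 8, 128, n_tokens=8192, bytes_per_elem=1)

for bits in (16, 8, 4):
    gib = weight_bytes(N_PARAMS, bits) / 2**30
    print(f"{bits:>2}-bit weights: {gib:.2f} GiB, KV cache unchanged at {cache_fp16 / 2**30:.2f} GiB")
print(f"8-bit cache elements halve the cache: {cache_8bit / 2**30:.2f} GiB")
```

Changing the weight format moves only the first number; changing the cache element size moves only the second.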
UQFF files are a serialized form of an ISQ-quantized model. mistralrs quantize runs ISQ and writes the result; --from-uqff loads that file without re-running the quantization step. Quality is identical at the same ISQ type; only load time differs.
See also
- Guide: pick a quantization.
- Reference: quantization types, UQFF format.