Quantization tradeoffs

--isq quantizes weights as the model loads. The full-precision weights are never all resident in memory at once: the engine reads each weight, produces its quantized form, and discards the source before moving to the next.
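As a rough illustration of that load pattern (not the engine's actual implementation), the Python sketch below quantizes tensors one at a time from a generator and tracks how many full-precision values are held at once. The int8 absmax scheme, the function names, and the toy checkpoint are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple

Quantized = Tuple[float, List[int]]  # (scale, integer codes)


def quantize_int8(values: List[float]) -> Quantized:
    """Symmetric absmax quantization to int8 (a stand-in for a real ISQ format)."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127.0 if peak > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def stream_quantize(
    source: Iterator[Tuple[str, List[float]]]
) -> Tuple[Dict[str, Quantized], int]:
    """Quantize tensors one by one; also report the largest number of
    full-precision values held at any moment."""
    quantized: Dict[str, Quantized] = {}
    peak_fp_values = 0
    for name, values in source:
        peak_fp_values = max(peak_fp_values, len(values))
        quantized[name] = quantize_int8(values)
        del values  # drop the full-precision source before reading the next tensor
    return quantized, peak_fp_values


def fake_checkpoint() -> Iterator[Tuple[str, List[float]]]:
    """Hypothetical checkpoint that yields one tensor at a time."""
    yield "layer0.weight", [0.5, -1.0, 0.25, 0.0]
    yield "layer1.weight", [2.0, -0.5]


model, peak = stream_quantize(fake_checkpoint())
print(f"tensors quantized: {len(model)}, peak full-precision values: {peak}")
```

The peak equals the size of the largest single tensor rather than the sum of all tensors, which is the property the paragraph above describes.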

First-run load is slower than loading a pre-quantized format (GGUF, UQFF), which has no conversion work to do.

--isq N maps to a specific format based on the active device:

N    Metal    CUDA / CPU
2    AFQ2     Q2K
3    AFQ3     Q3K
4    AFQ4     Q4K
5    Q5K      Q5K
6    AFQ6     Q6K
8    AFQ8     Q8_0

Explicit format names (q4k, afq8, etc.) bypass the device check. Incompatible combinations (e.g., FP8 on GPUs below compute capability 8.9) are rejected at load time.
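The mapping above can be read as a simple lookup. The following Python sketch mirrors the table and the explicit-name bypass; `resolve_isq` and the device strings are hypothetical names for illustration, and the function does not model the load-time compatibility check.

```python
# Numeric shorthand -> format, per device column of the table above.
NUMERIC_ISQ = {
    2: {"metal": "AFQ2", "other": "Q2K"},
    3: {"metal": "AFQ3", "other": "Q3K"},
    4: {"metal": "AFQ4", "other": "Q4K"},
    5: {"metal": "Q5K", "other": "Q5K"},
    6: {"metal": "AFQ6", "other": "Q6K"},
    8: {"metal": "AFQ8", "other": "Q8_0"},
}


def resolve_isq(value: str, device: str) -> str:
    """Resolve an --isq value to a format name (illustrative only).

    Numeric values depend on the device; explicit names are returned as
    given (upper-cased), without consulting the device.
    """
    if value.isdigit():
        bits = int(value)
        if bits not in NUMERIC_ISQ:
            raise ValueError(f"no numeric shorthand for {bits} bits")
        column = "metal" if device == "metal" else "other"
        return NUMERIC_ISQ[bits][column]
    return value.upper()


print(resolve_isq("4", "metal"))
print(resolve_isq("4", "cuda"))
print(resolve_isq("afq8", "cuda"))
```

Note that 5 resolves to Q5K on every device, since the table lists no AFQ5 format.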

  • Q*K (Q2K, Q3K, Q4K, Q5K, Q6K): GGML-compatible block quantization. Broadly applicable.
  • AFQ (AFQ2, AFQ3, AFQ4, AFQ6, AFQ8): optimized for Apple Silicon. Also usable on CUDA.
  • Legacy GGML (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1): supported for GGUF compatibility.
  • FP8 (F8E4M3): native FP8 matmul on compute capability 8.9+.
  • MXFP4: 4-bit microscaling; native fast path on Blackwell.
  • HQQ (HQQ4, HQQ8): alternative 4- and 8-bit schemes.

The numeric shorthand picks a format the active device supports; the explicit names override that.

--isq-organization selects which layers get quantized:

  • default: every linear layer the pipeline exposes for quantization.
  • moqe: only MoE expert layers; the shared (non-expert) trunk stays at native precision.

moqe is useful on mixture-of-experts models where the experts dominate the parameter count. On non-MoE models it has no meaningful effect.
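To see why moqe still captures most of the savings on such models, the sketch below selects layers under each organization and reports the fraction of parameters that would be quantized. The layer names, the `.experts.` naming convention, and the parameter counts are invented for the example and do not reflect how the engine identifies expert layers.

```python
from typing import Dict, List

# Hypothetical layer names mapped to parameter counts.
LAYERS: Dict[str, int] = {
    "blocks.0.attn.q_proj": 1_000,
    "blocks.0.attn.o_proj": 1_000,
    "blocks.0.experts.0.up_proj": 4_000,
    "blocks.0.experts.1.up_proj": 4_000,
}


def select_layers(layers: Dict[str, int], organization: str) -> List[str]:
    """Return the layer names that would be quantized (illustrative only)."""
    if organization == "default":
        return sorted(layers)
    if organization == "moqe":
        # Assumed convention: expert layers contain ".experts." in their name.
        return sorted(name for name in layers if ".experts." in name)
    raise ValueError(f"unknown organization: {organization}")


def quantized_fraction(layers: Dict[str, int], organization: str) -> float:
    """Fraction of all parameters covered by the selected layers."""
    chosen = select_layers(layers, organization)
    return sum(layers[name] for name in chosen) / sum(layers.values())


for org in ("default", "moqe"):
    print(org, quantized_fraction(LAYERS, org))
```

In this toy model the experts hold 8,000 of 10,000 parameters, so moqe quantizes most of the weights while leaving the attention projections untouched.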

An importance matrix (imatrix) holds per-weight scaling factors derived by running the unquantized model on calibration data and measuring each weight’s contribution to output activations. The quantizer uses it to allocate precision to higher-impact weights.

Two flags:

  • --imatrix <path>: load an existing imatrix file.
  • --calibration-file <path>: generate an imatrix from calibration text at load time.

The two flags are mutually exclusive. A file passed with --imatrix is reused across runs; --calibration-file regenerates the imatrix on every load. An imatrix affects the Q*K and HQQ formats; AFQ and legacy GGML formats are unaffected.
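One way to picture how importance values steer a quantizer is a scale search that minimizes importance-weighted error. This is a toy sketch under that assumption, not the algorithm used by the Q*K or HQQ quantizers; the 4-bit symmetric grid, the candidate range, and all names are illustrative.

```python
from typing import List


def quantize(values: List[float], scale: float, levels: int = 7) -> List[float]:
    """Round to a symmetric integer grid in [-levels, levels], then dequantize."""
    return [max(-levels, min(levels, round(v / scale))) * scale for v in values]


def weighted_error(values: List[float], importance: List[float], scale: float) -> float:
    """Importance-weighted squared reconstruction error for a given scale."""
    deq = quantize(values, scale)
    return sum(w * (v - q) ** 2 for v, q, w in zip(values, deq, importance))


def pick_scale(values: List[float], importance: List[float], steps: int = 20) -> float:
    """Try scales from 0.6x up to 1.0x the absmax scale and keep the best."""
    base = max(abs(v) for v in values) / 7
    candidates = [base * (0.6 + 0.02 * i) for i in range(steps)] + [base]
    return min(candidates, key=lambda s: weighted_error(values, importance, s))


# One large but low-importance weight, several small high-importance ones.
values = [0.9, -0.1, 0.05, 0.12, -0.07, 0.03]
importance = [0.1, 5.0, 5.0, 5.0, 5.0, 5.0]

base_scale = max(abs(v) for v in values) / 7
best_scale = pick_scale(values, importance)
print("absmax scale error:", weighted_error(values, importance, base_scale))
print("chosen scale error:", weighted_error(values, importance, best_scale))
```

Because the plain absmax scale is among the candidates, the chosen scale can never do worse on the weighted error; with skewed importance it may trade accuracy on the low-importance outlier for a finer grid on the weights that matter.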

Interaction with paged attention and flash attention

ISQ applies only to weights. The KV cache is a separate budget: paged attention manages its memory independently, and --pa-cache-type quantizes the cache itself.
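The two budgets can be estimated separately with back-of-the-envelope arithmetic. The sketch below uses the standard KV cache size formula and a nominal bits-per-weight figure that ignores per-block scale overhead; the model shape and parameter count are hypothetical, and real formats use somewhat more than their nominal bit width.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight memory, ignoring per-block scales and metadata."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(
    layers: int, kv_heads: int, head_dim: int, tokens: int, bytes_per_element: int
) -> int:
    """Keys and values (factor 2) for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_element


# Hypothetical model: 7B parameters, 32 layers, 8 KV heads of dimension 128.
GIB = 1024 ** 3
print("weights at 16 bits:", weight_bytes(7_000_000_000, 16) / GIB, "GiB")
print("weights at 4 bits: ", weight_bytes(7_000_000_000, 4) / GIB, "GiB")
print("KV cache, 4096 tokens, 2-byte elements:",
      kv_cache_bytes(32, 8, 128, 4096, 2) / GIB, "GiB")
print("KV cache, 4096 tokens, 1-byte elements:",
      kv_cache_bytes(32, 8, 128, 4096, 1) / GIB, "GiB")
```

Changing the weight format moves only the first figure and changing the cache element size moves only the second, which is why the two settings are tuned independently.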

Flash attention operates on activations, not weights, and composes with any ISQ format.

UQFF files are a serialized form of an ISQ-quantized model. mistralrs quantize runs ISQ and writes the result; --from-uqff loads that file without re-running the quantization step. Quality is identical at the same ISQ type; only load time differs.