
Quantize a model

Quantization stores weights at lower precision so a model fits in less memory. A 14B model in BF16 (2 bytes/param) needs about 28 GB for weights; at 4 bits the same model is about 7 GB. --quant is the front door:

Terminal window
mistralrs run --quant 4 -m google/gemma-4-E4B-it

--quant 4 first looks for a matching prebuilt UQFF (Universal Quantized File Format) in mistralrs-community/<model>-UQFF and loads it directly if one is published. Otherwise it falls back to ISQ (in-situ quantization): the engine quantizes weights on the fly as they load, picking a hardware-appropriate format (AFQ4 on Metal, Q4K elsewhere). With ISQ the full unquantized model is never resident in memory; loading just takes longer than with a pre-quantized file.

For a local model path, --quant skips the UQFF probe and applies ISQ directly.
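The resolution order described above can be sketched as a small decision function. This is an illustrative model of the documented behavior for --quant 4 only, not the engine's code: the function name, the `uqff_published` callback, and the assumption that `<model>` means the repo name without its organization prefix are all hypothetical.

```python
from typing import Callable, Tuple


def resolve_quant_4bit(
    model: str,
    is_local_path: bool,
    device: str,
    uqff_published: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (loader, detail) for a `--quant 4` request (hypothetical sketch)."""
    if not is_local_path:
        # Assumption: "<model>" is the repo name without the organization prefix.
        name = model.split("/")[-1]
        uqff_repo = f"mistralrs-community/{name}-UQFF"
        if uqff_published(uqff_repo):
            return ("uqff", uqff_repo)
    # Local path, or no prebuilt UQFF: fall back to ISQ with a
    # device-appropriate 4-bit format (AFQ4 on Metal, Q4K elsewhere).
    fmt = "afq4" if device == "metal" else "q4k"
    return ("isq", fmt)
```

A local path never reaches the UQFF probe, which matches the rule that --quant applies ISQ directly in that case.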

Footprint scales roughly linearly with bits per weight: a BF16 model uses about half the memory at --quant 8 and a quarter at --quant 4. The KV cache is a separate budget that depends on context length, not on weight quantization. To measure actual usage, use nvidia-smi (or an equivalent); to estimate before downloading anything, use mistralrs tune.
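The linear scaling is easy to check by hand. The helper below is a back-of-the-envelope sketch (the function name is made up for illustration); it ignores the per-block scale metadata that real quantized formats store, so actual files come out somewhat larger than the nominal bit width suggests.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), ignoring per-block overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# The 14B example from the introduction: BF16, then --quant 8 and --quant 4.
for bits in (16, 8, 4):
    print(f"{bits:>2} bits: {weight_footprint_gb(14e9, bits):.1f} GB")
```

This covers weights only; the KV cache has to be budgeted on top.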

Supported widths: 2, 3, 4, 5, 6, 8. Fewer bits means less memory and more quality loss:

  • 8 bits is near-lossless.
  • 5-6 bits is a good middle ground.
  • 4 bits is the common sweet spot for fitting larger models.
  • 2-3 bits degrades noticeably; use only when nothing else fits.

At 4 bits and below, an importance matrix recovers a meaningful amount of quality.

--quant also accepts format names:

Terminal window
mistralrs run --quant q4k -m google/gemma-4-E4B-it # GGML K-quant
mistralrs run --quant afq4 -m google/gemma-4-E4B-it # AFQ, Metal-optimized
mistralrs run --quant q8_0 -m google/gemma-4-E4B-it # standard GGUF 8-bit format

How the numeric shorthands resolve per backend, plus the full type list and hardware constraints, is in the quantization types reference.

Use --isq instead of --quant only when you want to force runtime ISQ and skip the UQFF lookup.

--quant auto estimates what fits on the current host and picks for you:

Terminal window
mistralrs run --quant auto -m google/gemma-4-E4B-it

It runs the same analysis as mistralrs tune with the balanced profile. If the model fits at full precision, no quantization is applied; otherwise the recommended type is resolved like an explicit --quant value (prebuilt UQFF preferred, ISQ fallback).

mistralrs tune prints the full analysis instead of acting on it:

Terminal window
mistralrs tune -m google/gemma-4-E4B-it

Output is a table with columns Quant | Est. Size | VRAM % | Context Room | Quality | Status. One row is marked Recommended; the others are marked Fits, Hybrid (split across GPU and CPU), or Too Large. A recommended command line is printed below the table.

tune is a config-based estimator, not a benchmark. It downloads only the model’s config files, computes per-quantization sizes from the architecture, and checks them against detected VRAM. No weights are downloaded, no model is loaded, and the quality column is a fixed tier per format (Baseline, Near-lossless, Good, Acceptable, Degraded), not a measurement.
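To make "config-based estimator" concrete, here is a toy version of the size-versus-memory check. It is a sketch under loud assumptions: the thresholds for Fits, Hybrid, and Too Large are invented for illustration, the function names are hypothetical, and the real tool also accounts for context room and picks a Recommended row, neither of which is modeled here.

```python
from typing import List, Tuple


def classify(size_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Hypothetical status rule: GPU only, GPU plus CPU, or neither."""
    if size_gb <= vram_gb:
        return "Fits"
    if size_gb <= vram_gb + ram_gb:
        return "Hybrid"
    return "Too Large"


def estimate_table(
    num_params: float,
    vram_gb: float,
    ram_gb: float,
    widths: Tuple[int, ...] = (16, 8, 6, 4, 2),
) -> List[Tuple[int, float, float, str]]:
    """Rows of (bits, estimated size in GB, VRAM %, status), computed from counts alone."""
    rows = []
    for bits in widths:
        size = num_params * bits / 8 / 1e9
        rows.append((bits, size, 100 * size / vram_gb, classify(size, vram_gb, ram_gb)))
    return rows


for bits, size, pct, status in estimate_table(14e9, vram_gb=24, ram_gb=8):
    print(f"{bits:>2} bits | {size:5.1f} GB | {pct:5.1f}% | {status}")
```

Nothing in this calculation touches model weights, which is why an estimate of this kind can run before any download.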

Variations:

Terminal window
mistralrs tune --profile quality -m google/gemma-4-E4B-it # quality, balanced (default), fast
mistralrs tune --isq q4k -m google/gemma-4-E4B-it # bias toward a specific target
mistralrs tune --json -m google/gemma-4-E4B-it # machine-readable output
mistralrs tune --emit-config gemma.toml -m google/gemma-4-E4B-it

--emit-config writes a TOML config with the recommended settings; run it with mistralrs from-config -f gemma.toml. tune rejects --quant auto since tune is the recommender. All flags: CLI reference.

The same two flags work with mistralrs serve: use --quant for the resolve-then-fallback front door, or --isq to force runtime ISQ:

Terminal window
mistralrs serve -m google/gemma-4-E4B-it --quant 4
mistralrs serve -m google/gemma-4-E4B-it --isq q4k

To quantize once and reuse the result, write a UQFF file with mistralrs quantize. See the UQFF guide.

ISQ runs at model load time: the engine reads each weight, produces its quantized form in parallel, and discards the source before moving on. Loading with ISQ is therefore slower than loading a pre-quantized format (UQFF, GGUF), which needs no conversion work. To skip the conversion on repeated loads, save the result as UQFF.
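The read, quantize, discard loop can be illustrated with a toy loader. This is a minimal sketch, not the engine's implementation: it uses plain symmetric round-to-nearest quantization with one scale per tensor as a stand-in for the real formats, runs sequentially rather than in parallel, and `fake_checkpoint` is a hypothetical substitute for reading tensors from disk.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def quantize_rtn(values: List[float], bits: int = 4) -> Tuple[float, List[int]]:
    """Symmetric round-to-nearest: returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def load_with_isq(
    tensors: Iterable[Tuple[str, List[float]]], bits: int = 4
) -> Dict[str, Tuple[float, List[int]]]:
    """Consume (name, weights) lazily so only one source tensor is alive at a time."""
    quantized = {}
    for name, weights in tensors:
        quantized[name] = quantize_rtn(weights, bits)
        # `weights` is rebound on the next iteration, so the source can be freed.
    return quantized


def fake_checkpoint() -> Iterator[Tuple[str, List[float]]]:
    # Hypothetical stand-in for streaming tensors from a checkpoint file.
    yield "layer0.weight", [0.7, -0.3, 0.1, 0.0]
    yield "layer1.weight", [1.4, -1.4, 0.2, 0.6]


model = load_with_isq(fake_checkpoint())
```

Because the source tensors arrive through a generator, peak memory is roughly the quantized model so far plus one full-precision tensor, which is the property that keeps the unquantized model from ever being fully resident.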

Format families
  • Q*K (q2k-q6k): GGML-compatible block quantization. Broadly applicable, works on all backends.
  • AFQ (afq2-afq8): affine quantization optimized for Apple Silicon. Runs on Metal (native kernels), CUDA (dedicated backend), and CPU (fallback).
  • Legacy GGML (q4_0, q4_1, q5_0, q5_1, q8_0): supported for GGUF compatibility.
  • FP8 (fp8, f8q8): native FP8 matmul on NVIDIA compute capability 8.9+.
  • MXFP4 (4-bit microscaling): native fast path on Blackwell.
  • HQQ (hqq4, hqq8): alternative 4- and 8-bit schemes.

The numeric shorthand picks a format the active device supports; explicit names override that, and incompatible combinations are rejected at load time. Full constraint table: quantization types reference.

--isq-organization selects which layers get quantized:

  • default: every linear layer the pipeline exposes for quantization.
  • moqe (MoQE, Mixture of Quantization Experts): only MoE (Mixture of Experts) expert layers; the shared (non-expert) trunk stays at native precision.

moqe is useful on MoE models where the experts dominate parameter count.

An importance matrix (imatrix) is a per-column weight derived from running the model on calibration data and accumulating squared input activations. The quantizer uses it to allocate precision to higher-impact weights, which matters most at low bit widths.
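The accumulation step can be sketched in a few lines. This is an illustrative toy, not the on-disk imatrix format or the quantizer mistral.rs uses: the function names are hypothetical, and averaging the squared activations (rather than keeping raw sums and counts) is a simplification chosen for readability.

```python
from typing import List


def accumulate_importance(activations: List[List[float]]) -> List[float]:
    """Mean squared input activation per column across calibration rows."""
    n_cols = len(activations[0])
    sums = [0.0] * n_cols
    for row in activations:
        for j, x in enumerate(row):
            sums[j] += x * x
    return [s / len(activations) for s in sums]


def weighted_error(w: List[float], w_q: List[float], importance: List[float]) -> float:
    """Importance-weighted squared error between a weight row and its quantized form."""
    return sum(imp * (a - b) ** 2 for a, b, imp in zip(w, w_q, importance))


# Two calibration rows over three input columns; column 1 never activates.
acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
imp = accumulate_importance(acts)
```

A quantizer that minimizes `weighted_error` instead of plain squared error tolerates large rounding error in columns whose inputs are rarely active (column 1 here) and spends its precision where activations are large.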

Two flags, used with --isq:

  • --imatrix <path>: load an existing imatrix. Accepts llama.cpp .imatrix files (layer names are mapped automatically) or mistral.rs .cimatrix files.
  • --calibration-file <path>: generate the importance data at load time by running the calibration text through the model, then quantize.

The two flags are mutually exclusive. An --imatrix file can be reused across runs, whereas --calibration-file regenerates the importance data on every load. Importance weighting applies to the K-quant formats (q2k-q6k); other formats quantize without it. Calibration runs on all pipelines (text, multimodal, embedding), but the calibration text drives only the language model, so vision/audio encoder layers quantize without importance data.

To collect an imatrix from a live server’s real traffic instead of a static calibration file, see online calibration.

Full example: Rust, Python.

Interaction with paged attention and flash attention

ISQ applies to weights. The KV cache is a separate budget: paged attention manages its memory independently, and --pa-cache-type quantizes the cache itself. Flash attention operates on activations, not weights, and composes with any ISQ format.
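To see why the KV cache needs its own budget, the standard size estimate for a plain attention cache can be written out. This is a generic sketch: the function and its parameters are illustrative, the example dimensions are hypothetical rather than taken from any specific model, and architectures with sliding-window or shared KV layers will deviate from it.

```python
def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: float = 2.0,  # 2 bytes for a 16-bit cache
    batch: int = 1,
) -> float:
    """Approximate KV cache size in GB (10^9 bytes); the factor 2 covers keys and values."""
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / 1e9


# Hypothetical dimensions: 32 layers, 8 KV heads, head_dim 128, 8192-token context.
full = kv_cache_gb(32, 8, 128, 8192)
half = kv_cache_gb(32, 8, 128, 8192, bytes_per_elem=1.0)
print(f"16-bit cache: {full:.2f} GB, 8-bit cache: {half:.2f} GB")
```

None of the inputs depend on how the weights are quantized; only a cache-level setting such as --pa-cache-type changes the bytes per element.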

Pre-quantized models load directly, with no ISQ conversion:

  • UQFF: the native pre-quantized format. Loaded automatically by --quant when a sibling UQFF repo exists, or explicitly via --from-uqff. See the UQFF guide.
  • GGUF: loaded via --format gguf -f <file>.
  • GPTQ, AWQ: detected from the source repo’s config and loaded directly; no --quant or --isq needed.