Pick a quantization method

mistral.rs supports multiple quantization formats. --quant 4 is the common starting point.

--quant N prefers a prebuilt UQFF when one is published, and otherwise falls back to runtime ISQ (in-situ quantization). In the fallback path, the numeric shorthand resolves to a hardware-appropriate format:

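| Shorthand | Metal | CUDA / CPU |
|-----------|-------|------------|
| 2         | AFQ2  | Q2K        |
| 3         | AFQ3  | Q3K        |
| 4         | AFQ4  | Q4K        |
| 5         | Q5K   | Q5K        |
| 6         | AFQ6  | Q6K        |
| 8         | AFQ8  | Q8_0       |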
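The lookup described above can be sketched as a small table-driven function. This is an illustration only, not mistral.rs source code: the function name `resolve_quant`, the backend labels `"metal"` and `"cuda_cpu"`, and the `uqff_published` flag are hypothetical names chosen for the example. The format names are copied from the table.

```python
from typing import Optional, Tuple

# Fallback formats copied from the table above.
# Backend keys ("metal", "cuda_cpu") are hypothetical labels for this sketch.
ISQ_FALLBACK = {
    2: {"metal": "AFQ2", "cuda_cpu": "Q2K"},
    3: {"metal": "AFQ3", "cuda_cpu": "Q3K"},
    4: {"metal": "AFQ4", "cuda_cpu": "Q4K"},
    5: {"metal": "Q5K", "cuda_cpu": "Q5K"},
    6: {"metal": "AFQ6", "cuda_cpu": "Q6K"},
    8: {"metal": "AFQ8", "cuda_cpu": "Q8_0"},
}


def resolve_quant(bits: int, backend: str, uqff_published: bool) -> Tuple[str, Optional[str]]:
    """Model the documented behavior of a numeric --quant shorthand.

    Returns ("uqff", None) when a prebuilt UQFF is assumed to exist,
    otherwise ("isq", <format name>) from the fallback table.
    """
    if bits not in ISQ_FALLBACK:
        raise ValueError(f"unsupported shorthand: {bits}")
    if backend not in ("metal", "cuda_cpu"):
        raise ValueError(f"unknown backend label: {backend}")
    if uqff_published:
        # A published UQFF takes priority over runtime ISQ.
        return ("uqff", None)
    return ("isq", ISQ_FALLBACK[bits][backend])


if __name__ == "__main__":
    print(resolve_quant(4, "metal", uqff_published=False))
    print(resolve_quant(4, "cuda_cpu", uqff_published=True))
```

Note that shorthand 5 maps to Q5K on both backends, so the Metal column is not uniformly AFQ.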

Explicit format names (e.g., --quant q4k, --quant afq8) request that format. Use --isq only when you want to force runtime ISQ and skip the UQFF lookup.

| Format    | When to use |
|-----------|-------------|
| UQFF      | Native pre-quantized format. Loaded automatically by --quant when a sibling UQFF repo exists, or directly via --from-uqff. See UQFF guide. |
| GGUF      | Loaded via --format gguf -f <file>. |
| GPTQ, AWQ | Loaded directly with --format plain when the source repo is pre-quantized. |

Bit width is chosen independently of format. Fewer bits produce a smaller model, generally at some cost in output quality.

Supported widths: 2, 3, 4, 5, 6, 8. For the full matrix of bit widths supported by each format, see the quantization reference.
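To get a rough feel for how bit width affects size, the nominal weight storage is the parameter count times bits per weight, divided by 8. The sketch below applies that arithmetic to the supported widths. The 7-billion-parameter count is a hypothetical example and `nominal_weight_bytes` is an illustrative helper, not part of mistral.rs. Real files are somewhat larger because block formats store scale metadata and some layers may stay unquantized.

```python
SUPPORTED_WIDTHS = (2, 3, 4, 5, 6, 8)


def nominal_weight_bytes(num_params: int, bits: int) -> float:
    """Nominal storage for the weights alone: params * bits / 8 bytes.

    Ignores per-block scales, unquantized layers, and runtime memory
    such as the KV cache, so treat the result as a lower bound.
    """
    if bits not in SUPPORTED_WIDTHS:
        raise ValueError(f"unsupported width: {bits}")
    return num_params * bits / 8


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for bits in SUPPORTED_WIDTHS:
        gb = nominal_weight_bytes(params, bits) / 1e9
        print(f"{bits}-bit: ~{gb:.2f} GB")
```

Under these assumptions, moving from 8 bits to 4 bits halves the nominal weight footprint.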

mistralrs tune -m <model> recommends a quantization for the host it runs on. See the auto-tune guide.