Use pre-quantized UQFF models

UQFF (Universal Quantized File Format) stores pre-quantized weights and loads directly without runtime conversion.

mistralrs run -m <repo> --from-uqff q4k-0.uqff

-m <repo> is required for tokenizer/base resolution. --from-uqff accepts a numeric shorthand (2, 3, 4, 5, 6, 8) or an ISQ (in-situ quantization) type name (q4k, afq8, etc.) in place of a filename.

For locally-stored UQFF files, -m can be the local directory and --from-uqff the filename.
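The three ways of pointing --from-uqff at a file can be compared side by side. This is a minimal sketch under stated assumptions: the repository name your-org/your-model-UQFF and the directory ./models/my-uqff are hypothetical placeholders, and the MISTRALRS variable is a dry-run convenience introduced here, not a feature of the CLI. By default the script only prints each command; set MISTRALRS=mistralrs to execute them.

```shell
# Dry-run by default: print each command instead of executing it.
# MISTRALRS is a helper variable for this sketch, not a CLI feature.
MISTRALRS="${MISTRALRS:-echo mistralrs}"

# run_uqff MODEL UQFF: load MODEL with the given --from-uqff value.
run_uqff() {
    $MISTRALRS run -m "$1" --from-uqff "$2"
}

# Hypothetical repository, selected by numeric shorthand.
run_uqff your-org/your-model-UQFF 4
# Same repository, selected by ISQ type name.
run_uqff your-org/your-model-UQFF q4k
# Hypothetical local directory plus an explicit filename.
run_uqff ./models/my-uqff q4k-0.uqff
```

The wrapper relies on word splitting of the unquoted $MISTRALRS, which holds in sh and bash but not in zsh's default mode.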

For sharded UQFFs, pass the first shard’s filename. Shards share a common prefix and end in -0, -1, … (e.g. q4k-0.uqff, q4k-1.uqff); discovery matches the trailing numeric suffix.
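To check which files belong to a shard set before loading it, the naming convention above can be applied by hand. The helper below is only an illustration of that convention (common prefix, then a trailing -<number> before .uqff); list_shards is a hypothetical name, and this is not the loader's actual discovery code.

```shell
# list_shards DIR FIRST_SHARD: print every file in DIR that shares the
# first shard's prefix and ends in -<number>.uqff, in numeric order.
list_shards() {
    dir=$1
    first=$2
    prefix=${first%-*.uqff}            # "q4k-0.uqff" -> "q4k"
    for f in "$dir/$prefix"-*.uqff; do
        [ -e "$f" ] || continue        # glob matched nothing
        name=${f##*/}                  # strip the directory part
        idx=${name#"$prefix"-}         # drop "<prefix>-"
        idx=${idx%.uqff}               # drop ".uqff"
        case $idx in
            ''|*[!0-9]*) continue ;;   # skip non-numeric suffixes
        esac
        printf '%s %s\n' "$idx" "$name"
    done | sort -n | cut -d' ' -f2-
}

# Example call (prints nothing if the directory has no matching files):
# list_shards ./models/my-uqff q4k-0.uqff
```

Sorting on the extracted index rather than on the filename keeps q4k-10.uqff after q4k-2.uqff.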

UQFF models work under tensor parallelism: each rank loads only its slice of the quantized weights.

Full example: Rust, multimodal.

The quantize subcommand converts an unquantized model to UQFF:

mistralrs quantize \
-m google/gemma-4-E4B-it \
--isq q4k \
-o gemma-q4k.uqff

Quantization is a one-time operation; the resulting UQFF file loads directly afterward.

--isq can be repeated or comma-separated to produce multiple variants in one run; pass a directory as -o in that case. Numeric shorthands expand to all platform variants (--isq 4 writes both afq4.uqff and q4k.uqff).
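As a sketch of the two multi-variant spellings, the commands below request q4k and afq8 in one run and write into a directory. The output directory ./gemma-uqff is a hypothetical path, and the MISTRALRS variable is a dry-run convenience introduced for this example (the commands are printed unless it is set to mistralrs).

```shell
# Dry-run by default: print each command instead of executing it.
MISTRALRS="${MISTRALRS:-echo mistralrs}"

# Repeated form: one --isq flag per variant.
quantize_repeated() {
    $MISTRALRS quantize \
        -m google/gemma-4-E4B-it \
        --isq q4k --isq afq8 \
        -o "$1"
}

# Comma-separated form: the same two variants in a single flag.
quantize_comma() {
    $MISTRALRS quantize \
        -m google/gemma-4-E4B-it \
        --isq q4k,afq8 \
        -o "$1"
}

# ./gemma-uqff is a hypothetical output directory.
quantize_repeated ./gemma-uqff
quantize_comma ./gemma-uqff
```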

When write_uqff is used from the Rust or Python SDK and the session keeps serving afterward, the in-memory model runs with the first requested quantization type.

A topology can pin specific layers to a different type (e.g. keep lm_head at q8_0 in an otherwise Q4K file); pins are preserved in every output variant.

In directory mode a README model card is generated unless --no-readme is passed; --uqff-base-model and --uqff-repo-id fill in its fields without the interactive prompt.

K-quant output quality can be improved with an importance matrix: pass either --imatrix <file> (llama.cpp .imatrix files work directly) or --calibration-file <path> to the quantize subcommand. See imatrix background.
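A minimal sketch of an imatrix-guided run, reusing the model from the earlier quantize example. The file ./gemma.imatrix and the output name are hypothetical, and the MISTRALRS variable is a dry-run convenience introduced here so the command is printed rather than executed by default.

```shell
# Dry-run by default: print the command instead of executing it.
MISTRALRS="${MISTRALRS:-echo mistralrs}"

# quantize_with_imatrix IMATRIX_FILE OUTPUT: Q4K quantization guided by
# a precomputed importance matrix.
quantize_with_imatrix() {
    $MISTRALRS quantize \
        -m google/gemma-4-E4B-it \
        --isq q4k \
        --imatrix "$1" \
        -o "$2"
}

# ./gemma.imatrix is a hypothetical llama.cpp importance-matrix file.
quantize_with_imatrix ./gemma.imatrix gemma-q4k-imatrix.uqff
```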

All quantize flags: CLI reference.

Layout and versioning: UQFF format reference.