Use pre-quantized UQFF models

UQFF (Universal Quantized File Format) stores pre-quantized weights and loads directly without runtime conversion.

Using a UQFF model

mistralrs run -m <repo> --from-uqff q4k-0.uqff

-m <repo> is required for tokenizer/base resolution. --from-uqff accepts a numeric shorthand (2, 3, 4, 5, 6, 8) or an ISQ (in-situ quantization) type name (q4k, afq8, etc.) in place of a filename.

For locally-stored UQFF files, -m can be the local directory and --from-uqff the filename.

In the Python SDK, the from_uqff argument on Which (here Which.Plain) takes the shard filename (or a list of shard filenames):

from mistralrs import Runner, Which

runner = Runner(
    which=Which.Plain(
        model_id="<repo>",
        from_uqff="q4k-0.uqff",
    ),
)

In the Rust SDK, UqffTextModelBuilder takes the base repo and the first shard:

use mistralrs::UqffTextModelBuilder;

let model = UqffTextModelBuilder::new("<repo>", vec!["q4k-0.uqff".into()])
    .build()
    .await?;

For sharded UQFFs, pass the first shard's filename. Shards share a common prefix and end in -0, -1, ... (e.g. q4k-0.uqff, q4k-1.uqff); discovery matches the trailing numeric suffix.
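The suffix matching described above can be sketched in a few lines of Python. This is an illustration, not the loader's actual implementation: the function name sibling_shards and the exact regular expression are assumptions made for the example.

```python
import re

# Assumed naming pattern for this sketch: "<prefix>-<index>.uqff", e.g. "q4k-0.uqff".
_SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d+)\.uqff$")


def sibling_shards(first_shard, available):
    """Return the shards sharing first_shard's prefix, ordered by numeric suffix."""
    match = _SHARD_RE.match(first_shard)
    if match is None:
        # No trailing numeric suffix: treat the file as a single, unsharded UQFF.
        return [first_shard]
    prefix = match.group("prefix")
    found = []
    for name in available:
        m = _SHARD_RE.match(name)
        if m is not None and m.group("prefix") == prefix:
            found.append((int(m.group("index")), name))
    # Sort numerically so that "-10" follows "-9" rather than "-1".
    return [name for _, name in sorted(found)]


files = ["q4k-1.uqff", "afq4-0.uqff", "q4k-0.uqff", "q4k-10.uqff", "README.md"]
print(sibling_shards("q4k-0.uqff", files))
# ['q4k-0.uqff', 'q4k-1.uqff', 'q4k-10.uqff']
```

Comparing the suffix as an integer rather than as a string keeps shards in load order once a model has ten or more of them.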

UQFF models work under tensor parallelism: each rank loads only its slice of the quantized weights.

Full example: Rust, multimodal.

Producing a UQFF

The quantize subcommand converts an unquantized model to UQFF:

mistralrs quantize \
  -m google/gemma-4-E4B-it \
  --isq q4k \
  -o gemma-q4k.uqff

Quantization is a one-time operation; the resulting file loads directly afterward.

--isq can be repeated or comma-separated to produce multiple variants in one run; pass a directory as -o in that case. Numeric shorthands expand to all platform variants (--isq 4 writes both afq4.uqff and q4k.uqff).

When write_uqff is used from the Rust or Python SDK and the session keeps serving, the in-memory model runs as the first requested type.

A topology can pin specific layers to a different type (e.g. keep lm_head at q8_0 in an otherwise Q4K file); pins are preserved in every output variant.

Sensitive token embeddings and output heads follow the higher-precision defaults in the quantization type reference. For example, AFQ4 uses AFQ6 for those tensors, Q4K uses Q6K, and Q6K uses Q8_0. The filename still names the default model type, and uqff_report.json records the effective type for each tensor. Explicit topology pins take precedence. The active model loader declares the exact tensors covered by this policy; similarly named tensors in vision, audio, or auxiliary subtrees are not promoted implicitly.
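The precedence described above (explicit pin first, then the higher-precision default for sensitive tensors, then the file's default type) can be sketched as follows. The table holds only the three promotions quoted in this section; the function name effective_type and its parameters are hypothetical, and the complete policy lives in the quantization type reference.

```python
# Promotions quoted in the text above. Types not listed here are left
# unchanged in this sketch, which may differ from the real policy.
SENSITIVE_PROMOTION = {"afq4": "afq6", "q4k": "q6k", "q6k": "q8_0"}


def effective_type(default_type, is_sensitive, pin=None):
    """Resolve the type a tensor would be stored as under the stated rules."""
    if pin is not None:
        # An explicit topology pin takes precedence over everything else.
        return pin
    if is_sensitive:
        # Token embeddings and output heads get the higher-precision default.
        return SENSITIVE_PROMOTION.get(default_type, default_type)
    return default_type


print(effective_type("q4k", is_sensitive=False))             # q4k
print(effective_type("q4k", is_sensitive=True))              # q6k
print(effective_type("q4k", is_sensitive=True, pin="q8_0"))  # q8_0
```

Whether a tensor counts as sensitive is decided by the active model loader, so is_sensitive is passed in here rather than inferred from the tensor name.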

In directory mode a README model card is generated unless --no-readme is passed; --uqff-base-model and --uqff-repo-id fill in its fields without the interactive prompt.

quantize also writes uqff_report.json beside the UQFF files. The report is informational: loaders ignore it, but it records the generated variants, shard names, stored layer formats, producer version, and any fallback layers. Existing UQFF repos do not need to be regenerated. Instead, you can backfill a report by scanning the artifacts:

mistralrs uqff report -m gemma4_26b_a4b/ \
  --write \
  --base-model google/gemma-4-26B-A4B-it \
  --repo-id mistralrs-community/gemma-4-26B-A4B-it

For Hugging Face repos, use the same model-id form and select a group with --quant when desired:

mistralrs uqff report -m mistralrs-community/gemma-4-26B-A4B-it --quant afq3 --json

The uqff report, uqff verify, and uqff inspect commands are metadata-only. For local paths or already cached Hugging Face artifacts they seek-read only the needed byte ranges. For remote Hugging Face repos they use HTTP byte-range requests for safetensors headers and small UQFF metadata tensors, so they do not download the full model weights.
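To show why a metadata-only read is cheap, here is a minimal sketch of reading just the header of a safetensors file from local disk. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header) and is not the code these commands run; the function name and the demo tensor are made up for the example.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Read only the JSON header of a safetensors file, not the tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping tensor names to dtype, shape, offsets.
        return json.loads(f.read(header_len))


# Build a tiny stand-in file: one 4-byte U8 tensor named "demo" (hypothetical name).
header = {"demo": {"dtype": "U8", "shape": [4], "data_offsets": [0, 4]}}
header_bytes = json.dumps(header).encode("utf-8")
with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as tmp:
    tmp.write(struct.pack("<Q", len(header_bytes)) + header_bytes + b"\x00" * 4)

parsed = read_safetensors_header(tmp.name)
os.remove(tmp.name)
print(parsed["demo"]["shape"])  # [4]
```

The same two-step read maps onto HTTP byte-range requests for remote files: one request for the first 8 bytes, then one for the header range, leaving the tensor data untouched.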

Before publishing, validate the structure:

mistralrs uqff verify -m gemma4_26b_a4b/

To browse tensors interactively:

mistralrs uqff inspect -m mistralrs-community/gemma-4-26B-A4B-it --quant afq3

K-quant output quality can be improved with an importance matrix: pass --imatrix <file> (llama.cpp .imatrix files work directly) or --calibration-file <path> to the quantize subcommand. See imatrix background.

All quantize flags: CLI reference.

Format details

Layout and versioning: UQFF format reference.