UQFF format
UQFF is the native mistral.rs quantized file format. To use UQFF models, see the UQFF guide; knowledge of the layout is not required.
File structure
A UQFF export is a directory containing:
- One or more <stem>-<shard>.uqff shards holding the quantized layers.
- residual.safetensors for unquantized tensors (token embeddings, norms, etc.).
- Model assets copied from the source repo so the directory is self-contained: config.json, tokenizer.json, tokenizer_config.json, generation_config.json, and (when present) modules.json, chat_template.jinja, processor_config.json, preprocessor_config.json.
A loader is pointed at one or more shard files (from_uqff); the residual safetensors and the JSON assets are picked up by sibling-path lookup.
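As a rough illustration of the directory layout, the sketch below checks an export directory before handing it to a loader. It is not part of mistral.rs: the function name check_export_dir is hypothetical, and the split between hard errors (no shard, missing residual.safetensors or config.json) and warnings (other assets) is an assumption based on the caveats at the end of this page.

```python
from pathlib import Path

# Files a shard cannot be loaded without (see Caveats).
REQUIRED = ("residual.safetensors", "config.json")
# Assets normally copied from the source repo; treated as warnings here (assumption).
EXPECTED = ("tokenizer.json", "tokenizer_config.json", "generation_config.json")


def check_export_dir(path):
    """Return (errors, warnings) describing what is missing from a UQFF export directory."""
    root = Path(path)
    errors, warnings = [], []

    # At least one <stem>-<shard>.uqff file must be present.
    if not any(root.glob("*.uqff")):
        errors.append("no .uqff shard found")

    for name in REQUIRED:
        if not (root / name).is_file():
            errors.append(f"missing {name}")

    for name in EXPECTED:
        if not (root / name).is_file():
            warnings.append(f"missing {name}")

    return errors, warnings
```

An empty errors list means the directory has the files the sibling-path lookup relies on; it says nothing about whether their contents are valid.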
Shard layout
Each .uqff shard is a standard safetensors file with named entries. Every quantized layer is self-describing:
- <key>.weight: the layer data (raw blocks for GGML-family types, packed tensors for AFQ/MXFP4/FP8, or a native safetensors tensor for unquantized fallback layers; see quantization types).
- <key>.weight.format: a u8 tag naming the quantization family, used to dispatch the deserializer.
- Family-specific metadata next to it, e.g. <key>.weight.dtype and <key>.weight.shape for GGML types, <key>.weight.scales / .bits / .group_size for AFQ.
- <key>.bias when the layer has one.
Safetensors metadata includes informational producer fields: uqff.producer, uqff.producer.mistralrs.version, and uqff.producer.mistralrs.git_revision. These are recorded for provenance and are not checked by the loader.
<key> is the layer’s weight path (model.layers.0.self_attn.q_proj). MoE (Mixture of Experts) expert layers use three canonical keys per block: <...>.experts.gate_proj, .up_proj, .down_proj, each holding the stacked [num_experts, out, in] weights.
Because every layer self-describes, a single file may mix quantization types. Two cases produce a mixed file:
- Topology-pinned layers (assigned a specific type by a topology config) keep their pinned type.
- Layers whose shape cannot support the requested type fall back per-layer. For example, AFQ layers whose input dimension is not divisible by the AFQ group size are stored unquantized.
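Because a shard is a standard safetensors file, its entry names can be listed without any mistral.rs code. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON object) and groups entries by layer key using the .weight and .bias suffixes described in this section. The function names are hypothetical, and the sketch does not decode the format tag or any tensor data.

```python
import json
import struct


def read_safetensors_header(path):
    """Parse the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian u64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Free-form string metadata (e.g. the uqff.producer fields) lives under this key.
    metadata = header.pop("__metadata__", {})
    return header, metadata


def group_by_layer(entry_names):
    """Map each layer key to the sorted list of its fields (weight, weight.format, bias, ...)."""
    layers = {}
    for name in entry_names:
        if name.endswith(".bias"):
            key, field = name[: -len(".bias")], "bias"
        else:
            key, sep, rest = name.rpartition(".weight")
            if not sep or (rest and not rest.startswith(".")):
                continue  # not a layer entry, e.g. uqff.version.major
            field = "weight" + rest
        layers.setdefault(key, []).append(field)
    return {key: sorted(fields) for key, fields in layers.items()}


# Usage (the file name is only an example):
# entries, metadata = read_safetensors_header("q4k-0.uqff")
# layers = group_by_layer(entries)
```

Comparing the field lists of different layers is a quick way to spot a mixed file: an AFQ layer carries scales, bits, and group_size entries, while a GGML-family layer carries dtype and shape.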
Sharding
The writer splits the tensor stream into <stem>-0.uqff, <stem>-1.uqff, … with a soft cap of 10 GiB per shard. Multiple ISQ (in-situ quantization) types in one run produce one shard set per type (q4k-0.uqff, afq4-0.uqff, …) sharing the residual and assets.
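When a directory holds several shard sets, the file names alone are enough to group them. The following sketch assumes only the <stem>-<index>.uqff naming shown above; the helper name group_shards is hypothetical and this is not how the mistral.rs loader discovers shards.

```python
import re
from collections import defaultdict

# <stem>-<index>.uqff, where the index is the trailing run of digits.
SHARD_RE = re.compile(r"^(?P<stem>.+)-(?P<index>\d+)\.uqff$")


def group_shards(filenames):
    """Group shard file names by stem, ordering each set by numeric shard index."""
    sets = defaultdict(list)
    for name in filenames:
        m = SHARD_RE.match(name)
        if m:  # ignore residual.safetensors, JSON assets, etc.
            sets[m["stem"]].append((int(m["index"]), name))
    # Sort numerically so that index 10 comes after index 2.
    return {stem: [name for _, name in sorted(items)] for stem, items in sets.items()}
```

Sorting by the parsed integer rather than by string keeps sets with ten or more shards in write order.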
Version compatibility
Each shard set carries three u32 scalar entries: uqff.version.major, uqff.version.minor, uqff.version.patch. Readers reject a different major version and reject a minor newer than they support; older minors within the same major are accepted. Files without version entries are rejected.
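The acceptance rule can be written as a small predicate. This is a minimal sketch of the rule as stated here, not the reader's actual code: the name is_readable is hypothetical, and ignoring the patch number is an assumption, since the rule above only mentions major and minor.

```python
from typing import Optional, Tuple


def is_readable(file_version: Optional[Tuple[int, int, int]],
                supported: Tuple[int, int]) -> bool:
    """Decide whether a reader supporting (major, minor) accepts a file's version."""
    if file_version is None:
        return False  # files without version entries are rejected
    major, minor, _patch = file_version  # patch is not consulted (assumption)
    supported_major, supported_minor = supported
    if major != supported_major:
        return False  # a different major version is always rejected
    # Same major: accept the supported minor and anything older.
    return minor <= supported_minor
```

For example, a reader that supports 1.1 accepts a 1.0 file, while a reader that supports only 1.0 rejects a 1.1 file.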
UQFF 1.1 adds inline unquantized linear entries (weight.format = Unquant) so mixed files can preserve unsupported layer shapes without moving those weights into residual.safetensors.
Tensor parallelism
Shards store full tensors; under tensor parallelism each rank slices its portion at load time. Slicing the packed (input) dim requires block alignment, which holds for typical model dims. When alignment does not hold (some expert layers), the rank replicates the full tensor instead of slicing.
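The slice-or-replicate decision can be sketched as follows. The exact alignment test used by mistral.rs is not given here, so this sketch assumes an even split across ranks where each rank's slice must be a whole number of quantization blocks; plan_rank_slice and its parameters are hypothetical names.

```python
def plan_rank_slice(dim, block_size, rank, world_size):
    """Return (start, end) for this rank along the packed dim, or None to replicate.

    Assumes an even split across ranks and that a slice is usable only if its
    length is a multiple of block_size (illustrative criterion, not the real one).
    """
    if dim % world_size != 0:
        return None  # cannot split evenly: keep the full tensor on every rank
    per_rank = dim // world_size
    if per_rank % block_size != 0:
        return None  # slice would cut through a block: replicate instead
    return rank * per_rank, (rank + 1) * per_rank
```

With dim=4096, block_size=256, and four ranks, rank 1 gets the range (1024, 2048). With dim=768 and two ranks, each slice would be 384 elements, which is not a multiple of 256, so the function returns None and the tensor is replicated.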
Reference implementation
Canonical implementations: mistralrs-quant/src/uqff (reader, tensor encoding) and mistralrs-core/src/pipeline/isq.rs (writer).
Caveats
- UQFF is inference-only; no optimizer state or training metadata.
- The export directory is the unit of distribution. A shard alone is not loadable: the residual safetensors and config.json are required.