Use pre-quantized UQFF models
UQFF (Universal Quantized File Format) stores pre-quantized weights and loads directly without runtime conversion.
Using a UQFF model
    mistralrs run -m <repo> --from-uqff q4k-0.uqff

-m <repo> is required for tokenizer/base resolution. --from-uqff accepts a numeric shorthand
(2, 3, 4, 5, 6, 8) or an ISQ (in-situ quantization)
type name (q4k, afq8, etc.) in place of a filename.
For locally-stored UQFF files, -m can be the local directory and --from-uqff the filename.
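The three kinds of value that --from-uqff accepts can be told apart roughly as in the sketch below. This is an illustration only: the function name classify_from_uqff and the ".uqff" suffix check are assumptions made for the example, and the CLI's actual resolution logic may differ.

```python
# Numeric shorthands listed in the text above.
NUMERIC_SHORTHANDS = {"2", "3", "4", "5", "6", "8"}


def classify_from_uqff(value: str) -> str:
    """Guess which kind of --from-uqff value was given (hypothetical helper)."""
    if value in NUMERIC_SHORTHANDS:
        return "numeric shorthand"
    # Assumption: anything ending in ".uqff" is meant as a filename.
    if value.endswith(".uqff"):
        return "filename"
    # Everything else is treated as an ISQ type name such as q4k or afq8.
    return "ISQ type name"


for value in ("4", "afq8", "q4k-0.uqff"):
    print(value, "->", classify_from_uqff(value))
```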
In Python, from_uqff on Which takes the shard filename (or a list of shard filenames):

    from mistralrs import Runner, Which

    runner = Runner(
        which=Which.Plain(
            model_id="<repo>",
            from_uqff="q4k-0.uqff",
        ),
    )

In Rust, UqffTextModelBuilder takes the base repo and the first shard:
    use mistralrs::UqffTextModelBuilder;

    // Runs inside an async function that returns a Result.
    let model = UqffTextModelBuilder::new("<repo>", vec!["q4k-0.uqff".into()])
        .build()
        .await?;

For sharded UQFFs, pass the first shard’s filename. Shards share a common prefix and end in
-0, -1, … (e.g. q4k-0.uqff, q4k-1.uqff); discovery matches the trailing numeric suffix.
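The naming convention can be illustrated with a small matcher. This is a minimal sketch of suffix-based grouping, assuming names of the form "<prefix>-<index>.uqff"; the helper sibling_shards is hypothetical and is not the loader's own discovery code.

```python
import re

# Assumed naming scheme from the text: "<prefix>-<index>.uqff".
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d+)\.uqff$")


def sibling_shards(first_shard: str, available: list[str]) -> list[str]:
    """Return the files in `available` sharing first_shard's prefix, ordered by index."""
    match = SHARD_RE.match(first_shard)
    if match is None:
        # The name has no numeric suffix, so treat it as a single unsharded file.
        return [first_shard]
    prefix = match.group("prefix")
    found = []
    for name in available:
        other = SHARD_RE.match(name)
        if other and other.group("prefix") == prefix:
            found.append((int(other.group("index")), name))
    # Sort numerically so that shard 10 comes after shard 2.
    return [name for _, name in sorted(found)]


files = ["q4k-1.uqff", "q8_0-0.uqff", "q4k-0.uqff", "q4k-10.uqff", "q4k-2.uqff"]
print(sibling_shards("q4k-0.uqff", files))
```

Sorting on the parsed integer rather than on the filename keeps the shards in load order once there are ten or more of them.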
UQFF models work under tensor parallelism: each rank loads only its slice of the quantized weights.
Full example: Rust, multimodal.
Producing a UQFF
The quantize subcommand converts an unquantized model to UQFF:

    mistralrs quantize \
      -m google/gemma-4-E4B-it \
      --isq q4k \
      -o gemma-q4k.uqff

A one-time operation. The result loads directly afterward.
--isq can be repeated or comma-separated to produce multiple variants in one run; pass a
directory as -o in that case. Numeric shorthands expand to all platform variants (--isq 4
writes both afq4.uqff and q4k.uqff).
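When scripting a multi-variant run, the command line can be assembled as in the sketch below. It uses only the flags described above (-m, a comma-separated --isq, and a directory for -o). The helper quantize_command and the output directory name gemma-uqff are hypothetical, chosen for the example.

```python
import shlex


def quantize_command(model_id: str, isq_types: list[str], out_dir: str) -> list[str]:
    """Build an argv list for a multi-variant `mistralrs quantize` run (hypothetical helper)."""
    if not isq_types:
        raise ValueError("at least one ISQ type is required")
    return [
        "mistralrs", "quantize",
        "-m", model_id,
        # Several types in one run are passed comma-separated.
        "--isq", ",".join(isq_types),
        # With more than one variant, -o names a directory rather than a file.
        "-o", out_dir,
    ]


cmd = quantize_command("google/gemma-4-E4B-it", ["q4k", "afq8"], "gemma-uqff")
# Print a shell-ready form; pass `cmd` to subprocess.run to actually execute it.
print(shlex.join(cmd))
```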
When write_uqff is used from the Rust or Python SDK with several types requested and the
session keeps serving afterward, the in-memory model runs as the first requested type.
A topology can pin specific layers to a different type
(e.g. keep lm_head at q8_0 in an otherwise Q4K file); pins are preserved in every output
variant.
In directory mode a README model card is generated unless --no-readme is passed;
--uqff-base-model and --uqff-repo-id fill in its fields without the interactive prompt.
K-quant output quality can be improved with an importance matrix: pass --imatrix <file>
(llama.cpp .imatrix files work directly) or --calibration-file <path> to quantize. See
imatrix background.
All quantize flags: CLI reference.
Format details
Layout and versioning: UQFF format reference.