Online calibration
Online calibration observes the activations of a live model quantized with ISQ (in-situ quantization), then requantizes every layer from the original weights using an importance matrix derived from that traffic. The layers are hot-swapped in place with no restart: the model serves normally while collecting, and requests received during the apply step queue until it finishes.
Quantized this way, a model is measurably closer to its full-precision outputs on the distribution it actually serves, at the same bit width and speed.
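To make the idea of importance-weighted requantization concrete, here is a toy sketch in Python. It is not the K-quant code used by the server: the symmetric integer grid, the scale search, and all function names are assumptions chosen for illustration. It estimates per-column importance as the mean squared activation and then picks the quantization scale that minimizes the importance-weighted error instead of the plain error.

```python
def quantize_row(weights, scale, levels=7):
    """Round each weight to the nearest multiple of `scale`, clamped to [-levels, levels]."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequantized, importance):
    """Importance-weighted squared error between original and dequantized weights."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def column_importance(activations):
    """Mean squared activation per input column (an assumed, common importance estimate)."""
    rows = len(activations)
    cols = len(activations[0])
    return [sum(row[c] ** 2 for row in activations) / rows for c in range(cols)]


def best_scale(weights, importance, levels=7, steps=64):
    """Search scales from 0.5x to 1.5x the naive scale; keep the lowest weighted error."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return 1.0  # all-zero row: any scale reproduces it exactly
    naive = max_abs / levels
    candidates = [naive * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_row(weights, s, levels), importance),
    )


# Toy data: column 1 carries most of the observed activation energy.
weights = [0.9, -0.31, 0.12, 0.05]
activations = [[0.1, 2.0, 0.1, 0.1], [0.2, 1.5, 0.0, 0.1]]
importance = column_importance(activations)
naive_scale = max(abs(w) for w in weights) / 7
tuned_scale = best_scale(weights, importance)
```

Because the naive scale is one of the candidates, the tuned scale can never do worse on the weighted error. Columns that real traffic exercises heavily pull the choice toward representing their weights accurately.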
Serve any model with ISQ:
```shell
mistralrs serve -m <model> --isq q4k
```

Then drive the lifecycle on the surface of your choice. There is no CLI command for the lifecycle itself; it is driven over HTTP or from an SDK against the running server.
```shell
# begin observing live traffic; collection adds some decode overhead while on, and on CUDA,
# MoE (Mixture of Experts) models additionally run their reference expert path during collection
curl -X POST localhost:1234/calibration/start

# check per-layer collection progress
curl localhost:1234/calibration/status

# requantize from the source weights with the collected statistics and hot-swap
curl -X POST localhost:1234/calibration/apply \
  -H "Content-Type: application/json" \
  -d '{"save_cimatrix": "traffic.cimatrix"}'
```

`status` reports how many layers are collecting and the token rows seen per layer. `apply` harvests the statistics, requantizes, and returns the pre-apply status. The optional `save_cimatrix` writes the collected importance matrix for reuse with `--imatrix`.
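The three HTTP calls can also be issued from a script. The sketch below builds the same requests as the curl commands using only the Python standard library. The helper names `calibration_request` and `send` are hypothetical. Sending an empty JSON object when `save_cimatrix` is omitted, and decoding every reply as JSON, are both assumptions about the server rather than documented behavior.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234"  # address used in the curl examples


def calibration_request(step, save_cimatrix=None, base_url=BASE_URL):
    """Build (but do not send) the request for one lifecycle step."""
    url = f"{base_url}/calibration/{step}"
    if step == "status":
        return urllib.request.Request(url, method="GET")
    if step == "start":
        return urllib.request.Request(url, method="POST")
    if step == "apply":
        # Assumption: an empty JSON object is accepted when no path is given.
        body = {} if save_cimatrix is None else {"save_cimatrix": save_cimatrix}
        return urllib.request.Request(
            url,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
    raise ValueError(f"unknown calibration step: {step!r}")


def send(request, timeout=600):
    """Send a request and decode the reply (assumes the server answers with JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical run would call `send(calibration_request("start"))`, poll `send(calibration_request("status"))` while traffic flows, and finish with `send(calibration_request("apply", "traffic.cimatrix"))`. The generous timeout on `send` reflects that requests queue while the apply step runs.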
The same lifecycle is exposed on Model:
```rust
model.begin_calibration().await?;
// ... serve traffic ...
let status = model.calibration_status().await?;
model.apply_calibration(Some("traffic.cimatrix".into())).await?;
```

Each method has a `_with_model` variant for multi-model setups. See the full example.
In Python, the same calls are made on the runner:

```python
runner.begin_calibration()
# ... serve traffic ...
status = runner.calibration_status()
runner.apply_calibration(save_cimatrix="traffic.cimatrix")
```

`calibration_status` returns a `CalibrationStatus` with `collecting`, `layers`, `layers_tracking`, `total_rows`, `min_rows`, and `max_rows` fields. See the full example.
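One way to decide when to call apply is to poll the status until the least-covered layer has seen enough rows. The sketch below is a minimal illustration. `StatusSnapshot` is a hypothetical stand-in that only mirrors the field names listed above. Reading `min_rows` as the smallest per-layer row count is an assumption, and the 4096-row threshold is an arbitrary example, not a recommendation from the project.

```python
import time
from dataclasses import dataclass


@dataclass
class StatusSnapshot:
    """Hypothetical stand-in with the same field names as CalibrationStatus."""
    collecting: bool
    layers: int
    layers_tracking: int
    total_rows: int
    min_rows: int
    max_rows: int


def ready_to_apply(status, min_rows_target):
    """True once collection is on and the least-covered layer has enough rows."""
    return (
        bool(status.collecting)
        and status.layers_tracking > 0
        and status.min_rows >= min_rows_target
    )


def wait_until_ready(get_status, min_rows_target=4096, poll_seconds=5.0,
                     max_polls=720, sleep=time.sleep):
    """Poll get_status() until ready; return that status, or None if polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if ready_to_apply(status, min_rows_target):
            return status
        sleep(poll_seconds)
    return None
```

With a live runner this could be used as `wait_until_ready(runner.calibration_status)` followed by `runner.apply_calibration(save_cimatrix="traffic.cimatrix")`. Passing the status getter as a callable keeps the helper usable with the HTTP endpoint as well.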
Collection costs nothing until started, and decode returns to full speed after apply.
Requirements and behavior
- The model must have been loaded with `--isq` from source weights (safetensors); `start` errors otherwise (including models loaded `--from-uqff`).
- Importance weighting applies to the K-quant types (`Q2K`-`Q6K`). GGUF-family and AFQ types collect and requantize; HQQ and FP8 ISQ types do not support collection, so `start` errors.
- Pre-quantized source checkpoints (FP8, GPTQ, BnB) requantize from the resident weights, not the source files.
- Layers whose weights cannot be re-read exactly (matformer slicing, rank-sharded fused expert halves) requantize from the resident weights instead, logged at apply time.
- MoE expert stacks are rebuilt from the checkpoint in any supported layout.
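The type rules in the list above can be summarized as a small pre-flight check before calling start. This is only a sketch: the function name and return labels are hypothetical, and the spellings matched for the HQQ, FP8, AFQ, and GGUF-family types are assumptions based on their names in the list, so verify them against the ISQ types your build actually accepts.

```python
K_QUANTS = {"Q2K", "Q3K", "Q4K", "Q5K", "Q6K"}  # the Q2K-Q6K range from the list


def calibration_support(isq_type):
    """Classify an ISQ type name according to the rules listed above.

    Returns "importance-weighted", "collects", "unsupported", or "unknown".
    """
    name = isq_type.upper().replace("_", "")
    if name in K_QUANTS:
        return "importance-weighted"
    # Assumed spellings for the HQQ and FP8 families; start errors for these.
    if name.startswith(("HQQ", "FP8", "F8")):
        return "unsupported"
    # Assumed spellings: AFQ types and the remaining GGUF-family Q types.
    if name.startswith(("AFQ", "Q")):
        return "collects"
    return "unknown"
```

For example, `calibration_support("q4k")` falls in the importance-weighted group, matching the `--isq q4k` value used when serving.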