Online calibration

Online calibration observes the activations of a live model quantized with ISQ (in-situ quantization), then requantizes every layer from the original weights using an importance matrix derived from that traffic. The layers are hot-swapped in place with no restart: the model serves normally while collecting, and requests received during the apply step queue until it finishes.

Quantized this way, a model is measurably closer to its full-precision outputs on the distribution it actually serves, at the same bit width and speed.

Serve any model with ISQ:

mistralrs serve -m <model> --isq q4k

Then drive the calibration lifecycle over HTTP or from an SDK against the running server. There is no CLI command for the lifecycle itself.

# begin observing live traffic; collection adds some decode overhead while on, and on CUDA,
# MoE (Mixture of Experts) models additionally run their reference expert path during collection
curl -X POST localhost:1234/calibration/start
# check per-layer collection progress
curl localhost:1234/calibration/status
# requantize from the source weights with the collected statistics and hot-swap
curl -X POST localhost:1234/calibration/apply \
  -H "Content-Type: application/json" \
  -d '{"save_cimatrix": "traffic.cimatrix"}'

The status endpoint reports how many layers are collecting and the token rows seen per layer. The apply endpoint harvests the statistics, requantizes, and returns the status as it stood before the apply. The optional save_cimatrix field writes the collected importance matrix to a file for reuse with --imatrix.
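The same three calls can be scripted. The following is a minimal Python sketch using only the standard library, not an official client. The endpoint paths, port, and save_cimatrix field come from the curl commands above; the helper names (build_request, send, run_lifecycle), the timeout, and the fixed wait are hypothetical. The response schema is not described here, so the sketch returns whatever the server sends without assuming field names, and it assumes an empty JSON object is accepted by apply when save_cimatrix is omitted.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:1234"  # host and port as in the curl commands above


def build_request(path, method="GET", payload=None):
    """Build (but do not send) a request for one of the calibration endpoints."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def send(request, timeout=600):
    """Send a request; return parsed JSON if the body is JSON, else raw text."""
    # Generous timeout (an assumption): apply requantizes every layer before returning.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


def start():
    """Begin observing live traffic."""
    return send(build_request("/calibration/start", method="POST"))


def status():
    """Fetch per-layer collection progress."""
    return send(build_request("/calibration/status"))


def apply(save_cimatrix=None):
    """Requantize from the collected statistics and hot-swap the layers."""
    payload = {}  # assumption: an empty object is accepted when the field is omitted
    if save_cimatrix is not None:
        payload["save_cimatrix"] = save_cimatrix
    return send(build_request("/calibration/apply", method="POST", payload=payload))


def run_lifecycle(wait_seconds=60.0, save_cimatrix="traffic.cimatrix"):
    """Start collection, let the server handle traffic for a while, then apply."""
    start()
    time.sleep(wait_seconds)  # requests served in this window are what gets observed
    print(status())
    return apply(save_cimatrix)
```

Calling run_lifecycle() against a running server performs the whole sequence. In practice the fixed wait would likely be replaced by a check on the token rows reported by status, once the response format is known.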

Collection costs nothing until started, and decode returns to full speed after apply.

  • The model must have been loaded with --isq from source weights (safetensors). Otherwise start returns an error, including for models loaded with --from-uqff.
  • Importance weighting applies to the K-quant types (Q2K through Q6K). GGUF-family and AFQ types collect and requantize. HQQ and FP8 ISQ types do not support collection, so start returns an error for them.
  • Pre-quantized source checkpoints (FP8, GPTQ, BnB) requantize from the resident weights, not the source files.
  • Layers whose weights cannot be re-read exactly (matformer slicing, rank-sharded fused expert halves) requantize from the resident weights instead, logged at apply time.
  • MoE expert stacks are rebuilt from the checkpoint in any supported layout.
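The ISQ type rules in the list above can be summarized as a small lookup. This is an illustrative Python sketch only: the function name, the category labels, and the type spellings other than q4k (for example afq4, hqq4, q8_0) are assumptions made for the example, and the server's own check is authoritative.

```python
# K-quant types named in the list above (Q2K through Q6K), lowercased like --isq q4k.
K_QUANTS = {"q2k", "q3k", "q4k", "q5k", "q6k"}


def collection_support(isq_type):
    """Classify an ISQ type name according to the rules listed above."""
    name = isq_type.lower()
    if name.startswith(("hqq", "fp8")):
        return "unsupported"  # start returns an error for these types
    if name in K_QUANTS:
        return "importance-weighted"  # requantized with the importance matrix
    if name.startswith(("q", "afq")):
        return "collect-and-requantize"  # other GGUF-family and AFQ types
    return "unknown"  # spelling not covered by this sketch


for candidate in ("q4k", "q8_0", "afq4", "hqq4"):
    print(candidate, "->", collection_support(candidate))
```

A check like this only mirrors the documented rules; it does not cover the other start requirement, that the model was loaded with --isq from source weights.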