# Quantization in mistral.rs

Mistral.rs supports the following quantization methods:
- ⭐ ISQ (see the ISQ documentation for details)
  - Supported in all plain/vision and adapter models
  - Works on all supported devices
  - Automatically selects the fastest and most accurate method
  - Supports:
    - Q, K type GGUF quants
    - AFQ
    - HQQ
    - FP8
- GGUF/GGML
  - Q, K type
  - Supported in GGUF/GGML and GGUF/GGML adapter models
  - Supported in all plain/vision and adapter models
  - Imatrix quantization is supported
  - I quants coming!
  - CPU, CUDA, Metal (all supported devices)
  - 2, 3, 4, 5, 6, 8 bit
- GPTQ (convert with [scripts/convert_to_gptq.py](../scripts/convert_to_gptq.py))
  - Supported in all plain/vision and adapter models
  - CUDA only
  - 2, 3, 4, 8 bit
  - Marlin kernel support in 4-bit and 8-bit
- AWQ (convert with the provided conversion script)
  - Supported in all plain/vision and adapter models
  - CUDA only
  - 4, 8 bit
  - Marlin kernel support in 4-bit and 8-bit
- HQQ
  - Supported in all plain/vision and adapter models via ISQ
  - 4, 8 bit
  - CPU, CUDA, Metal (all supported devices)
- FP8
  - Supported in all plain/vision and adapter models
  - CPU, CUDA, Metal (all supported devices)
- BNB
  - Supported in all plain/vision and adapter models
  - bitsandbytes int8, fp4, nf4 support
- AFQ
  - 2, 3, 4, 6, 8 bit
  - 🔥 Designed to be fast on Metal!
  - Only supported on Metal
- MLX prequantized
  - Supported in all plain/vision and adapter models
## Using a GGUF quantized model

- Use the `gguf` (CLI) / `GGUF` (Python) model selector (see the Python sketch below)
- Provide the GGUF file

```bash
mistralrs run --format gguf -f my-gguf-file.gguf
```
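For the Python API, the `GGUF` model selector is exposed as `Which.GGUF`. The following is a minimal sketch assuming the `mistralrs` Python package's `Runner`/`Which`/`ChatCompletionRequest` interface; the repository IDs and filename are placeholders, and argument names may differ slightly between versions.

```python
from mistralrs import Runner, Which, ChatCompletionRequest

# Minimal sketch: load a GGUF file via the GGUF (Python) model selector.
# The repository IDs and filename below are placeholders.
runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",  # supplies tokenizer/chat template
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",
        messages=[{"role": "user", "content": "Briefly explain K-quants."}],
        max_tokens=128,
    )
)
print(res.choices[0].message.content)
```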
## Using ISQ

See the ISQ documentation for full details; a Python sketch follows the command below.

```bash
mistralrs run --isq 4 -m microsoft/Phi-3-mini-4k-instruct
```
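From the Python API, ISQ is requested when constructing the runner. This is a minimal sketch, assuming the `mistralrs` package accepts an `in_situ_quant` argument on `Runner`; the accepted ISQ names (e.g. `"Q4K"`) may vary by version.

```python
from mistralrs import Runner, Which, ChatCompletionRequest

# Minimal sketch: load an unquantized model and quantize it in place at load time.
# "Q4K" is assumed to select a 4-bit K-quant; check your installed version for the
# exact set of accepted ISQ names.
runner = Runner(
    which=Which.Plain(model_id="microsoft/Phi-3-mini-4k-instruct"),
    in_situ_quant="Q4K",
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="phi3",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
)
print(res.choices[0].message.content)
```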
## Using a GPTQ quantized model

- Provide the model ID for the GPTQ model
- Mistral.rs will automatically detect and use GPTQ quantization for plain and vision models (see the Python sketch below)!
- The Marlin kernel will automatically be used for 4-bit and 8-bit

```bash
mistralrs run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit
```
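Because the GPTQ quantization config is detected automatically, nothing GPTQ-specific is needed from the Python API either. A hedged sketch, assuming the `mistralrs` `Runner`/`Which.Plain` interface (older versions may additionally require an `arch` argument):

```python
from mistralrs import Runner, Which

# Sketch: the GPTQ config in the repository is detected automatically, so the model
# is loaded like any plain model (CUDA only; Marlin kernels used for 4-bit/8-bit).
runner = Runner(
    which=Which.Plain(model_id="kaitchup/Phi-3-mini-4k-instruct-gptq-4bit"),
)
```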
You can create your own GPTQ model using [scripts/convert_to_gptq.py](../scripts/convert_to_gptq.py):

```bash
pip install gptqmodel transformers datasets
python3 scripts/convert_to_gptq.py --src path/to/model --dst output/model/path --bits 4
```
## Using an MLX prequantized model (on Metal)

- Provide the model ID for the MLX prequantized model
- Mistral.rs will automatically detect and use the quantization for plain and vision models (see the Python sketch below)!
- Specialized kernels will be used to accelerate inference!

```bash
mistralrs run -m mlx-community/Llama-3.2-1B-Instruct-8bit
```
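The same applies from the Python API: pointing the plain model selector at the MLX-quantized repository is enough, since the prequantized weights are detected automatically. A sketch under the same assumptions as above (the `mistralrs` `Runner`/`Which.Plain` interface):

```python
from mistralrs import Runner, Which, ChatCompletionRequest

# Sketch: MLX prequantized weights are detected from the repository config; on Metal,
# the specialized kernels are selected automatically.
runner = Runner(
    which=Which.Plain(model_id="mlx-community/Llama-3.2-1B-Instruct-8bit"),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="llama",
        messages=[{"role": "user", "content": "Say hello!"}],
        max_tokens=32,
    )
)
print(res.choices[0].message.content)
```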