Performance

Guides for tuning throughput, memory, and latency.

| If you need to… | Start here |
| --- | --- |
| Fit a model into less memory | Pick a quantization method |
| Let mistral.rs benchmark the host | Let the tune command decide for you |
| Improve attention throughput on NVIDIA GPUs | Use flash attention |
| Improve high-concurrency serving memory use | Use paged attention |
| Split one model across local GPUs | Multi-GPU tensor parallelism |
| Split one model across machines | Multi-machine inference with the ring backend |
| Place layers manually | Topology |
| Reduce latency with a draft model | Speculative decoding |
| Save an ISQ result for faster reloads | UQFF for pre-quantized models |

Underlying concepts (paged attention design, what quantization changes, MLA) live in the Explanation section.