Performance
Guides for tuning throughput, memory, and latency.
Choose by constraint
| If you need to… | Start here |
|---|---|
| Fit a model into less memory | Pick a quantization method |
| Let mistral.rs benchmark the host | Let the tune command decide for you |
| Improve attention throughput on NVIDIA GPUs | Use flash attention |
| Improve high-concurrency serving memory use | Use paged attention |
| Reduce CUDA decode launch overhead | Use CUDA graphs |
| Compare multi-GPU and distributed modes | Multi-GPU and distributed inference |
| Split one model across local GPUs | Single-machine multi-GPU |
| Run NCCL across machines | Multi-node NCCL inference |
| Use the ring backend | Ring backend inference |
| Place layers manually | Topology |
| Reduce decode latency with MTP | Speculative decoding |
| Use Gemma 4 assistant checkpoints for MTP | Gemma 4 MTP |
| Save an ISQ result for faster reloads | UQFF for pre-quantized models |

Underlying concepts (paged attention design, what quantization changes, MLA) live in the Explanation section.