Performance
Guides for tuning throughput, memory, and latency.
Choose by constraint

| If you need to… | Start here |
|---|---|
| Fit a model into less memory | Pick a quantization method |
| Let mistral.rs benchmark the host | Let the tune command decide for you |
| Improve attention throughput on NVIDIA GPUs | Use flash attention |
| Improve high-concurrency serving memory use | Use paged attention |
| Split one model across local GPUs | Multi-GPU tensor parallelism |
| Split one model across machines | Multi-machine inference with the ring backend |
| Place layers manually | Topology |
| Reduce latency with a draft model | Speculative decoding |
| Save an ISQ result for faster reloads | UQFF for pre-quantized models |

Underlying concepts (paged attention design, what quantization changes, MLA) live in the Explanation section.