Performance

Guides for tuning throughput, memory, and latency.

| If you need to… | Start here |
| --- | --- |
| Fit a model into less memory | Pick a quantization method |
| Let mistral.rs benchmark the host | Let the tune command decide for you |
| Improve attention throughput on NVIDIA GPUs | Use flash attention |
| Improve high-concurrency serving memory use | Use paged attention |
| Reduce CUDA decode launch overhead | Use CUDA graphs |
| Compare multi-GPU and distributed modes | Multi-GPU and distributed inference |
| Split one model across local GPUs | Single-machine multi-GPU |
| Run NCCL across machines | Multi-node NCCL inference |
| Use the ring backend | Ring backend inference |
| Place layers manually | Topology |
| Reduce decode latency with MTP | Speculative decoding |
| Use Gemma 4 assistant checkpoints for MTP | Gemma 4 MTP |
| Save an ISQ result for faster reloads | UQFF for pre-quantized models |

Underlying concepts (paged attention design, what quantization changes, MLA) live in the Explanation section.