Performance

Guides for tuning throughput, memory, and latency.

Choose by constraint

If you need to…	Start here
Fit a model into less memory	Pick a quantization method
Let mistral.rs benchmark the host	Let the tune command decide for you
Improve attention throughput on NVIDIA GPUs	Use flash attention
Use memory more efficiently when serving many concurrent requests	Use paged attention
Reduce CUDA decode launch overhead	Use CUDA graphs
Compare multi-GPU and distributed modes	Multi-GPU and distributed inference
Split one model across local GPUs	Single-machine multi-GPU
Run NCCL across machines	Multi-node NCCL inference
Use the ring backend	Ring backend inference
Place layers manually	Topology
Reduce decode latency with MTP	Speculative decoding
Use Gemma 4 assistant checkpoints for MTP	Gemma 4 MTP
Save an ISQ result for faster reloads	UQFF for pre-quantized models

The underlying concepts (paged attention design, what quantization changes, and multi-head latent attention, or MLA) are covered in the Explanation section.