Multi-GPU and distributed inference
mistral.rs has four ways to spread one model beyond a single GPU. They solve different problems and use different communication paths.
| Mode | Scope | Transport | Use when |
|---|---|---|---|
| NCCL tensor parallelism | One machine, multiple CUDA GPUs | NCCL collectives | You have similar local GPUs and want the default high-throughput CUDA path. |
| Layer mapping with CUDA P2P | One machine, multiple devices | CUDA peer access when available, CPU staging otherwise | NCCL is unavailable, disabled, or the model/layout needs contiguous layer placement. |
| Multi-node NCCL | Multiple machines, each with local CUDA GPUs | NCCL across all ranks, with mistral.rs coordinating node startup | You want tensor parallelism across machines and each node contributes one or more local CUDA ranks. |
| Ring backend | Multiple machines | mistral.rs ring transport from RING_CONFIG | You explicitly want the ring backend instead of NCCL. |
Selection Flow
With a CUDA build and no manual mapping:
- One visible GPU runs the model on that GPU.
- Multiple visible GPUs use NCCL tensor parallelism when the binary has `cuda nccl` and `MISTRALRS_NO_NCCL` is not set.
- If NCCL is unavailable or disabled, mistral.rs uses layer mapping. CUDA pairs use P2P when the driver allows it; otherwise transfers stage through CPU.
- If `MISTRALRS_MN_GLOBAL_WORLD_SIZE` is set, NCCL tensor parallelism is extended across nodes.
- If `RING_CONFIG` is set and the binary has `ring`, the ring backend is available. If the binary also has NCCL, set `MISTRALRS_NO_NCCL=1` to force ring.
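
The rules above can be read as a small decision function. The following is only an illustrative sketch, not mistral.rs code: the `Backend` enum, the `Env` struct, and `select_backend` are hypothetical names, and the order in which the checks are applied (ring first, then multi-node, then the single-GPU case) is an assumption about precedence that the list does not spell out.

```rust
/// Hypothetical labels for the four modes plus the single-GPU case.
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    SingleGpu,
    NcclTensorParallel,
    MultiNodeNccl,
    LayerMapping,
    Ring,
}

/// Illustrative inputs; these fields are not mistral.rs types.
#[derive(Clone, Copy)]
struct Env {
    visible_gpus: usize,
    built_with_nccl: bool,      // binary built with NCCL support
    built_with_ring: bool,      // binary built with ring support
    no_nccl: bool,              // MISTRALRS_NO_NCCL is set
    mn_global_world_size: bool, // MISTRALRS_MN_GLOBAL_WORLD_SIZE is set
    ring_config: bool,          // RING_CONFIG is set
}

/// Assumed precedence: ring (only when NCCL is not usable), then
/// multi-node NCCL, then single GPU, then local NCCL or layer mapping.
fn select_backend(e: &Env) -> Backend {
    let nccl_usable = e.built_with_nccl && !e.no_nccl;

    // Ring is available when configured and compiled in; it is only
    // selected here once NCCL is absent or disabled.
    if e.ring_config && e.built_with_ring && !nccl_usable {
        return Backend::Ring;
    }
    // Setting the global world size extends NCCL across nodes.
    if nccl_usable && e.mn_global_world_size {
        return Backend::MultiNodeNccl;
    }
    // One visible GPU runs the model on that GPU.
    if e.visible_gpus <= 1 {
        return Backend::SingleGpu;
    }
    if nccl_usable {
        Backend::NcclTensorParallel
    } else {
        // Layer mapping uses CUDA P2P when allowed, CPU staging otherwise.
        Backend::LayerMapping
    }
}
```

In this sketch, a binary with both NCCL and ring support keeps choosing NCCL even when `RING_CONFIG` is set, which mirrors the note that `MISTRALRS_NO_NCCL=1` is needed to force ring.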
Start Here
For one machine with multiple CUDA GPUs, start with single-machine multi-GPU.
For multiple machines using NCCL tensor parallelism, use multi-node NCCL inference.
For multiple machines using the ring backend, use ring backend inference.
For exact layer or tensor placement on one host, use the topology guide.