Multi-GPU and distributed inference

mistral.rs has four ways to spread one model beyond a single GPU. They solve different problems and use different communication paths.

| Mode | Scope | Transport | Use when |
| --- | --- | --- | --- |
| NCCL tensor parallelism | One machine, multiple CUDA GPUs | NCCL collectives | You have similar local GPUs and want the default high-throughput CUDA path. |
| Layer mapping with CUDA P2P | One machine, multiple devices | CUDA peer access when available, CPU staging otherwise | NCCL is unavailable, disabled, or the model/layout needs contiguous layer placement. |
| Multi-node NCCL | Multiple machines, each with local CUDA GPUs | NCCL across all ranks, with mistral.rs coordinating node startup | You want tensor parallelism across machines and each node contributes one or more local CUDA ranks. |
| Ring backend | Multiple machines | mistral.rs ring transport from RING_CONFIG | You explicitly want the ring backend instead of NCCL. |

With a CUDA build and no manual mapping, mistral.rs behaves as follows:

  1. One visible GPU runs the model on that GPU.
  2. Multiple visible GPUs use NCCL tensor parallelism when the binary is built with the cuda and nccl features and MISTRALRS_NO_NCCL is not set.
  3. If NCCL is unavailable or disabled, mistral.rs uses layer mapping. CUDA pairs use P2P when the driver allows it; otherwise transfers stage through CPU.
  4. If MISTRALRS_MN_GLOBAL_WORLD_SIZE is set, NCCL tensor parallelism is extended across nodes.
  5. If RING_CONFIG is set and the binary is built with the ring feature, the ring backend is available. If the binary also includes NCCL, set MISTRALRS_NO_NCCL=1 to force ring.
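
The rules above can be read as a small decision procedure. The sketch below is an illustrative model only, not mistral.rs source code. The function `select_mode` and the returned mode labels are hypothetical names. It also rests on three assumptions the list does not spell out: the binary is a CUDA build, the mere presence of MISTRALRS_NO_NCCL disables NCCL, and the cross-machine checks (ring, multi-node) take precedence over the single-machine ones.

```python
from typing import Mapping, Set


def select_mode(visible_gpus: int, features: Set[str], env: Mapping[str, str]) -> str:
    """Illustrative model of the selection rules; labels are hypothetical."""
    # Assumption: any value of MISTRALRS_NO_NCCL counts as "set".
    nccl_usable = "nccl" in features and "MISTRALRS_NO_NCCL" not in env

    # Rule 5: ring needs RING_CONFIG and the ring feature. If NCCL is also
    # built in, MISTRALRS_NO_NCCL must be set for ring to be chosen.
    if "RING_CONFIG" in env and "ring" in features and not nccl_usable:
        return "ring"
    # Rule 4: a global world size extends NCCL tensor parallelism across nodes.
    if nccl_usable and "MISTRALRS_MN_GLOBAL_WORLD_SIZE" in env:
        return "multi-node-nccl"
    # Rule 1: a single visible GPU runs the whole model.
    if visible_gpus <= 1:
        return "single-gpu"
    # Rule 2: multiple GPUs with usable NCCL.
    if nccl_usable:
        return "nccl-tensor-parallel"
    # Rule 3: fall back to layer mapping (P2P or CPU staging).
    return "layer-mapping"


print(select_mode(4, {"cuda", "nccl"}, {}))  # nccl-tensor-parallel
```

Passing the environment as a plain mapping keeps the sketch easy to test. To model a real process, pass `os.environ` instead.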

For one machine with multiple CUDA GPUs, start with single-machine multi-GPU.

For multiple machines using NCCL tensor parallelism, use multi-node NCCL inference.

For multiple machines using the ring backend, use ring backend inference.

For exact layer or tensor placement on one host, use the topology guide.