Single-machine multi-GPU
This page covers one machine with multiple local GPUs. For the full mode comparison, start with multi-GPU and distributed inference.
Tensor parallelism splits each layer across all GPUs and uses NCCL collectives to combine partial results. This is the preferred CUDA multi-GPU mode when the model supports it.
Layer mapping places different layer ranges on different devices. It is the fallback when NCCL is unavailable, disabled, or not suitable for the selected model. CUDA layer mapping enables peer access (P2P) for GPU pairs that support it; otherwise boundary activations are staged through CPU.
Default selection
With no manual mapping flags:
- One visible GPU runs the whole model on that GPU.
- Multiple visible CUDA GPUs use NCCL tensor parallelism when the binary was built with the cuda nccl features and MISTRALRS_NO_NCCL is not set.
- If NCCL is unavailable or disabled, mistral.rs uses layer mapping across the visible GPUs.
The selected layout is printed in the startup logs.
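The selection rules above can be read as a small decision function. The sketch below is only an illustration of that logic, not code from mistral.rs: the function name, its arguments, and the printed labels are hypothetical, and it assumes that any non-empty MISTRALRS_NO_NCCL value counts as "set".

```shell
# Hypothetical helper mirroring the documented default selection.
# Arguments: <visible GPU count, at least 1> <built with nccl: 1 or 0> <MISTRALRS_NO_NCCL value>
# The printed labels are illustrative and are not actual log output.
select_layout() {
    gpus=$1
    built_with_nccl=$2
    no_nccl=$3
    if [ "$gpus" -le 1 ]; then
        echo "single-gpu"
    elif [ "$built_with_nccl" -eq 1 ] && [ -z "$no_nccl" ]; then
        echo "nccl-tensor-parallel"
    else
        # NCCL missing from the build, or disabled at runtime.
        echo "layer-mapping"
    fi
}

select_layout 1 1 ""    # prints single-gpu
select_layout 4 1 ""    # prints nccl-tensor-parallel
select_layout 4 1 "1"   # prints layer-mapping
select_layout 4 0 ""    # prints layer-mapping
```

The startup logs remain the authoritative source for which layout was actually chosen.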
Build requirements
Linux CUDA installs enable nccl when the installer or wheel builder finds libnccl.
Manual Linux CUDA build with NCCL:
cargo install mistralrs-cli --features "cuda nccl flash-attn cudnn"
If NCCL is not installed, omit nccl:
cargo install mistralrs-cli --features "cuda flash-attn cudnn"
To force the installer decision, use MISTRALRS_INSTALL_NCCL=1 or MISTRALRS_INSTALL_NO_NCCL=1. To disable NCCL at runtime without rebuilding:
MISTRALRS_NO_NCCL=1 mistralrs serve -m Qwen/Qwen3-32B --quant 4
Select GPUs
Use CUDA_VISIBLE_DEVICES to restrict the GPU set before mistral.rs starts:
CUDA_VISIBLE_DEVICES=0,1 mistralrs serve -m Qwen/Qwen3-32B --quant 4
The ordinals in --device-layers are the visible ordinals after CUDA_VISIBLE_DEVICES is applied.
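To make the renumbering concrete: with CUDA_VISIBLE_DEVICES=2,3, visible ordinal 0 refers to physical GPU 2 and ordinal 1 to physical GPU 3. The helper below is a hypothetical sketch (the function name is not part of mistral.rs or CUDA) that looks up the entry behind a visible ordinal in a comma-separated list.

```shell
# Hypothetical helper: print the CUDA_VISIBLE_DEVICES entry behind a visible ordinal.
# Arguments: <comma-separated CUDA_VISIBLE_DEVICES value> <visible ordinal, 0-based>
physical_gpu() {
    # cut fields are 1-based, so shift the 0-based ordinal by one.
    printf '%s\n' "$1" | cut -d, -f"$(( $2 + 1 ))"
}

physical_gpu "2,3" 0   # prints 2
physical_gpu "2,3" 1   # prints 3
```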
NCCL tensor parallelism uses all visible CUDA GPUs. The tensor-parallel size must be compatible with the model:
- Attention heads must divide evenly across GPUs.
- KV heads must either divide evenly across GPUs or be replicated evenly when there are fewer KV heads than GPUs.
If the visible GPU count is incompatible, mistral.rs errors instead of selecting a smaller subset.
Use CUDA_VISIBLE_DEVICES to choose a compatible subset.
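Before picking a subset, the two head rules can be checked by hand. The following is a minimal sketch, not mistral.rs code: the function name is hypothetical, the head counts in the example are illustrative rather than taken from a specific model, and "replicated evenly" is assumed to mean that the GPU count is a multiple of the KV head count.

```shell
# Hypothetical check of the head-divisibility rules; not part of mistral.rs.
# Arguments: <attention heads> <KV heads> <GPU count>
# Returns 0 when the GPU count looks compatible, 1 otherwise.
tp_compatible() {
    heads=$1
    kv_heads=$2
    gpus=$3
    # Attention heads must divide evenly across GPUs.
    [ $(( heads % gpus )) -eq 0 ] || return 1
    # KV heads divide evenly across GPUs ...
    [ $(( kv_heads % gpus )) -eq 0 ] && return 0
    # ... or there are fewer KV heads than GPUs and each KV head is
    # replicated on the same number of GPUs (assumption: gpus % kv_heads == 0).
    [ "$kv_heads" -lt "$gpus" ] && [ $(( gpus % kv_heads )) -eq 0 ]
}

# Illustrative numbers: 64 attention heads, 8 KV heads.
for n in 2 3 4 8 16; do
    if tp_compatible 64 8 "$n"; then
        echo "$n GPUs: compatible"
    else
        echo "$n GPUs: incompatible"
    fi
done
```

With these example numbers, only the 3-GPU case is reported as incompatible, because 64 attention heads do not divide evenly by 3.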
Manual layer mapping
-n/--device-layers assigns layer counts to devices. The value is a semicolon-separated list of ordinal:layer-count pairs:
mistralrs serve -n "0:32;1:32" -m <model>
For uneven GPUs, put fewer layers on the smaller or busier GPU:
mistralrs serve -n "0:44;1:20" -m Qwen/Qwen3-32B --quant 4
For per-layer or per-tensor placement, use the topology guide.
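When the layer counts come from a script, the mapping string can be assembled rather than typed. This is a small sketch with a hypothetical helper name; it only joins the counts into the ordinal:count format shown above and does not validate them against the model.

```shell
# Hypothetical helper: join per-GPU layer counts into a --device-layers value.
# Arguments: one layer count per visible GPU, in ordinal order.
device_layers() {
    spec=""
    ord=0
    for count in "$@"; do
        # Add a ';' separator before every entry except the first.
        spec="${spec}${spec:+;}${ord}:${count}"
        ord=$(( ord + 1 ))
    done
    echo "$spec"
}

device_layers 44 20      # prints 0:44;1:20
```

The result can then be passed along, for example as mistralrs serve -n "$(device_layers 44 20)" -m Qwen/Qwen3-32B --quant 4.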
Performance notes
Use NCCL when possible for single-machine CUDA tensor parallelism. It keeps collective communication on the GPU path and is the expected path for multiple similar GPUs.
Layer mapping moves activations only at layer-boundary device changes, so contiguous ranges matter. On CUDA, peer access is enabled for supported GPU pairs. If the driver reports that peer access is unavailable or cannot be enabled, those transfers stage through CPU and startup logs include a warning.
For mixed GPU memory sizes, manually set --device-layers; the automatic split does not optimize for heterogeneous memory or PCIe/NVLink topology.
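One possible starting point for a manual split is to assign layers in proportion to each GPU's memory. The sketch below is a rough heuristic under that assumption, not a mistral.rs feature: the helper name is hypothetical, and it ignores KV cache, non-layer weights, and interconnect topology, so the result should be tuned by hand.

```shell
# Hypothetical helper: split layers across GPUs in proportion to memory.
# Arguments: <total layers> <memory of GPU 0> <memory of GPU 1> ...
# Memory values only need a common unit (for example GiB).
split_layers() {
    total=$1; shift
    mem_sum=0
    for m in "$@"; do mem_sum=$(( mem_sum + m )); done
    spec=""; ord=0; assigned=0
    for m in "$@"; do
        if [ "$ord" -eq $(( $# - 1 )) ]; then
            # The last GPU takes whatever integer division left over.
            count=$(( total - assigned ))
        else
            count=$(( total * m / mem_sum ))
        fi
        spec="${spec}${spec:+;}${ord}:${count}"
        assigned=$(( assigned + count ))
        ord=$(( ord + 1 ))
    done
    echo "$spec"
}

split_layers 64 48 24    # prints 0:42;1:22
```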
For tensor parallelism across machines, use multi-node NCCL inference. For the ring transport, use the ring backend guide.