
Multi-node NCCL inference

Multi-node NCCL inference extends tensor parallelism across machines. Each node contributes one or more local CUDA ranks to one global NCCL communicator.

This is separate from the ring backend. Multi-node NCCL uses MISTRALRS_MN_* variables and does not use RING_CONFIG.

Build the same CUDA+NCCL binary on every node:

cargo install mistralrs-cli --features "cuda nccl flash-attn cudnn"

Every node should use the same model, dtype, quantization, and runtime arguments. Use CUDA_VISIBLE_DEVICES on each node to choose the local GPUs that participate.

Use the same local TP size on every node. The global world size should be:

global world size = local TP size * number of nodes
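The formula above can be sketched in shell. This is an illustration only: count_gpus and global_world_size are hypothetical helper names, and the sketch assumes each GPU listed in CUDA_VISIBLE_DEVICES hosts exactly one local rank.

```shell
# Hypothetical helpers, not part of mistral.rs.

# Count the comma-separated entries of a CUDA_VISIBLE_DEVICES-style list.
count_gpus() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# global world size = local TP size * number of nodes
# $1 = local TP size, $2 = number of nodes
global_world_size() {
  echo $(( $1 * $2 ))
}

local_tp=$(count_gpus "0,1,2,3")
echo "local=$local_tp global=$(global_world_size "$local_tp" 2)"
```

The two numbers printed are the values that would go into MISTRALRS_MN_LOCAL_WORLD_SIZE and MISTRALRS_MN_GLOBAL_WORLD_SIZE for a two-node setup.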

Common variables:

| Variable | Where | Purpose |
| --- | --- | --- |
| MISTRALRS_MN_GLOBAL_WORLD_SIZE | All nodes | Total NCCL ranks across all nodes. Presence of this variable enables multi-node mode. |
| MISTRALRS_MN_LOCAL_WORLD_SIZE | All nodes | Number of local CUDA ranks contributed by each node. |
| MISTRALRS_MN_HEAD_NUM_WORKERS | Head node | Number of worker nodes. |
| MISTRALRS_MN_HEAD_PORT | Head node | TCP port used to distribute the NCCL id to worker nodes. |
| MISTRALRS_MN_WORKER_SERVER_ADDR | Worker nodes | host:port of the head node. |
| MISTRALRS_MN_WORKER_ID | Worker nodes | Zero-based worker node id. |

Do not set MISTRALRS_NO_NCCL. Do not set RING_CONFIG.
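A pre-launch check along these lines can catch conflicting settings before the model loads. It is a minimal sketch, not something mistral.rs provides: mn_preflight is a hypothetical function name, and the divisibility check simply restates the rule that the global world size is the local size times the number of nodes.

```shell
# Hypothetical pre-launch check; run it in the same shell that will launch mistralrs.
mn_preflight() {
  if [ -n "${MISTRALRS_NO_NCCL:-}" ]; then
    echo "error: MISTRALRS_NO_NCCL is set; unset it for multi-node NCCL" >&2
    return 1
  fi
  if [ -n "${RING_CONFIG:-}" ]; then
    echo "error: RING_CONFIG is set; multi-node NCCL does not use it" >&2
    return 1
  fi
  if [ -z "${MISTRALRS_MN_GLOBAL_WORLD_SIZE:-}" ] || [ -z "${MISTRALRS_MN_LOCAL_WORLD_SIZE:-}" ]; then
    echo "error: set both MISTRALRS_MN_GLOBAL_WORLD_SIZE and MISTRALRS_MN_LOCAL_WORLD_SIZE" >&2
    return 1
  fi
  if [ "$MISTRALRS_MN_LOCAL_WORLD_SIZE" -le 0 ] ||
     [ $(( MISTRALRS_MN_GLOBAL_WORLD_SIZE % MISTRALRS_MN_LOCAL_WORLD_SIZE )) -ne 0 ]; then
    echo "error: global world size must be a positive multiple of the local world size" >&2
    return 1
  fi
  # The quotient is the number of nodes implied by the two sizes.
  echo "ok: $(( MISTRALRS_MN_GLOBAL_WORLD_SIZE / MISTRALRS_MN_LOCAL_WORLD_SIZE )) node(s) expected"
}
```

Calling mn_preflight before the launch command on each node prints the implied node count, which should match across the head and all workers.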

Example: two nodes with four GPUs each, so the global world size is 4 * 2 = 8.

Head node:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
MISTRALRS_MN_GLOBAL_WORLD_SIZE=8 \
MISTRALRS_MN_LOCAL_WORLD_SIZE=4 \
MISTRALRS_MN_HEAD_NUM_WORKERS=1 \
MISTRALRS_MN_HEAD_PORT=9000 \
mistralrs serve -m Qwen/Qwen3-32B --quant 4

Worker node (10.0.0.1 is the head node's address in this example):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
MISTRALRS_MN_GLOBAL_WORLD_SIZE=8 \
MISTRALRS_MN_LOCAL_WORLD_SIZE=4 \
MISTRALRS_MN_WORKER_SERVER_ADDR=10.0.0.1:9000 \
MISTRALRS_MN_WORKER_ID=0 \
mistralrs serve -m Qwen/Qwen3-32B --quant 4
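With more than one worker, each worker gets its own zero-based MISTRALRS_MN_WORKER_ID while the other variables stay the same. The sketch below prints the per-worker settings for a head plus two workers. print_worker_env is a hypothetical helper, and it assumes the head counts as one node in addition to the workers, as in the two-node example where MISTRALRS_MN_HEAD_NUM_WORKERS=1.

```shell
# Hypothetical helper: print the variables each worker node would set.
# $1 = host:port of the head node, $2 = local world size, $3 = number of workers
print_worker_env() {
  head_addr=$1
  local_size=$2
  num_workers=$3
  # Assumption: total nodes = workers + 1 head node.
  global=$(( local_size * (num_workers + 1) ))
  id=0
  while [ "$id" -lt "$num_workers" ]; do
    echo "worker $id: MISTRALRS_MN_GLOBAL_WORLD_SIZE=$global MISTRALRS_MN_LOCAL_WORLD_SIZE=$local_size MISTRALRS_MN_WORKER_SERVER_ADDR=$head_addr MISTRALRS_MN_WORKER_ID=$id"
    id=$(( id + 1 ))
  done
}

print_worker_env 10.0.0.1:9000 4 2
```

In that three-node layout the head node would set MISTRALRS_MN_HEAD_NUM_WORKERS=2 and the same global and local world sizes as the workers.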

Send client requests to the head node.

The head port must be reachable from every worker. NCCL must also be able to use the network interface between nodes; on multi-interface machines, set NCCL networking variables such as NCCL_SOCKET_IFNAME in the shell before launching.
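As a sketch of the NCCL networking setup, the lines below pin NCCL to one interface before launch. The interface name eth0 is an assumption; substitute the interface that actually carries traffic between your nodes. NCCL_DEBUG=INFO is optional and only raises NCCL's log verbosity so you can see which interface it selected.

```shell
# Assumption: eth0 is the interface that connects the nodes.
# List candidate interfaces first with, for example: ip -brief addr
export NCCL_SOCKET_IFNAME=eth0

# Optional: have NCCL log its network setup during initialization.
export NCCL_DEBUG=INFO
```

Export these on the head and on every worker in the same shell that runs the mistralrs command.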

Use the ring backend only when you intentionally want the ring transport. A binary built with both nccl and ring will prefer NCCL unless MISTRALRS_NO_NCCL=1 is set.