Split a model across multiple GPUs
When a model exceeds one GPU’s memory after quantization, mistral.rs can split it across multiple GPUs on the same host.
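As a rough check on whether splitting is needed, weight memory can be estimated as parameter count times bits per weight, divided by 8. The sketch below is only that arithmetic: `weights_gb` is a hypothetical helper (not part of mistral.rs), and it assumes `--quant 4` corresponds to roughly 4 bits per weight. It ignores the KV cache, activations, and quantization metadata, so real usage will be higher.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of mistral.rs.
# Usage: weights_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Prints a weight-only estimate in decimal gigabytes, rounded up.
weights_gb() {
  local params_b=$1 bits=$2
  # billions of parameters * bits / 8 bits-per-byte = gigabytes (rounded up)
  echo $(( (params_b * bits + 7) / 8 ))
}

weights_gb 32 4    # 32B parameters at ~4 bits per weight
weights_gb 32 16   # the same model at 16 bits per weight
```

If the estimate plus runtime overhead approaches the memory of a single GPU, splitting across devices is worth considering.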
Auto-detection
The CLI detects available GPUs and splits the model across them when more than one is present:
mistralrs serve -m Qwen/Qwen3-32B --quant 4
To restrict the device set, set CUDA_VISIBLE_DEVICES, following the usual CUDA convention:
CUDA_VISIBLE_DEVICES=0,1 mistralrs serve -m Qwen/Qwen3-32B --quant 4
Per-device layer counts
The -n/--device-layers option sets how many layers are placed on each GPU. Format: ORD:NUM;ORD:NUM;..., where ORD is the device ordinal and NUM is the number of layers for that device.
mistralrs serve -n "0:32;1:32" -m <model>
For per-tensor or per-layer placement, see the topology guide.
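To illustrate the ORD:NUM;... format, the sketch below builds an even split for a given layer count and GPU count. `device_layers` is a hypothetical helper, not part of mistral.rs, and the total of 64 layers is taken from the "0:32;1:32" example above; the real layer count depends on the model.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of mistral.rs.
# Usage: device_layers TOTAL_LAYERS GPU_COUNT
# Prints an ORD:NUM;... string spreading the layers as evenly as possible.
device_layers() {
  local total=$1 count=$2
  local base=$(( total / count )) extra=$(( total % count ))
  local out="" ord num
  for (( ord = 0; ord < count; ord++ )); do
    # The first `extra` devices each take one additional layer.
    num=$(( base + (ord < extra ? 1 : 0) ))
    # Join entries with ';' (no separator before the first entry).
    out+="${out:+;}${ord}:${num}"
  done
  printf '%s\n' "$out"
}

device_layers 64 2   # prints 0:32;1:32
```

The result can be passed straight to the option, for example `mistralrs serve -n "$(device_layers 64 2)" -m <model>`. An even split is only a starting point; uneven counts may suit GPUs with different memory sizes.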
Multi-machine
For cross-machine splitting, see the ring backend guide.