
Split a model across multiple GPUs

When a model exceeds one GPU’s memory after quantization, mistral.rs can split it across multiple GPUs on the same host.

The CLI detects available GPUs and splits across them when more than one is present:

mistralrs serve -m Qwen/Qwen3-32B --quant 4

To restrict which GPUs are used, set the standard CUDA environment variable CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=0,1 mistralrs serve -m Qwen/Qwen3-32B --quant 4

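To control the split manually, -n/--device-layers sets the number of layers placed on each GPU. The format is ORD:NUM;ORD:NUM;..., where ORD is the device ordinal and NUM is the number of layers assigned to that device: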

mistralrs serve -n "0:32;1:32" -m <model>
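The ORD:NUM;ORD:NUM;... value can be generated rather than typed by hand. The following is a minimal sketch, not part of mistral.rs: `device_layers_arg` is a hypothetical helper, and it assumes you already know the model's total layer count and want an even split across GPUs numbered from 0.

```python
def device_layers_arg(total_layers: int, num_gpus: int) -> str:
    """Build an ORD:NUM;ORD:NUM;... string that spreads layers evenly.

    Hypothetical helper for illustration only; mistral.rs does not ship it.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_layers < 0:
        raise ValueError("total_layers must not be negative")

    base, extra = divmod(total_layers, num_gpus)
    # The first `extra` GPUs take one additional layer so that the
    # per-device counts add up to total_layers exactly.
    counts = [base + (1 if ordinal < extra else 0) for ordinal in range(num_gpus)]
    return ";".join(f"{ordinal}:{count}" for ordinal, count in enumerate(counts))


if __name__ == "__main__":
    # 64 layers is an illustrative figure; check your model's configuration.
    print(device_layers_arg(64, 2))
```

When passing the result on the command line, keep it in quotes as in the command above, because an unquoted `;` is treated by the shell as a command separator.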

For per-tensor or per-layer placement, see the topology guide.

For cross-machine splitting, see the ring backend guide.