# Cargo Features Reference
This document provides a complete reference for all cargo features available in mistral.rs.
## Quick Reference
| Feature | Description | Platform | Requires |
|---|---|---|---|
| `cuda` | NVIDIA GPU acceleration | Linux, Windows | CUDA toolkit |
| `cudnn` | NVIDIA cuDNN backend | Linux, Windows | `cuda`, cuDNN |
| `flash-attn` | FlashAttention V2 | Linux, Windows | `cuda`, CC >= 8.0 |
| `flash-attn-v3` | FlashAttention V3 | Linux, Windows | `cuda`, CC >= 9.0 |
| `metal` | Apple GPU acceleration | macOS | - |
| `accelerate` | Apple CPU acceleration | macOS | - |
| `mkl` | Intel MKL acceleration | Linux, Windows | Intel MKL |
| `nccl` | Multi-GPU (NVIDIA NCCL) | Linux | `cuda`, NCCL |
| `ring` | Multi-GPU/node (TCP ring) | All | - |
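If you consume mistral.rs as a library rather than installing the CLI, the same feature flags apply to the crate dependency. A minimal sketch, assuming the high-level crate is named `mistralrs` (check the repository for the exact package to depend on):

```bash
# Add mistral.rs as a git dependency with the cuda feature enabled
# (crate name is an assumption; adjust to the actual package)
cargo add mistralrs --git https://github.com/EricLBuehler/mistral.rs.git --features cuda
```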
## GPU Acceleration Features
### `cuda`
Enables NVIDIA GPU acceleration via CUDA. This is the primary feature for running on NVIDIA GPUs.
Requirements:
- NVIDIA GPU
- CUDA toolkit installed
- Linux or Windows (WSL supported)
Usage:

```bash
cargo build --release --features cuda
cargo install mistralrs-cli --features cuda
```
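Before building, it can help to confirm the toolkit is actually discoverable. A quick sanity check, assuming a standard CUDA install:

```bash
# Should print the installed CUDA release; if not, nvcc is not in PATH
nvcc --version

# Lists visible GPUs and the driver version
nvidia-smi
```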
What it enables:
- GPU tensor operations via CUDA
- PagedAttention on CUDA devices
- Quantized inference on GPU
### `cudnn`
Enables NVIDIA cuDNN for optimized neural network primitives. Provides faster convolutions and other operations.
Requirements:
- `cuda` feature
- cuDNN library installed
Usage:

```bash
cargo build --release --features "cuda cudnn"
```
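If the build cannot find cuDNN, a quick way to confirm the library is visible to the dynamic linker on Linux (an illustrative check, not required by the build):

```bash
# Look for libcudnn in the dynamic linker cache (Linux)
ldconfig -p | grep -i cudnn
```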
### `flash-attn`
Enables FlashAttention V2 for faster attention computation. Significantly reduces memory usage and improves throughput.
Requirements:
- `cuda` feature (automatically enabled)
- GPU with compute capability >= 8.0 (Ampere or newer)
Compatible GPUs:
| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Ampere | 8.0, 8.6 | RTX 30 series, A100, A40 |
| Ada Lovelace | 8.9 | RTX 40 series, L40S |
| Blackwell | 10.0, 12.0 | RTX 50 series |
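To check which compute capability your GPU reports, recent NVIDIA drivers expose it through `nvidia-smi` (the `compute_cap` query field requires a reasonably new driver):

```bash
# Prints the compute capability of each visible GPU, e.g. "8.6"
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```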
Usage:

```bash
cargo build --release --features "cuda flash-attn cudnn"
```
Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.
### `flash-attn-v3`
Enables FlashAttention V3 for Hopper architecture GPUs. Provides additional performance improvements over V2 on supported hardware.
Requirements:
- `cuda` feature (automatically enabled)
- GPU with compute capability >= 9.0 (Hopper)
Compatible GPUs:
| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Hopper | 9.0 | H100, H800 |
Usage:

```bash
cargo build --release --features "cuda flash-attn-v3 cudnn"
```
Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.
### `metal`
Enables Apple Metal GPU acceleration for macOS devices.
Requirements:
- macOS with Apple Silicon or an AMD GPU (Metal is not available on Linux or Windows)
Usage:

```bash
cargo build --release --features metal
```
What it enables:
- GPU tensor operations via Metal
- PagedAttention on Metal devices (opt-in via `--paged-attn`)
- Quantized inference on Apple GPUs

Note: PagedAttention is disabled by default on Metal. Enable it with the `--paged-attn` flag.
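For example, to serve a model on Metal with PagedAttention opted in (the flag is taken from this document; its exact placement may vary by CLI version):

```bash
# Enable PagedAttention explicitly on Metal
mistralrs serve -m model-id --paged-attn
```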
## CPU Acceleration Features
### `accelerate`
Enables Apple’s Accelerate framework for optimized CPU operations on macOS.
Requirements:
- macOS
Usage:

```bash
cargo build --release --features accelerate

# Or combined with Metal:
cargo build --release --features "metal accelerate"
```
### `mkl`
Enables Intel Math Kernel Library (MKL) for optimized CPU operations.
Requirements:
- Intel MKL installed
- Intel CPU recommended (MKL runs on AMD CPUs but is optimized for Intel)
Usage:

```bash
cargo build --release --features mkl
```
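If MKL came from an Intel oneAPI installation, source its environment script before building so the build can locate the libraries. A sketch assuming the default install prefix:

```bash
# Default oneAPI install location; adjust if installed elsewhere
source /opt/intel/oneapi/setvars.sh
cargo build --release --features mkl
```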
## Distributed Inference Features
### `nccl`
Enables multi-GPU distributed inference using NVIDIA NCCL (NVIDIA Collective Communications Library). Implements tensor parallelism for splitting large models across multiple GPUs.
Requirements:
- `cuda` feature (automatically enabled)
- Multiple NVIDIA GPUs
- NCCL library
- World size must be a power of 2 (1, 2, 4, 8, etc.)
Usage:

```bash
cargo build --release --features "cuda nccl"

# Run with a specific GPU count
MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct
```
Environment Variables:

| Variable | Description |
|---|---|
| `MISTRALRS_MN_LOCAL_WORLD_SIZE` | Number of GPUs to use (defaults to all) |
| `MISTRALRS_NO_NCCL=1` | Disable NCCL and use device mapping instead |
Multi-node setup requires additional environment variables. See NCCL documentation for details.
Note: When NCCL is enabled, automatic device mapping is disabled.
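For example, to run a binary built with `nccl` but fall back to device mapping at runtime, use the variable from the table above:

```bash
# Run without NCCL even though the feature is compiled in
MISTRALRS_NO_NCCL=1 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct
```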
### `ring`
Enables distributed tensor-parallel inference using a TCP-based ring topology. Works across multiple machines without requiring NCCL.
Requirements:
- World size must be a power of 2 (2, 4, 8, etc.)
- TCP ports must be open between nodes
Usage:

```bash
cargo build --release --features ring

# Configure via JSON file
export RING_CONFIG=path/to/ring_config.json
mistralrs serve -m model-id
```
Configuration:
Create a JSON configuration file for each process:
```json
{
  "master_ip": "0.0.0.0",
  "master_port": 1234,
  "port": 12345,
  "right_port": 12346,
  "rank": 0,
  "world_size": 2
}
```
| Field | Description |
|---|---|
| `master_ip` | IP address of the master node |
| `master_port` | Port of the master node |
| `port` | Local port for incoming connections |
| `right_port` | Port of the right neighbor in the ring |
| `right_ip` | IP of the right neighbor (optional, defaults to localhost) |
| `rank` | Process rank (0 to world_size - 1) |
| `world_size` | Total number of processes (must be a power of 2) |
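For a 2-process ring on a single machine, each process needs its own config file (differing in `rank` and ports) referenced through `RING_CONFIG`. A minimal sketch with hypothetical file names:

```bash
# Rank 0 in the background, rank 1 in the foreground;
# ring_rank0.json and ring_rank1.json follow the schema above
RING_CONFIG=ring_rank0.json mistralrs serve -m model-id &
RING_CONFIG=ring_rank1.json mistralrs serve -m model-id
```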
See Ring documentation for detailed setup instructions.
## Feature Combinations
### Recommended Combinations by Hardware
| Hardware | Recommended Features |
|---|---|
| NVIDIA Ampere+ (RTX 30/40, A100) | `cuda cudnn flash-attn` |
| NVIDIA Hopper (H100) | `cuda cudnn flash-attn-v3` |
| NVIDIA older GPUs | `cuda cudnn` |
| Apple Silicon | `metal accelerate` |
| Intel CPU | `mkl` |
| Generic CPU | (no features needed) |
| Multi-GPU NVIDIA | `cuda cudnn flash-attn nccl` |
| Multi-node/cross-platform | `ring` (plus GPU features) |
### Installation Examples
```bash
# NVIDIA GPU with all optimizations
cargo install mistralrs-cli --features "cuda cudnn flash-attn"

# Apple Silicon
cargo install mistralrs-cli --features "metal accelerate"

# Intel CPU
cargo install mistralrs-cli --features "mkl"

# Multi-GPU NVIDIA setup
cargo install mistralrs-cli --features "cuda cudnn flash-attn nccl"

# Build from source with CUDA
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo build --release --features "cuda cudnn flash-attn"
```
## Internal Features
These features are primarily for library development and are not typically used directly:
| Feature | Description |
|---|---|
| `pyo3_macros` | Python bindings support (used by mistralrs-pyo3) |
| `utoipa` | OpenAPI documentation generation |
## Python Package Features
The Python SDK is distributed as separate packages with features pre-configured:
| Package | Equivalent Features |
|---|---|
| `mistralrs-cuda` | `cuda cudnn flash-attn` |
| `mistralrs-metal` | `metal accelerate` |
| `mistralrs-mkl` | `mkl` |
| `mistralrs` | CPU only |
```bash
pip install mistralrs-cuda   # NVIDIA GPUs
pip install mistralrs-metal  # Apple Silicon
pip install mistralrs-mkl    # Intel CPUs
pip install mistralrs        # Generic CPU
```
## Troubleshooting
### Diagnosing Issues
Use `mistralrs doctor` to diagnose your system configuration and verify that features are working correctly:
```bash
mistralrs doctor
```
This command checks:
- Detected hardware (GPUs, CPU features)
- Installed libraries (CUDA, cuDNN, etc.)
- Feature compatibility
- Common configuration issues
### Feature not working
1. Run `mistralrs doctor` to check system configuration
2. Verify the feature is enabled in your build: `cargo build --release --features "your-features" -v`
3. Check hardware compatibility (especially for `flash-attn`)
4. Ensure required libraries are installed (CUDA, cuDNN, MKL, etc.)
### Conflicting features

- `flash-attn` and `flash-attn-v3` are mutually exclusive
- `metal` is macOS-only; don't use with `cuda`
- `nccl` requires `cuda`
### Build errors

- CUDA not found: Ensure the CUDA toolkit is installed and `nvcc` is in `PATH`
- MKL not found: Install Intel oneAPI or standalone MKL
- Metal errors on Linux: Remove the `metal` feature (macOS only)
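For the common `nvcc` case, exporting the toolkit's `bin` directory is usually enough. A sketch assuming the default install path:

```bash
# Make nvcc visible to the build (default CUDA install path assumed)
export PATH="/usr/local/cuda/bin:$PATH"
```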
See Troubleshooting for more solutions.