Cargo Features Reference

This document provides a complete reference for all cargo features available in mistral.rs.

Quick Reference

| Feature | Description | Platform | Requires |
|---------|-------------|----------|----------|
| cuda | NVIDIA GPU acceleration | Linux, Windows | CUDA toolkit |
| cudnn | NVIDIA cuDNN backend | Linux, Windows | cuda, cuDNN |
| flash-attn | FlashAttention V2 | Linux, Windows | cuda, CC >= 8.0 |
| flash-attn-v3 | FlashAttention V3 | Linux, Windows | cuda, CC >= 9.0 |
| metal | Apple GPU acceleration | macOS | - |
| accelerate | Apple CPU acceleration | macOS | - |
| mkl | Intel MKL acceleration | Linux, Windows | Intel MKL |
| nccl | Multi-GPU (NVIDIA NCCL) | Linux | cuda, NCCL |
| ring | Multi-GPU/node (TCP ring) | All | - |

GPU Acceleration Features

cuda

Enables NVIDIA GPU acceleration via CUDA. This is the primary feature for running on NVIDIA GPUs.

Requirements:

  • NVIDIA GPU
  • CUDA toolkit installed
  • Linux or Windows (WSL supported)
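
Before building, you can confirm the toolkit and driver are visible; these are standard NVIDIA tools, independent of mistral.rs:

# Check toolkit and driver visibility
nvcc --version
nvidia-smi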

Usage:

cargo build --release --features cuda
cargo install mistralrs-cli --features cuda

What it enables:

  • GPU tensor operations via CUDA
  • PagedAttention on CUDA devices
  • Quantized inference on GPU

cudnn

Enables NVIDIA cuDNN for optimized neural network primitives. Provides faster convolutions and other operations.

Requirements:

  • cuda feature
  • cuDNN library installed
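
On Linux, a quick (unofficial) way to check that the cuDNN shared library is registered with the dynamic loader:

# Look for libcudnn in the loader cache
ldconfig -p | grep libcudnn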

Usage:

cargo build --release --features "cuda cudnn"

flash-attn

Enables FlashAttention V2 for faster attention computation. Significantly reduces memory usage and improves throughput.

Requirements:

  • cuda feature (automatically enabled)
  • GPU with compute capability >= 8.0 (Ampere or newer)

Compatible GPUs:

| Architecture | Compute Capability | Example GPUs |
|--------------|--------------------|--------------|
| Ampere | 8.0, 8.6 | RTX 30 series, A100, A40 |
| Ada Lovelace | 8.9 | RTX 40 series, L40S |
| Blackwell | 10.0, 12.0 | RTX 50 series |
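
Recent NVIDIA drivers can report compute capability directly (the compute_cap query field requires a fairly new driver; older drivers omit it):

# Print each GPU's name and compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv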

Usage:

cargo build --release --features "cuda flash-attn cudnn"

Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.


flash-attn-v3

Enables FlashAttention V3 for Hopper architecture GPUs. Provides additional performance improvements over V2 on supported hardware.

Requirements:

  • cuda feature (automatically enabled)
  • GPU with compute capability >= 9.0 (Hopper)

Compatible GPUs:

| Architecture | Compute Capability | Example GPUs |
|--------------|--------------------|--------------|
| Hopper | 9.0 | H100, H800 |

Usage:

cargo build --release --features "cuda flash-attn-v3 cudnn"

Note: FlashAttention V2 and V3 are mutually exclusive. Do not enable both.


metal

Enables Apple Metal GPU acceleration for macOS devices.

Requirements:

  • macOS (Metal is not available on Linux or Windows)
  • Apple Silicon or AMD GPU

Usage:

cargo build --release --features metal

What it enables:

  • GPU tensor operations via Metal
  • PagedAttention on Metal devices (opt-in via --paged-attn)
  • Quantized inference on Apple GPUs

Note: PagedAttention is disabled by default on Metal. Enable with --paged-attn flag.
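
For example, combining the serve command shown elsewhere in this document with the opt-in flag:

mistralrs serve -m model-id --paged-attn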


CPU Acceleration Features

accelerate

Enables Apple’s Accelerate framework for optimized CPU operations on macOS.

Requirements:

  • macOS

Usage:

cargo build --release --features accelerate
# Or combined with Metal:
cargo build --release --features "metal accelerate"

mkl

Enables Intel Math Kernel Library (MKL) for optimized CPU operations.

Requirements:

  • Intel MKL installed
  • Intel CPU recommended (MKL runs on AMD CPUs but is tuned for Intel)
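
If MKL comes from Intel oneAPI, the build environment typically needs to be initialized first; a sketch assuming the default oneAPI install path:

# Make MKL headers and libraries discoverable before building
source /opt/intel/oneapi/setvars.sh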

Usage:

cargo build --release --features mkl

Distributed Inference Features

nccl

Enables multi-GPU distributed inference using NVIDIA NCCL (NVIDIA Collective Communications Library). Implements tensor parallelism for splitting large models across multiple GPUs.

Requirements:

  • cuda feature (automatically enabled)
  • Multiple NVIDIA GPUs
  • NCCL library
  • World size must be a power of 2 (1, 2, 4, 8, etc.)

Usage:

cargo build --release --features "cuda nccl"

# Run with specific GPU count
MISTRALRS_MN_LOCAL_WORLD_SIZE=2 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct

Environment Variables:

| Variable | Description |
|----------|-------------|
| MISTRALRS_MN_LOCAL_WORLD_SIZE | Number of GPUs to use (defaults to all) |
| MISTRALRS_NO_NCCL=1 | Disable NCCL and use device mapping instead |
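
For example, to fall back to device mapping on a machine where NCCL is unavailable, without rebuilding:

# Disable NCCL for this run only
MISTRALRS_NO_NCCL=1 mistralrs serve -m Qwen/Qwen3-30B-A3B-Instruct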

Multi-node setup requires additional environment variables. See NCCL documentation for details.

Note: When NCCL is enabled, automatic device mapping is disabled.


ring

Enables distributed tensor-parallel inference using a TCP-based ring topology. Works across multiple machines without requiring NCCL.

Requirements:

  • World size must be a power of 2 (2, 4, 8, etc.)
  • TCP ports must be open between nodes

Usage:

cargo build --release --features ring

# Configure via JSON file
export RING_CONFIG=path/to/ring_config.json
mistralrs serve -m model-id

Configuration:

Create a JSON configuration file for each process:

{
  "master_ip": "0.0.0.0",
  "master_port": 1234,
  "port": 12345,
  "right_port": 12346,
  "rank": 0,
  "world_size": 2
}

| Field | Description |
|-------|-------------|
| master_ip | IP address of the master node |
| master_port | Port of the master node |
| port | Local port for incoming connections |
| right_port | Port of the right neighbor in the ring |
| right_ip | IP of the right neighbor (optional, defaults to localhost) |
| rank | Process rank (0 to world_size - 1) |
| world_size | Total number of processes (must be a power of 2) |
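
For a two-process ring, the rank-1 configuration mirrors the ports of the rank-0 example above, so each process listens on the port its left neighbor targets; an illustrative companion config:

{
  "master_ip": "0.0.0.0",
  "master_port": 1234,
  "port": 12346,
  "right_port": 12345,
  "rank": 1,
  "world_size": 2
}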

See Ring documentation for detailed setup instructions.


Feature Combinations

| Hardware | Recommended Features |
|----------|----------------------|
| NVIDIA Ampere+ (RTX 30/40, A100) | cuda cudnn flash-attn |
| NVIDIA Hopper (H100) | cuda cudnn flash-attn-v3 |
| NVIDIA older GPUs | cuda cudnn |
| Apple Silicon | metal accelerate |
| Intel CPU | mkl |
| Generic CPU | (no features needed) |
| Multi-GPU NVIDIA | cuda cudnn flash-attn nccl |
| Multi-node/cross-platform | ring (plus GPU features) |

Installation Examples

# NVIDIA GPU with all optimizations
cargo install mistralrs-cli --features "cuda cudnn flash-attn"

# Apple Silicon
cargo install mistralrs-cli --features "metal accelerate"

# Intel CPU
cargo install mistralrs-cli --features "mkl"

# Multi-GPU NVIDIA setup
cargo install mistralrs-cli --features "cuda cudnn flash-attn nccl"

# Build from source with CUDA
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo build --release --features "cuda cudnn flash-attn"

Internal Features

These features are primarily for library development and are not typically used directly:

| Feature | Description |
|---------|-------------|
| pyo3_macros | Python bindings support (used by mistralrs-pyo3) |
| utoipa | OpenAPI documentation generation |

Python Package Features

The Python SDK is distributed as separate packages with features pre-configured:

| Package | Equivalent Features |
|---------|---------------------|
| mistralrs-cuda | cuda cudnn flash-attn |
| mistralrs-metal | metal accelerate |
| mistralrs-mkl | mkl |
| mistralrs | CPU only |

pip install mistralrs-cuda    # NVIDIA GPUs
pip install mistralrs-metal   # Apple Silicon
pip install mistralrs-mkl     # Intel CPUs
pip install mistralrs         # Generic CPU

Troubleshooting

Diagnosing Issues

Use mistralrs doctor to diagnose your system configuration and verify features are working correctly:

mistralrs doctor

This command checks:

  • Detected hardware (GPUs, CPU features)
  • Installed libraries (CUDA, cuDNN, etc.)
  • Feature compatibility
  • Common configuration issues

Feature not working

  1. Run mistralrs doctor to check system configuration

  2. Verify the feature is enabled in your build (see the feature-inspection sketch after this list):

    cargo build --release --features "your-features" -v
    
  3. Check hardware compatibility (especially for flash-attn)

  4. Ensure required libraries are installed (CUDA, cuDNN, MKL, etc.)
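
One way to inspect which features cargo actually resolved is cargo tree's feature edge view (output is verbose, so filter it):

# Show the resolved feature graph for mistral.rs crates
cargo tree -e features | grep mistralrs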

Conflicting features

  • flash-attn and flash-attn-v3 are mutually exclusive
  • metal is macOS-only; don’t use with cuda
  • nccl requires cuda

Build errors

  • CUDA not found: Ensure the CUDA toolkit is installed and nvcc is in PATH (see the sketch below this list)
  • MKL not found: Install Intel oneAPI or standalone MKL
  • Metal errors on Linux: Remove metal feature (macOS only)
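
For the "CUDA not found" case, exposing a conventionally located toolkit on Linux usually looks like this (paths are the common defaults; adjust for your install):

# Put nvcc and the CUDA libraries on the search paths
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH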

See Troubleshooting for more solutions.