Use flash attention
Flash attention is a fused attention kernel that reduces memory traffic. mistral.rs supports two versions:
- flash-attn (v2): compute capability 8.0+ (Ampere and newer).
- flash-attn-v3: compute capability 9.0 (Hopper) only.
Enabling at build time
Flash attention is a Cargo feature. The install script enables it when a supported GPU is detected. From source:
    # Ampere, Ada, older Hopper
    cargo install --path mistralrs-cli --features "cuda flash-attn cudnn"

    # Hopper (H100), for v3
    cargo install --path mistralrs-cli --features "cuda flash-attn flash-attn-v3 cudnn"

mistralrs doctor lists compiled features.
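The mapping from compute capability to feature list can be sketched as a small shell helper. This is an illustration only, not the install script's actual logic: the function name features_for_cc is hypothetical, and the pre-Ampere fallback of "cuda cudnn" is an assumption (the guide only states which GPUs the flash attention features support).

```shell
# Hypothetical helper: print a Cargo feature list for a CUDA compute
# capability such as "8.6". Not part of mistral.rs.
features_for_cc() {
  cc="$1"
  major="${cc%%.*}"   # text before the first dot, e.g. "8" from "8.6"

  # Reject empty or non-numeric input instead of guessing.
  case "$major" in
    ''|*[!0-9]*)
      echo "unrecognized compute capability: $cc" >&2
      return 1
      ;;
  esac

  case "$cc" in
    9.0)
      # Hopper: v3 is built alongside v2, as in the command above.
      echo "cuda flash-attn flash-attn-v3 cudnn"
      ;;
    *)
      if [ "$major" -ge 8 ]; then
        # Compute capability 8.0+ other than 9.0: v2 only.
        echo "cuda flash-attn cudnn"
      else
        # Assumption: older GPUs build without any flash attention feature.
        echo "cuda cudnn"
      fi
      ;;
  esac
}
```

On recent NVIDIA drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` prints the compute capability, so its first line of output could be passed to the helper and the result given to `--features`.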
Composition with paged attention
Flash and paged attention compose. Both can be on simultaneously, but they are not the same backend:
- flash-attn and flash-attn-v3 are Cargo features for the standard attention path and fallback varlen paths.
- FlashInfer paged decode and prefill kernels are built with the cuda feature as part of PagedAttention.
On CUDA with PagedAttention enabled, mistral.rs uses the FlashInfer paged layout and decode kernel for compatible KV caches by default. Set MISTRALRS_FLASHINFER_DECODE=0 only when debugging or comparing against the generic paged path.
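When comparing against the generic paged path, it can help to scope the override to a single command so it does not leak into the rest of the shell session. A minimal sketch, assuming a POSIX shell; the wrapper name without_flashinfer_decode is hypothetical and not part of mistral.rs:

```shell
# Hypothetical wrapper: run one command with the FlashInfer decode kernel
# disabled, leaving the calling shell's environment untouched.
without_flashinfer_decode() {
  env MISTRALRS_FLASHINFER_DECODE=0 "$@"
}
```

Prefix the usual mistralrs invocation with the wrapper for the debugging run, then run the same command without it to get the default FlashInfer behavior for comparison.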
See the paged attention guide.