Skip to content

Use flash attention

Flash attention is a fused attention kernel that reduces memory traffic. mistral.rs supports two versions:

  • flash-attn (v2): compute capability 8.0+ (Ampere and newer).
  • flash-attn-v3: compute capability 9.0 (Hopper) only.

Flash attention is a Cargo feature. The install script enables it when a supported GPU is detected. From source:

Terminal window
# Ampere, Ada, older Hopper
cargo install --path mistralrs-cli --features "cuda flash-attn cudnn"
# Hopper (H100), for v3
cargo install --path mistralrs-cli --features "cuda flash-attn flash-attn-v3 cudnn"

mistralrs doctor lists compiled features.

Flash and paged attention compose. Both can be on simultaneously.

See the paged attention guide.