FlashAttention in mistral.rs

Mistral.rs supports FlashAttention V2 and V3 on CUDA devices (V3 is only supported on GPUs with compute capability (CC) >= 9.0).

Note: If mistral.rs is compiled with FlashAttention and PagedAttention is enabled, FlashAttention is used in tandem with PagedAttention to accelerate the prefill phase.

GPU Architecture Compatibility

| Architecture | Compute Capability | Example GPUs | Feature Flag |
|---|---|---|---|
| Ampere | 8.0, 8.6 | RTX 30*, A100, A40 | `--features flash-attn` |
| Ada Lovelace | 8.9 | RTX 40*, L40S | `--features flash-attn` |
| Hopper | 9.0 | H100, H800 | `--features flash-attn-v3` |
| Blackwell | 10.0, 12.0 | RTX 50* | `--features flash-attn` |
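
As a rough sketch of how these flags are applied, the server can be built with the flag matching the target GPU. This assumes the standard cargo workflow from the repository root and that the `cuda` feature is also wanted; adjust the feature list to your setup.

```bash
# Ampere / Ada Lovelace / Blackwell: FlashAttention V2 (sketch, features may vary)
cargo build --release --features "cuda flash-attn"

# Hopper (CC >= 9.0): FlashAttention V3
cargo build --release --features "cuda flash-attn-v3"
```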

Note: FlashAttention V2 and V3 are mutually exclusive.

Note: To use FlashAttention in the Python SDK, compile from source.
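
One possible way to compile the Python bindings from source with FlashAttention enabled is via maturin; the crate directory and feature names below are assumptions, so check the repository's Python SDK instructions for the exact steps.

```bash
# Sketch: build and install the Python bindings locally with FlashAttention.
# The mistralrs-pyo3 path and feature set are assumptions; verify against the repo.
cd mistralrs-pyo3
maturin develop --release --features "cuda flash-attn"
```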