pub fn sdpa(
q: &Tensor,
k: &Tensor,
v: &Tensor,
scale: f32,
softcapping: f32,
) -> Result<Tensor>
Scaled dot product attention with a fused kernel.
Computes softmax(qk^T*scale)v.
Input shapes:
- `q`: (bs, qhead, seq, hidden)
- `k`: (bs, kv_head, kv_seq, hidden)
- `v`: (bs, kv_head, kv_seq, v_hidden)

`scale` is applied before the softmax.

If `softcapping` != 1.0, the computation is: softmax(tanh(qk^T*scale/cap)*cap)v
Output shape: (bs, qhead, seq, v_hidden)
Supported head dims: 32, 64, 96, 128, 256.
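For reference, the following is a hedged, non-fused sketch of the same math using plain candle ops. The helper name `naive_sdpa`, the `candle_core`/`candle_nn` imports, and the assumption that `qhead == kv_head` (no GQA handling) are illustrative only; it just restates the formula above, including the soft-capping branch.

```rust
// Hedged reference sketch (not the fused kernel): the same math in plain candle ops.
// Assumes qhead == kv_head; `naive_sdpa` is an illustrative name, not part of this API.
use candle_core::{D, Result, Tensor};
use candle_nn::ops::softmax;

fn naive_sdpa(q: &Tensor, k: &Tensor, v: &Tensor, scale: f32, softcapping: f32) -> Result<Tensor> {
    // qk^T * scale
    let mut att = (q.matmul(&k.transpose(D::Minus2, D::Minus1)?)? * scale as f64)?;
    // Optional soft-capping: tanh(att / cap) * cap
    if softcapping != 1.0 {
        att = ((att / softcapping as f64)?.tanh()? * softcapping as f64)?;
    }
    // Softmax over the kv_seq dimension, then weight the values
    softmax(&att, D::Minus1)?.matmul(v)
}
```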
On Metal:
- If `seq == 1`:
  - Uses a vectorized kernel
  - Supports `seq != kv_seq` (cross-attention support)
  - Supports GQA when `qhead` is a multiple of `kv_head`
- Otherwise:
  - Uses an alternate kernel
  - Requires `seq == kv_seq`
  - GQA is not supported (requires `qhead == kv_head`)
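A minimal usage sketch for the `seq == 1` (decode) path described above. Everything beyond the documented shapes is an assumption: the `candle_core` imports, the Metal device index, and that `sdpa` is in scope from the crate that exports it.

```rust
// Usage sketch for the seq == 1 decode path on Metal, using the shapes documented above.
// Assumptions: `sdpa` is in scope, `candle_core` provides Tensor/Device, and Metal device 0 exists.
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::new_metal(0)?; // assumption: a Metal device is available
    let (bs, qhead, kv_head, seq, kv_seq, hidden) = (1, 32, 8, 1, 256, 128);

    // q: (bs, qhead, seq, hidden); seq == 1 selects the vectorized kernel
    let q = Tensor::randn(0f32, 1f32, (bs, qhead, seq, hidden), &dev)?;
    // k, v: (bs, kv_head, kv_seq, hidden); GQA since qhead = 4 * kv_head
    let k = Tensor::randn(0f32, 1f32, (bs, kv_head, kv_seq, hidden), &dev)?;
    let v = Tensor::randn(0f32, 1f32, (bs, kv_head, kv_seq, hidden), &dev)?;

    let scale = 1.0 / (hidden as f32).sqrt();
    let out = sdpa(&q, &k, &v, scale, 1.0)?; // softcapping == 1.0 disables capping

    // Output: (bs, qhead, seq, v_hidden); here v_hidden == hidden
    assert_eq!(out.dims4()?, (bs, qhead, seq, hidden));
    Ok(())
}
```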