Custom GEMV (General Matrix-Vector multiplication) for decode-phase inference.
This module provides an optimized GEMV kernel that replaces cuBLAS for small batch sizes (1-8) where cuBLAS GEMM overhead is significant.
Key optimizations (illustrated by the sketch after this list):
- Vectorized loads (half2, nv_bfloat162, float2)
- __ldg() for read-only cache path (L2 cache handles x reuse)
- Warp-level reduction using XOR shuffle
- Static shared memory for block-level reduction
- Supports batch sizes 1-8 efficiently
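As a concrete illustration of the optimizations above, here is a minimal CUDA sketch of the batch-size-1 case. It is not the crate's actual kernel: the function name, the 128-thread-per-row launch shape, and the assumptions of f16 weights and an even column count are illustrative only.

```cuda
// Sketch: one thread block per output row. Each thread accumulates a partial
// dot product over half2 loads via __ldg(), warps reduce with XOR shuffles,
// and a small static shared array combines the per-warp partials.
#include <cuda_fp16.h>

__global__ void gemv_f16_sketch(const half* __restrict__ W,  // [rows, cols], row-major
                                const half* __restrict__ x,  // [cols]
                                float* __restrict__ y,       // [rows]
                                int cols) {                   // assumed even
    const int row  = blockIdx.x;    // one block per output row
    const int tid  = threadIdx.x;   // blockDim.x == 128 (4 warps)
    const int lane = tid & 31;
    const int warp = tid >> 5;

    float acc = 0.0f;
    // Vectorized loads: walk the row in half2 (4-byte) steps.
    const half2* w2 = reinterpret_cast<const half2*>(W + (size_t)row * cols);
    const half2* x2 = reinterpret_cast<const half2*>(x);
    for (int i = tid; i < cols / 2; i += blockDim.x) {
        float2 wf = __half22float2(__ldg(&w2[i]));  // read-only cache path
        float2 xf = __half22float2(__ldg(&x2[i]));
        acc += wf.x * xf.x + wf.y * xf.y;
    }

    // Warp-level reduction using XOR (butterfly) shuffle.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_xor_sync(0xffffffffu, acc, offset);

    // Static shared memory for the block-level (cross-warp) reduction.
    __shared__ float warp_sums[4];
    if (lane == 0) warp_sums[warp] = acc;
    __syncthreads();

    if (tid == 0)
        y[row] = warp_sums[0] + warp_sums[1] + warp_sums[2] + warp_sums[3];
}
```

A launch along the lines of `gemv_f16_sketch<<<rows, 128>>>(W, x, y, cols)` covers a single input vector; the batched decode path (up to MAX_GEMV_BATCH_SIZE vectors) would extend the inner loop over each column of x.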
Structs§
- GemvController - Controller for enabling/disabling custom GEMV kernel.
Constants§
- MAX_GEMV_BATCH_SIZE - Maximum batch size supported by the GEMV kernel
Statics§
- GEMV_CONTROLLER - Global controller for the custom GEMV kernel.
Functions§
- gemv - Fallback for non-CUDA builds
- should_use_gemv - Fallback for non-CUDA builds