Module gemv


Custom GEMV (General Matrix-Vector multiplication) for decode-phase inference.

This module provides an optimized GEMV kernel that replaces cuBLAS for small batch sizes (1-8) where cuBLAS GEMM overhead is significant.
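For orientation, the operation being accelerated can be written as a plain CPU reference. This is only a sketch: the function name `gemv_reference`, the row-major weight layout, and the `[batch x cols]` input layout are assumptions made for illustration, not the signature or memory layout this module uses.

```rust
/// Reference (CPU) GEMV for a small batch: for each batch row b, y[b] = W * x[b].
///
/// Assumed layouts (hypothetical, for illustration only):
/// - `w`: row-major, shape [rows x cols]
/// - `x`: row-major, shape [batch x cols]
/// - result: row-major, shape [batch x rows]
pub fn gemv_reference(w: &[f32], x: &[f32], rows: usize, cols: usize, batch: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols, "weight matrix has the wrong length");
    assert_eq!(x.len(), batch * cols, "input batch has the wrong length");

    let mut y = vec![0.0f32; batch * rows];
    for b in 0..batch {
        let xb = &x[b * cols..(b + 1) * cols];
        for r in 0..rows {
            let wr = &w[r * cols..(r + 1) * cols];
            // Dot product of one weight row with one input vector.
            y[b * rows + r] = wr.iter().zip(xb).map(|(a, c)| a * c).sum();
        }
    }
    y
}
```

Every weight element is used once per batch row and never again, so at small batch sizes the operation is limited by how quickly the weights can be read. That is why the optimizations listed for this kernel concentrate on loads and reductions.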

Key optimizations:

  • Vectorized loads (half2, nv_bfloat162, float2)
  • __ldg() loads through the read-only data cache path (the L2 cache handles reuse of x)
  • Warp-level reduction using XOR shuffle
  • Static shared memory for block-level reduction
  • Supports batch sizes 1-8 efficiently
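The XOR-shuffle reduction in the list above can be emulated on the CPU to show the data movement. The sketch below is not the kernel's code: `warp_reduce_sum_xor` and `WARP_SIZE` are hypothetical names, and the array stands in for the 32 lanes of a warp. On the GPU, the exchange in each step would typically be done with `__shfl_xor_sync`.

```rust
/// Number of lanes in the emulated warp (32 on current NVIDIA GPUs).
const WARP_SIZE: usize = 32;

/// Emulates a warp-level sum using the XOR (butterfly) exchange pattern.
///
/// In each step, lane `i` adds the value held by lane `i ^ offset`.
/// After log2(WARP_SIZE) steps, every lane holds the sum of all lanes.
pub fn warp_reduce_sum_xor(mut lanes: [f32; WARP_SIZE]) -> [f32; WARP_SIZE] {
    let mut offset = WARP_SIZE / 2;
    while offset > 0 {
        // Snapshot the current values: on hardware all lanes exchange at once.
        let prev = lanes;
        for lane in 0..WARP_SIZE {
            lanes[lane] = prev[lane] + prev[lane ^ offset];
        }
        offset /= 2;
    }
    lanes
}
```

Because the butterfly pattern leaves the full sum in every lane, no separate broadcast is needed before per-warp partial sums are combined through shared memory at block level.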

Structs§

GemvController
Controller for enabling/disabling custom GEMV kernel.

Constants§

MAX_GEMV_BATCH_SIZE
Maximum batch size supported by the GEMV kernel.

Statics§

GEMV_CONTROLLER
Global controller for the custom GEMV kernel.

Functions§

gemv
Fallback for non-CUDA builds
should_use_gemv
Fallback for non-CUDA builds
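How a controller, a batch-size limit, and a dispatch check might fit together is sketched below. All names here (`SketchController`, `SKETCH_CONTROLLER`, `SKETCH_MAX_BATCH`, `sketch_should_use_gemv`) are hypothetical stand-ins, and the limit of 8 is assumed from the "batch sizes 1-8" description. The real signatures of `GemvController`, `should_use_gemv`, and `gemv` are not shown on this page and may differ.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Assumed upper bound, taken from the "batch sizes 1-8" description.
const SKETCH_MAX_BATCH: usize = 8;

/// Hypothetical stand-in for a controller that toggles the custom kernel.
pub struct SketchController {
    enabled: AtomicBool,
}

impl SketchController {
    pub const fn new(enabled: bool) -> Self {
        Self { enabled: AtomicBool::new(enabled) }
    }

    pub fn set_enabled(&self, on: bool) {
        self.enabled.store(on, Ordering::Relaxed);
    }

    pub fn is_enabled(&self) -> bool {
        self.enabled.load(Ordering::Relaxed)
    }
}

/// Hypothetical global instance, mirroring the idea of a static controller.
pub static SKETCH_CONTROLLER: SketchController = SketchController::new(true);

/// Returns true when the custom kernel would be chosen in this sketch:
/// CUDA is available, the controller is enabled, and the batch is in 1..=8.
pub fn sketch_should_use_gemv(batch: usize, cuda_available: bool) -> bool {
    cuda_available
        && SKETCH_CONTROLLER.is_enabled()
        && (1..=SKETCH_MAX_BATCH).contains(&batch)
}
```

In this sketch a non-CUDA build corresponds to passing `cuda_available = false`, so the check always returns `false` and callers fall through to their default matrix-multiplication path.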