Custom GEMV (General Matrix-Vector multiplication) for decode-phase inference.
This module provides an optimized GEMV kernel that replaces cuBLAS for small batch sizes (1-8) where cuBLAS GEMM overhead is significant.
Key optimizations (illustrated by the sketch after this list):
- Vectorized loads (half2, nv_bfloat162, float2)
- __ldg() for read-only cache path (L2 cache handles x reuse)
- Warp-level reduction using XOR shuffle
- Static shared memory for block-level reduction
- Supports batch sizes 1-8 efficiently
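As a concrete illustration of the optimizations above, here is a minimal CUDA sketch of the batch-size-1 case. It is not the crate's actual kernel: the function name, the 128-thread-per-row launch shape, and the assumptions of f16 weights and an even column count are illustrative only.

```cuda
// Sketch: one thread block per output row. Each thread accumulates a partial
// dot product over half2 loads via __ldg(), warps reduce with XOR shuffles,
// and a small static shared array combines the per-warp partials.
#include <cuda_fp16.h>

__global__ void gemv_f16_sketch(const half* __restrict__ W,  // [rows, cols], row-major
                                const half* __restrict__ x,  // [cols]
                                float* __restrict__ y,       // [rows]
                                int cols) {                   // assumed even
    const int row  = blockIdx.x;    // one block per output row
    const int tid  = threadIdx.x;   // blockDim.x == 128 (4 warps)
    const int lane = tid & 31;
    const int warp = tid >> 5;

    float acc = 0.0f;
    // Vectorized loads: walk the row in half2 (4-byte) steps.
    const half2* w2 = reinterpret_cast<const half2*>(W + (size_t)row * cols);
    const half2* x2 = reinterpret_cast<const half2*>(x);
    for (int i = tid; i < cols / 2; i += blockDim.x) {
        float2 wf = __half22float2(__ldg(&w2[i]));  // read-only cache path
        float2 xf = __half22float2(__ldg(&x2[i]));
        acc += wf.x * xf.x + wf.y * xf.y;
    }

    // Warp-level reduction using XOR (butterfly) shuffle.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_xor_sync(0xffffffffu, acc, offset);

    // Static shared memory for the block-level (cross-warp) reduction.
    __shared__ float warp_sums[4];
    if (lane == 0) warp_sums[warp] = acc;
    __syncthreads();

    if (tid == 0)
        y[row] = warp_sums[0] + warp_sums[1] + warp_sums[2] + warp_sums[3];
}
```

A launch along the lines of `gemv_f16_sketch<<<rows, 128>>>(W, x, y, cols)` covers a single input vector; the batched decode path (up to MAX_GEMV_BATCH_SIZE vectors) would extend the inner loop over each column of x.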
Structs§
- GemvController - Controller for enabling/disabling custom GEMV kernel.
Constants§
- MAX_GEMV_BATCH_SIZE - Maximum batch size supported by the GEMV kernel
Statics§
- GEMV_CONTROLLER - Global controller for the custom GEMV kernel.
Functions§
- gemv - Fallback for non-CUDA builds
- should_use_gemv - Fallback for non-CUDA builds