Structs
- This layer has a weight that is parallelized along the output dimension, taking the “full” input dimension.
- This layer has no parallelization; every rank holds the full weight.
- This layer has a weight that is parallelized along the input dimension, returning the “full” output dimension.
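The difference between the two parallelized layers can be illustrated with a minimal sketch on plain vectors. Everything here is an assumption made for illustration: the helper names (`shard_output_dim`, `shard_input_dim`, `column_parallel_forward`, `row_parallel_forward`) are hypothetical and are not this crate's API, the weight is stored as `out x in`, both dimensions are assumed to divide evenly by `world_size`, and the ranks are simulated in a loop rather than run on separate devices. The sum in the input-dimension case stands in for what would normally be an all-reduce across ranks.

```rust
/// Dense matrix-vector product: `w` is (out x in), `x` has length `in`.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Hypothetical helper: the rows of `w` owned by `rank` when the weight is
/// split along the output dimension. Assumes `out % world_size == 0`.
fn shard_output_dim(w: &[Vec<f32>], rank: usize, world_size: usize) -> Vec<Vec<f32>> {
    let chunk = w.len() / world_size;
    w[rank * chunk..(rank + 1) * chunk].to_vec()
}

/// Hypothetical helper: the columns of `w` owned by `rank` when the weight is
/// split along the input dimension. Assumes `in % world_size == 0`.
fn shard_input_dim(w: &[Vec<f32>], rank: usize, world_size: usize) -> Vec<Vec<f32>> {
    let chunk = w[0].len() / world_size;
    w.iter()
        .map(|row| row[rank * chunk..(rank + 1) * chunk].to_vec())
        .collect()
}

/// Output-dimension split: every rank sees the full input and produces a
/// slice of the output; the slices are concatenated.
fn column_parallel_forward(w: &[Vec<f32>], x: &[f32], world_size: usize) -> Vec<f32> {
    (0..world_size)
        .flat_map(|rank| matvec(&shard_output_dim(w, rank, world_size), x))
        .collect()
}

/// Input-dimension split: every rank sees a slice of the input and produces a
/// partial result of full output size; the partial results are summed.
fn row_parallel_forward(w: &[Vec<f32>], x: &[f32], world_size: usize) -> Vec<f32> {
    let chunk = x.len() / world_size;
    let mut out = vec![0.0_f32; w.len()];
    for rank in 0..world_size {
        let x_local = &x[rank * chunk..(rank + 1) * chunk];
        let partial = matvec(&shard_input_dim(w, rank, world_size), x_local);
        for (o, p) in out.iter_mut().zip(partial) {
            *o += p;
        }
    }
    out
}
```

Under these assumptions both simulated forwards agree with the unsharded `matvec`, which is the property that lets the two layouts be used interchangeably from the caller's point of view.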
Functions
- Compute the appropriate KV shard. This handles KV head replication. Be sure to use compute_n_kv_groups in tandem.
- Compute the number of KV groups, taking into account KV head replication.
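As a rough illustration of why the two computations belong together, the sketch below shows one common way KV head replication can work. It is an assumption, not this crate's implementation: `KvShard`, `kv_shard_sketch`, and `n_kv_groups_sketch` are hypothetical names, the real functions may take different arguments (for example a communicator instead of explicit `rank` and `world_size`), and all divisions are assumed to be exact. When there are more ranks than KV heads, several ranks hold a copy of the same head, so the number of query heads per local KV head changes as well.

```rust
/// Hypothetical description of the KV heads owned by one rank.
#[derive(Debug, PartialEq)]
struct KvShard {
    /// Index of the first global KV head held by this rank.
    first_head: usize,
    /// Number of consecutive KV heads held by this rank.
    num_heads: usize,
}

/// Sketch: which KV heads does `rank` hold?
/// Assumes the larger of `total_kv_heads` and `world_size` is a multiple of the smaller.
fn kv_shard_sketch(total_kv_heads: usize, rank: usize, world_size: usize) -> KvShard {
    if world_size > total_kv_heads {
        // Replication: `replicas` consecutive ranks share a copy of one head.
        let replicas = world_size / total_kv_heads;
        KvShard { first_head: rank / replicas, num_heads: 1 }
    } else {
        // Plain sharding: each rank gets a contiguous block of heads.
        let per_rank = total_kv_heads / world_size;
        KvShard { first_head: rank * per_rank, num_heads: per_rank }
    }
}

/// Sketch: query heads per KV head as seen locally on one rank.
/// Assumes `num_attention_heads % world_size == 0`.
fn n_kv_groups_sketch(
    total_kv_heads: usize,
    num_attention_heads: usize,
    world_size: usize,
) -> usize {
    let local_q_heads = num_attention_heads / world_size;
    let local_kv_heads = if world_size > total_kv_heads {
        1 // replicated: one (copied) KV head per rank
    } else {
        total_kv_heads / world_size
    };
    local_q_heads / local_kv_heads
}
```

For example, with 2 KV heads, 16 attention heads, and 4 ranks, this sketch places head 0 on ranks 0 and 1 and head 1 on ranks 2 and 3, and each rank sees 4 query heads per KV head instead of the unsharded ratio of 8. Using the unsharded ratio together with a replicated shard would mismatch the local head counts, which is presumably why the two functions are meant to be used in tandem.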