Function fp8_blockwise_quantize

pub fn fp8_blockwise_quantize(
    input: &Tensor,
    weight_block_size: Vec<usize>,
) -> Result<(Tensor, Tensor)>

FP8 blockwise quantize. Splits the input tensor into blocks of shape weight_block_size and scales each block independently before casting it to FP8.

  • Expects input to be f32, f16, or bf16.
  • weight_block_size gives the block dimensions; one scale is computed per block.
  • Returns a tuple of (quantized_weight, scales).
  • quantized_weight is fp8.
  • scales is f32.
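A minimal usage sketch, with assumptions: the function is brought into scope from this crate (the import path shown is hypothetical), candle_core supplies the Tensor/Device/DType types, and the [128, 128] block size is illustrative (a common choice for DeepSeek-style FP8 checkpoints):

use candle_core::{DType, Device, Result, Tensor};
// Assumed import path; adjust to wherever this crate exports the function.
// use mistralrs_quant::fp8_blockwise_quantize;

fn main() -> Result<()> {
    let device = Device::Cpu; // FP8 support may vary by device/backend.
    // Random f32 weight matrix; dimensions are multiples of the block size.
    let weight = Tensor::randn(0f32, 1f32, (256, 512), &device)?;
    // Quantize with one scale per 128x128 block.
    let (qweight, scales) = fp8_blockwise_quantize(&weight, vec![128, 128])?;
    // quantized_weight is fp8; scales is f32, one entry per block.
    assert_eq!(scales.dtype(), DType::F32);
    println!("qweight: {:?}, scales: {:?}", qweight.shape(), scales.shape());
    Ok(())
}

Storing the per-block f32 scales alongside the FP8 values keeps the quantized weight compact while preserving per-block dynamic range; dequantization multiplies each block by its scale.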