Function fp8_vector_quantize

Source
pub fn fp8_vector_quantize(input: &Tensor) -> Result<(Tensor, Tensor)>

Quantizes a tensor to FP8 using per-vector scales.

  • Expects input to be f32, f16, or bf16
  • Returns a tuple of (quantized_weight, scales)
  • quantized_weight is fp8
  • scales is f32
  • Each scale corresponds to one contiguous vector of 128 elements
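
The per-vector scheme above can be sketched in plain Rust. This is a hypothetical illustration, not the library's implementation: it assumes FP8 E4M3 (maximum representable magnitude 448.0) and derives each block's scale from its largest absolute value; the real function operates on `Tensor`s and packs results into actual fp8 storage.

```rust
// Assumed fp8 format: E4M3, whose largest finite magnitude is 448.0.
const FP8_E4M3_MAX: f32 = 448.0;
const BLOCK: usize = 128;

/// Hypothetical sketch of per-vector FP8 quantization.
/// Returns (quantized_values, scales). Quantized values are kept as f32
/// here for illustration; a real kernel would round and pack them into
/// fp8 bytes.
fn fp8_vector_quantize_sketch(input: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(input.len() % BLOCK == 0, "length must be a multiple of 128");
    let mut quantized = Vec::with_capacity(input.len());
    let mut scales = Vec::with_capacity(input.len() / BLOCK);
    for chunk in input.chunks(BLOCK) {
        // One scale per 128-element vector: map the block's largest
        // magnitude onto the top of the fp8 range.
        let max_abs = chunk.iter().fold(0f32, |m, &v| m.max(v.abs()));
        let scale = if max_abs > 0.0 { max_abs / FP8_E4M3_MAX } else { 1.0 };
        scales.push(scale);
        for &v in chunk {
            // Divide by the scale; dequantization multiplies it back.
            quantized.push((v / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX));
        }
    }
    (quantized, scales)
}

fn main() {
    // 256 inputs -> two 128-element vectors -> two f32 scales.
    let input: Vec<f32> = (0..256).map(|i| i as f32 / 10.0).collect();
    let (q, s) = fp8_vector_quantize_sketch(&input);
    println!("{} quantized values, {} scales", q.len(), s.len());
}
```

Dividing by a per-block scale (rather than one global scale) keeps small-magnitude blocks from being crushed to zero by an outlier elsewhere in the tensor, which is the usual motivation for vector-wise fp8 schemes.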