Summary
Add support for NVFP4-quantized Linear layers in phyai.
Motivation
NVFP4 is NVIDIA’s FP4 format for Blackwell Tensor Cores, targeting efficient low-precision inference with 4-bit values, 16-element block scaling, and FP8 scale factors. Since LLM inference is dominated by Linear / GEMM workloads, supporting NVFP4 linear layers can reduce weight memory footprint and memory-bandwidth pressure, and improve throughput on compatible NVIDIA Blackwell hardware.
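To make the block-scaling idea concrete, here is a minimal NumPy sketch of a quantize-then-dequantize round trip in the NVFP4 style. It is an illustration only, not phyai code and not a kernel: the function name `fake_quantize_nvfp4` is hypothetical, the per-block scales are kept in float32 rather than rounded to FP8, and any additional per-tensor scaling is left out.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # elements sharing one scale factor


def fake_quantize_nvfp4(w):
    """Quantize and dequantize a 1-D array whose length is a multiple of 16.

    Hypothetical helper for illustration; scales stay in float32 here,
    whereas the real format stores them in FP8.
    """
    w = np.asarray(w, dtype=np.float32)
    blocks = w.reshape(-1, BLOCK)

    # One scale per block, chosen so the block's largest magnitude maps to 6.0.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0).astype(np.float32)

    # Snap each scaled magnitude to the nearest representable FP4 value.
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]

    # Return the dequantized values and the per-block scales.
    return (q * scale).reshape(w.shape), scale.reshape(-1)


if __name__ == "__main__":
    weights = np.linspace(-1.0, 1.0, 32)
    approx, scales = fake_quantize_nvfp4(weights)
    print("max abs error:", np.abs(approx - weights).max())
    print("scales per block:", scales)
```

Each group of 16 values costs 16 × 4 bits plus one scale factor, which is where the memory and bandwidth savings over 16-bit weights come from.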
References