Fast NF4 Dequantization Kernels for Large Language Model Inference
arXiv:2604.02556v1 Announce Type: new Abstract: Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. …
Xiangbo Qi, Chaoyi Jiang, Murali Annavaram
13 views