A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU

arXiv:2602.17693v1 Announce Type: cross Abstract: Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
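To make the "weight-only compression" end of the spectrum concrete, the following is a minimal NumPy sketch of symmetric per-group INT4 weight quantization, the basic mechanism underlying weight-only schemes like AWQ and GPTQ. It is an illustration of the general technique, not the paper's or any library's implementation; the group size of 128 is a common but assumed choice.

```python
import numpy as np

def quantize_weights_int4(w, group_size=128):
    """Symmetric per-group INT4 quantization of a weight matrix.

    Each row is split into groups of `group_size` columns; every group
    gets its own scale, so an outlier only distorts its local group.
    Returns integer codes in [-8, 7] and the per-group scales.
    """
    out_features, in_features = w.shape
    w = w.reshape(out_features, in_features // group_size, group_size)
    scales = np.abs(w).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

def dequantize_int4(q, scales, group_size=128):
    """Reconstruct an approximate float matrix from codes and scales."""
    out_features, in_features = q.shape
    q = q.reshape(out_features, in_features // group_size, group_size)
    return (q * scales[..., None]).reshape(out_features, in_features)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)
q, s = quantize_weights_int4(w)
w_hat = dequantize_int4(q, s)
max_err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

Because only weights are quantized, activations stay in floating point, which is one reason the paper finds this regime more robust than joint weight-activation quantization.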

Executive Summary

This article presents a case study on the effectiveness of Post-Training Quantization (PTQ) on Ascend NPU, a crucial step for efficient model deployment. The authors evaluate four distinct PTQ algorithms on reasoning-oriented models, revealing significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU. The study also shows that, even with optimized INT8 kernels, the runtime overhead of dynamic quantization currently limits end-to-end acceleration. These findings provide a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.

Key Points

  • Significant platform sensitivity of PTQ algorithms on Ascend NPU
  • Viability of 4-bit weight-only quantization for larger models
  • Limitations of aggressive 4-bit weight-activation schemes on the NPU
  • Impact of dynamic quantization overheads on end-to-end acceleration
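The weight-activation schemes named in the third point depend on calibration transforms that condition activations before quantization. As background, here is a minimal NumPy sketch of SmoothQuant-style smoothing, which migrates activation outliers into the weights via a per-channel rescaling; the `alpha=0.5` migration strength and the toy outlier channel are illustrative assumptions, not values from the paper.

```python
import numpy as np

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    """SmoothQuant-style per-channel smoothing factors.

    Activations are divided by s and the matching weight rows are
    multiplied by s, so Y = (X / s) @ (diag(s) @ W) is mathematically
    unchanged while activation outliers are partly moved into W.
    """
    s = act_absmax ** alpha / w_absmax ** (1 - alpha)
    return np.clip(s, 1e-5, None)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
x[:, 3] *= 50.0                       # simulate one outlier channel
w = rng.normal(size=(8, 16))

s = smooth_scales(np.abs(x).max(axis=0), np.abs(w).max(axis=1))
x_s, w_s = x / s, w * s[:, None]

# The product is preserved; the activation dynamic range shrinks.
assert np.allclose(x @ w, x_s @ w_s)
```

Because the scales `s` come from calibration statistics, a mismatch between calibration data and the NPU's numerics is one plausible source of the layer-wise instability the paper reports for aggressive 4-bit weight-activation settings.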

Merits

Robust experimental design

The authors conduct a comprehensive evaluation of four distinct PTQ algorithms, providing a robust experimental design that sheds light on the platform sensitivity of PTQ on Ascend NPU.

Real-world deployment demonstration

The authors demonstrate a real-world INT8 deployment, showing that PTQ is practically feasible on Ascend NPU while also revealing that, although optimized kernels reduce latency, the runtime cost of dynamic quantization currently limits end-to-end speedup.
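The overhead in question is easy to see in code. Below is a minimal NumPy sketch of dynamic per-token INT8 activation quantization; it is a generic illustration of the technique, not the deployed kernel. The point is that the scale for each token is computed at inference time, so every quantized matmul is preceded by an extra reduction-and-rounding pass over the activations.

```python
import numpy as np

def dynamic_quantize_int8(x):
    """Dynamic per-token symmetric INT8 quantization.

    Unlike static quantization, each token's (row's) scale is derived
    from its own activations at runtime: an abs-max reduction plus a
    round/clip pass runs before every quantized matmul, which is the
    per-call overhead that can eat into kernel-level speedups.
    """
    scales = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero tokens
    q = np.clip(np.round(x / scales), -128, 127).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64)).astype(np.float32)
q, s = dynamic_quantize_int8(x)
x_hat = q.astype(np.float32) * s  # dequantized approximation
```

Static quantization would precompute `scales` from calibration data and skip the runtime reduction, but at the cost of the calibration sensitivity discussed above.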

Demerits

Limited scope of evaluation

The study focuses on a specific set of models and algorithms, which may not be representative of the broader landscape of PTQ on Ascend NPU.

Lack of comparison to GPU architectures

The study does not provide a direct comparison to PTQ on GPU architectures, making it difficult to assess the relative performance of Ascend NPU.

Expert Commentary

This study provides a timely evaluation of PTQ on Ascend NPU, shedding light on the platform sensitivity of PTQ algorithms and on the runtime cost of dynamic quantization. However, the study's limited scope of evaluation and lack of comparison to GPU architectures restrict its generalizability. Nonetheless, the findings offer valuable insights for practitioners and policymakers, underscoring the need for further research and development of PTQ algorithms and techniques suited to NPU deployment.

Recommendations

  • Future studies should aim to expand the scope of evaluation to include a broader range of models and algorithms, as well as a direct comparison to PTQ on GPU architectures.
  • Researchers should prioritize the development of PTQ algorithms and techniques that can effectively mitigate the limitations of dynamic quantization overheads and facilitate efficient model deployment on Ascend NPU.

Sources