ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs
arXiv:2602.17698v1 Announce Type: cross

Abstract: Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions either use irregular, fine-grained mixed precision with high runtime overhead or rely on heuristic, highly constrained precision allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (by up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (by up to +13%) in the ultra-low-bit regime, without adding runtime overhead.
Executive Summary
This paper proposes ScaleBITS, a mixed-precision quantization framework for large language models (LLMs) that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. ScaleBITS introduces a hardware-aligned, block-wise weight partitioning scheme powered by bi-directional channel reordering, and formulates global bitwidth allocation as a constrained optimization problem solved by a scalable greedy approximation. Experiments demonstrate that ScaleBITS significantly improves over uniform-precision quantization and outperforms state-of-the-art sensitivity-aware baselines in the ultra-low-bit regime, without adding runtime overhead. This addresses a critical challenge in post-training weight quantization and has significant implications for deploying LLMs in resource-constrained environments.
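To ground the discussion, the snippet below shows the basic primitive any bitwidth-allocation scheme is built on: quantizing one weight block at a chosen precision and measuring the error it introduces. The abstract does not specify ScaleBITS's quantizer, so symmetric round-to-nearest with absmax scaling is used here purely as a stand-in assumption.

```python
import numpy as np

def quantize_block(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization of one weight block at `bits` bits.
    Returns the dequantized block so the introduced error can be measured.
    NOTE: absmax scaling is an assumption, not ScaleBITS's actual quantizer."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for signed 4-bit
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def block_error(w: np.ndarray, bits: int) -> float:
    """Squared reconstruction error of quantizing `w` at `bits` bits."""
    return float(np.sum((w - quantize_block(w, bits)) ** 2))
```

Evaluating block_error for every block at each candidate bitwidth yields the error table that the allocation sketch further below consumes.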
Key Points
- ▸ ScaleBITS is a mixed-precision post-training quantization framework for LLMs
- ▸ The framework enables automated, fine-grained bitwidth allocation under a memory budget
- ▸ ScaleBITS introduces a hardware-aligned, block-wise weight partitioning scheme driven by bi-directional channel reordering (a simplified sketch follows this list)
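As referenced in the last point above, here is a minimal sketch of sensitivity-driven channel reordering followed by fixed-size block partitioning. The per-channel L2-norm proxy, the ascending sort, and the block size of 128 are illustrative assumptions; the paper's actual sensitivity metric and its bi-directional reordering rule are not reproduced here.

```python
import numpy as np

def reorder_and_partition(W: np.ndarray, block_cols: int = 128):
    """Sort input channels by a sensitivity proxy, then cut the weight matrix
    into fixed-size column blocks so similarly sensitive channels share a block.
    The L2-norm proxy and ascending sort are stand-ins for the paper's
    sensitivity analysis and bi-directional reordering rule."""
    sensitivity = np.linalg.norm(W, axis=0)   # one score per input channel
    perm = np.argsort(sensitivity)            # ascending sensitivity
    W_reordered = W[:, perm]
    # Fixed-size blocks keep the memory layout regular (hardware-aligned);
    # the permutation can be computed once offline and folded into the
    # layer's input ordering, avoiding a runtime gather.
    blocks = [W_reordered[:, i:i + block_cols]
              for i in range(0, W_reordered.shape[1], block_cols)]
    return blocks, perm
```

Grouping channels of similar sensitivity lets a single per-block bitwidth fit all of a block's channels reasonably well, which is what makes coarse, hardware-friendly blocks competitive with irregular fine-grained schemes.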
Merits
Strength in Scalability
ScaleBITS develops a scalable approximation to the greedy allocation algorithm, enabling end-to-end principled bitwidth assignment, and demonstrates significant improvements over existing solutions in the ultra-low-bit regime.
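The abstract describes the allocator only as a scalable approximation to a greedy algorithm, so the sketch below shows the classical greedy heuristic it presumably approximates: start every block at the lowest candidate bitwidth and repeatedly apply the upgrade with the best error reduction per extra bit of memory, using a heap to keep each step cheap. The candidate bitwidths and the error-table format are assumptions.

```python
import heapq

def allocate_bitwidths(block_errors, block_sizes, budget_bits,
                       candidates=(2, 3, 4, 8)):
    """Greedy bitwidth allocation under a total memory budget.

    block_errors[i][b]: precomputed error of block i quantized at b bits
    (e.g. from block_error above); block_sizes[i]: its element count.
    This is the textbook greedy knapsack heuristic, not ScaleBITS's exact
    algorithm, which the abstract describes only as a scalable approximation.
    """
    n = len(block_sizes)
    level = [0] * n                                     # index into `candidates`
    used = sum(candidates[0] * s for s in block_sizes)  # bits spent so far
    heap = []
    for i in range(n):                                  # queue each block's first upgrade
        gain = block_errors[i][candidates[0]] - block_errors[i][candidates[1]]
        cost = (candidates[1] - candidates[0]) * block_sizes[i]
        heapq.heappush(heap, (-gain / cost, i))         # max-heap via negation
    while heap:
        _, i = heapq.heappop(heap)
        lo, hi = candidates[level[i]], candidates[level[i] + 1]
        cost = (hi - lo) * block_sizes[i]
        if used + cost > budget_bits:
            continue                                    # this upgrade would break the budget
        used += cost
        level[i] += 1
        if level[i] + 1 < len(candidates):              # queue the block's next upgrade
            nxt = candidates[level[i] + 1]
            gain = block_errors[i][hi] - block_errors[i][nxt]
            heapq.heappush(heap, (-gain / ((nxt - hi) * block_sizes[i]), i))
    return [candidates[l] for l in level]
```

For an average budget of, say, 3.5 bits over N total weights, budget_bits would be 3.5 * N; blocks whose upgrades never pay off simply remain at low precision.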
Hardware Efficiency
The framework preserves hardware efficiency while allocating bitwidths at fine granularity, which is essential for deploying LLMs in resource-constrained environments: irregular mixed-precision layouts typically incur high runtime overhead, whereas hardware-aligned blocks keep memory access regular.
Demerits
Limited Generalizability
The evaluation of ScaleBITS is primarily focused on LLMs, and its generalizability to other deep learning models or applications is unclear.
Expert Commentary
The paper presents a well-structured and well-executed approach to mixed-precision quantization for LLMs. The hardware-aligned, block-wise weight partitioning scheme and the formulation of global bitwidth allocation as a constrained optimization problem are particularly noteworthy. As noted above, however, the evaluation focuses on LLMs, leaving generalizability to other model families and applications open, and the method assumes a fixed memory budget, which may not hold in all deployment scenarios. Nevertheless, ScaleBITS demonstrates significant improvements over existing solutions and can contribute to the efficient deployment of LLMs in resource-constrained environments.
Recommendations
- ✓ Future research should investigate the generalizability of ScaleBITS to other deep learning models or applications.
- ✓ The development of mixed-precision quantization techniques should account for real-world deployment constraints, such as variable memory budgets and hardware limitations.