HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models
arXiv:2602.13710v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models enable instruction-following embodied control, but their large compute and memory footprints hinder deployment on resource-constrained robots and edge platforms. While reducing weights to 1-bit precision through binarization can greatly improve efficiency, existing methods fail to narrow the distribution gap between binarized and full-precision weights, causing quantization errors to accumulate under long-horizon closed-loop execution and severely degrade actions. To fill this gap, we propose HBVLA, a VLA-tailored binarization framework. First, we use a policy-aware enhanced Hessian to identify weights that are truly critical for action generation. Then, we employ a sparse orthogonal transform for non-salient weights to induce a low-entropy intermediate state. Finally, we quantize both salient and non-salient weights in the Haar domain with group-wise 1-bit quantization. We evaluate our approach on different VLAs: on LIBERO, quantized OpenVLA-OFT retains 92.2% of full-precision performance; on SimplerEnv, quantized CogAct retains 93.6%, significantly outperforming state-of-the-art binarization methods. We further validate our method on a real-world evaluation suite, and the results show that HBVLA incurs only marginal success-rate degradation compared to the full-precision model, demonstrating robust deployability under tight hardware constraints. Our work provides a practical foundation for ultra-low-bit quantization of VLAs, enabling more reliable deployment on hardware-limited robotic platforms.
Executive Summary
The article 'HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models' addresses the challenge of deploying Vision-Language-Action (VLA) models on resource-constrained platforms by proposing a novel binarization framework. The authors introduce HBVLA, which employs a policy-aware enhanced Hessian to identify critical weights, a sparse orthogonal transform for non-salient weights, and group-wise 1-bit quantization in the Haar domain. Evaluations on LIBERO and SimplerEnv demonstrate significant performance retention compared to full-precision models, outperforming state-of-the-art binarization methods. The study highlights the potential for reliable deployment of VLAs on hardware-limited robotic platforms.
Key Points
- ▸ Introduction of HBVLA framework for 1-bit quantization of VLA models.
- ▸ Use of policy-aware enhanced Hessian to identify critical weights.
- ▸ Employment of sparse orthogonal transform for non-salient weights.
- ▸ Group-wise 1-bit quantization in the Haar domain.
- ▸ High performance retention in evaluations on LIBERO and SimplerEnv.
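The pipeline sketched in the key points can be illustrated in NumPy. This is a minimal sketch under simplifying assumptions, not the authors' implementation: `hess_diag` stands in for the paper's policy-aware enhanced Hessian (here a generic diagonal-Hessian estimate), saliency is scored per row rather than per weight, and a plain orthonormal Haar matrix replaces the paper's sparse orthogonal transform. The helper names (`haar_matrix`, `binarize_groupwise`, `hbvla_sketch`) are hypothetical.

```python
import numpy as np

def haar_matrix(n):
    # Orthonormal Haar matrix of size n (n must be a power of two).
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])               # averaging rows
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])  # differencing rows
    return np.vstack([top, bot]) / np.sqrt(2.0)

def binarize_groupwise(w, group_size):
    # Group-wise 1-bit quantization: sign bits plus one scale per group.
    # The mean absolute value is the L2-optimal scale for sign codes.
    shape = w.shape
    g = w.reshape(-1, group_size)
    scale = np.abs(g).mean(axis=1, keepdims=True)
    return (np.sign(g) * scale).reshape(shape)

def hbvla_sketch(W, hess_diag, group_size=4, salient_frac=0.25):
    # W: (rows, n) weight matrix, n a power of two.
    # hess_diag: same shape, diagonal-Hessian estimate of d2L/dw2.
    rows, n = W.shape
    H = haar_matrix(n)
    # 1) Hessian-weighted saliency: quantization error on high-score
    #    rows hurts the (action-generation) loss most.
    row_score = (hess_diag * W ** 2).sum(axis=1)
    k = max(1, int(salient_frac * rows))
    salient = np.zeros(rows, dtype=bool)
    salient[np.argsort(row_score)[-k:]] = True
    Wq = np.empty_like(W)
    # 2) Salient rows: binarized directly in this sketch.
    Wq[salient] = binarize_groupwise(W[salient], group_size)
    # 3) Non-salient rows: orthogonal Haar transform to flatten the
    #    distribution, binarize in the transform domain, then invert
    #    (the inverse of an orthonormal H is its transpose).
    T = W[~salient] @ H.T
    Wq[~salient] = binarize_groupwise(T, group_size) @ H
    return Wq
```

After quantization, each group of `group_size` weights is stored as sign bits plus a single floating-point scale, which is where the memory savings of binarization come from; the orthogonal transform costs one extra matrix multiply at dequantization time.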
Merits
Innovative Approach
The HBVLA framework introduces a novel method for 1-bit quantization tailored specifically for VLA models, addressing a critical gap in the field.
High Performance Retention
The framework demonstrates significant performance retention compared to full-precision models, making it highly effective for resource-constrained platforms.
Comprehensive Evaluation
The study includes evaluations on multiple platforms and real-world scenarios, providing robust validation of the method's effectiveness.
Demerits
Complexity
The method involves multiple steps and transformations, which may increase the complexity of implementation and deployment.
Limited Scope
The evaluations are focused on specific VLA models and environments, which may limit the generalizability of the findings.
Potential Overhead
The use of Hessian and orthogonal transforms may introduce computational overhead, which could offset some of the benefits of 1-bit quantization.
Expert Commentary
The article presents a significant advancement in model quantization tailored for Vision-Language-Action models. The HBVLA framework addresses a critical challenge in deploying these models on resource-constrained platforms by combining policy-aware weight identification, sparse orthogonal transforms, and group-wise 1-bit quantization. The evaluations are concrete: quantized OpenVLA-OFT retains 92.2% of full-precision performance on LIBERO, and quantized CogAct retains 93.6% on SimplerEnv, outperforming state-of-the-art binarization methods. However, the multi-stage method adds implementation complexity, and the Hessian estimation and orthogonal transforms may introduce computational overhead. The focus on specific VLA models and environments may also limit generalizability. Nonetheless, the work provides a robust foundation for further research and practical applications in edge computing and robotics. The implications for policy and practice are substantial, highlighting the need for continued investment in model compression techniques to enable broader adoption of advanced AI technologies in resource-limited settings.
Recommendations
- ✓ Further research should explore the generalizability of the HBVLA framework to other types of machine learning models and environments.
- ✓ Future studies should investigate methods to reduce the computational overhead associated with the Hessian and orthogonal transforms, making the framework more efficient for real-world deployment.