HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

arXiv:2602.23699v1 Announce Type: cross Abstract: The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% visual tokens while matching the original performance and accelerating training by 1.72 times. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.

Executive Summary

This article summarizes HiDrop, a framework that addresses the quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs). By aligning token pruning with the hierarchical function of MLLM layers, HiDrop achieves substantial efficiency gains. Its key components are Late Injection, Concave Pyramid Pruning with an Early Exit mechanism, and FlashAttention-compatible token selection, complemented by persistent positional encoding and parallel decoupling of vision computation. In the reported experiments, HiDrop compresses about 90% of visual tokens while matching the original model's performance and accelerating training by 1.72x. These results advance efficient MLLM training and inference and shed light on the hierarchical nature of multimodal fusion.

Key Points

  • HiDrop addresses the quadratic computational cost of processing vision tokens in MLLMs
  • Introduces Late Injection, Concave Pyramid Pruning with an Early Exit mechanism, and overhead-free dynamic token selection
  • Compresses about 90% of visual tokens while matching the original model's performance and accelerating training by 1.72x
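The paper does not publish its exact pruning schedule in this abstract, but the "Concave Pyramid" idea can be sketched as a concave keep-ratio curve over the layers: zero vision tokens before the injection layer (Late Injection), then a concave decay from full retention toward a small final keep ratio in the deep layers. The quarter-cosine curve and parameter names below are illustrative assumptions, not the paper's learned schedule.

```python
import math

def concave_keep_schedule(num_layers, inject_layer, final_keep=0.1):
    """Illustrative concave pruning schedule (NOT the paper's learned one).

    Returns, per layer, the fraction of vision tokens kept. Shallow layers
    (before `inject_layer`) carry no vision tokens; from the injection layer
    onward, a quarter-cosine curve (concave and decreasing on [0, pi/2])
    keeps most tokens through the middle layers and prunes aggressively
    toward the deep layers, ending near `final_keep`.
    """
    span = num_layers - 1 - inject_layer
    ratios = []
    for layer in range(num_layers):
        if layer < inject_layer:
            ratios.append(0.0)  # Late Injection: vision tokens not yet present
        else:
            t = (layer - inject_layer) / span  # progress in [0, 1]
            ratios.append(final_keep + (1 - final_keep) * math.cos(math.pi / 2 * t))
    return ratios
```

A concave curve front-loads retention: pruning is gentle where cross-modal fusion is still active and steepest only in the deepest layers.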

Merits

Hierarchy-Aligned Pruning

HiDrop's ability to align token pruning with the hierarchical function of MLLM layers enables efficient pruning while preserving model performance.
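The abstract says the pruning process is optimized via an inter-layer similarity measure; one plausible reading is that when vision-token representations barely change between consecutive layers, the remaining vision computation can be skipped (Early Exit). The sketch below uses cosine similarity between per-layer mean vision-token states; the specific measure, threshold, and exit rule are assumptions for illustration.

```python
def cosine_similarity(u, v):
    # Plain-Python cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def early_exit_layer(layer_states, threshold=0.99):
    """Hypothetical early-exit rule: return the first layer whose mean
    vision-token state is nearly identical to the previous layer's,
    i.e. further vision computation would add little. `layer_states` is a
    list of per-layer mean vision-token vectors. Falls back to the last
    layer if the threshold is never reached."""
    for i in range(1, len(layer_states)):
        if cosine_similarity(layer_states[i - 1], layer_states[i]) >= threshold:
            return i
    return len(layer_states) - 1
```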

Innovative Approach

The Late Injection and Concave Pyramid Pruning with Early Exit mechanisms provide a novel and effective solution to the computational cost challenge in MLLMs.
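Why Late Injection saves compute is easy to see from a sequence-length accounting: shallow layers attend over text tokens only, and vision tokens enter (and are then progressively pruned) later. The toy functions below are purely illustrative bookkeeping, not a model; self-attention cost is approximated as the square of sequence length per layer.

```python
def sequence_lengths(n_text, n_vision, keep_ratios):
    """Per-layer sequence length under Late Injection plus progressive
    pruning. keep_ratios[l] is the fraction of vision tokens retained at
    layer l (0.0 before the injection layer)."""
    return [n_text + int(n_vision * r) for r in keep_ratios]

def attention_cost(lengths):
    # Self-attention FLOPs scale with the square of the sequence length
    return sum(n * n for n in lengths)
```

With 10 text and 100 vision tokens over 4 layers, injecting at layer 2 and pruning half the vision tokens at layer 3 cuts the quadratic attention cost to a fraction of the inject-at-layer-0 baseline.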

Efficiency Improvements

HiDrop's FlashAttention-compatible token selection and parallel decoupling of vision computation eliminate the hidden overhead that often erodes the benefits of dynamic token reduction, accelerating training by 1.72x.
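The abstract also mentions a differentiable top-k operator, which lets the pruning decision receive gradients during training. Several relaxations exist; the sigmoid-based one below is one common choice and only a sketch, since the paper's exact operator is not specified in the abstract. As the temperature `tau` approaches zero, the soft mask approaches a hard top-k selection.

```python
import math

def soft_topk_mask(scores, k, tau=0.1):
    """Sigmoid relaxation of a hard top-k mask (illustrative, not
    necessarily HiDrop's operator). Tokens scoring above the midpoint
    between the k-th and (k+1)-th largest scores get weights near 1,
    the rest near 0; smaller `tau` gives a sharper, more top-k-like mask."""
    ranked = sorted(scores, reverse=True)
    pivot_hi = ranked[k - 1]
    pivot_lo = ranked[k] if k < len(scores) else pivot_hi
    threshold = (pivot_hi + pivot_lo) / 2
    return [1 / (1 + math.exp(-(s - threshold) / tau)) for s in scores]
```

Because every output weight is a smooth function of the scores, gradients flow to the token-importance scorer, which a hard `topk` would block.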

Demerits

Complexity

HiDrop combines several interacting mechanisms (late injection, a dynamic pruning schedule, early exit), and this added complexity, along with any residual overhead of dynamic token reduction on hardware the authors did not test, may complicate integration into existing MLLM pipelines.

Scalability

Whether HiDrop's efficiency gains and accuracy retention hold for larger MLLMs and broader task suites remains to be investigated.

Expert Commentary

HiDrop is a meaningful advance on the computational cost problem in MLLMs. By introducing Late Injection and Concave Pyramid Pruning with an Early Exit mechanism, it achieves strong efficiency gains while preserving model performance, and its persistent positional encoding and parallel decoupling of vision computation tackle the hidden overheads that often erode the benefits of dynamic token reduction. Implementation complexity and scalability to larger models remain open questions, but the reported results and the released code make HiDrop a promising direction for efficient MLLM research.

Recommendations

  • Further investigation into HiDrop's scalability to larger MLLMs and broader benchmarks is warranted.
  • Before real-world deployment, practitioners should assess the framework's implementation complexity and verify that dynamic token reduction remains overhead-free on their hardware.
