Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

arXiv:2602.12618v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.

Executive Summary

The article introduces Attention-Driven Self-Compression (ADSC), a method that reduces the computational cost of Multimodal Large Language Models (MLLMs) by progressively downsampling vision tokens inside the LLM itself. Rather than scoring tokens before the LLM, or modifying attention with heuristics that break FlashAttention compatibility, ADSC inserts uniform downsampling bottlenecks at selected layers and lets the model's own attention reorganize information into the surviving tokens, with no auxiliary modules required. Applied to LLaVA-1.5, ADSC cuts FLOPs by 53.7% and peak KV-cache memory by 56.7% while preserving 98.2% of the original model's performance, and it outperforms existing pruning techniques, especially under high compression ratios.
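The mechanism described above, uniform downsampling of the vision-token span at selected layers while text tokens pass through untouched, can be sketched in a few lines. This is a minimal illustration only; the function names, the stride-based schedule, and the stubbed layer loop are assumptions for exposition, not the paper's implementation:

```python
def downsample_vision_tokens(tokens, vision_start, vision_end, stride):
    """Keep every `stride`-th vision token; text tokens are left intact."""
    kept = tokens[vision_start:vision_end:stride]
    return tokens[:vision_start] + kept + tokens[vision_end:]

def forward_with_bottlenecks(tokens, vision_start, num_vision, num_layers,
                             bottleneck_layers, stride=2):
    """Run a (stubbed) layer stack, shrinking the vision span at bottleneck layers."""
    vision_end = vision_start + num_vision
    for layer in range(num_layers):
        # ... a normal transformer layer (attention + MLP) would run here ...
        if layer in bottleneck_layers:
            tokens = downsample_vision_tokens(tokens, vision_start, vision_end, stride)
            # ceil-divide the span length to get the new vision-token count
            vision_end = vision_start + (vision_end - vision_start + stride - 1) // stride
    return tokens, vision_end - vision_start
```

For example, 16 vision tokens halved at two bottleneck layers leave 4 tokens for all deeper layers, which is where the FLOP and KV-cache savings come from, since every subsequent layer attends over a shorter sequence.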

Key Points

  • ADSC reduces vision tokens using the LLM's attention mechanism, avoiding the need for auxiliary modules or attention modification.
  • The method is compatible with FlashAttention and, because it operates inside the LLM, is applicable regardless of the encoder-projector design.
  • ADSC achieves a 53.7% reduction in FLOPs and a 56.7% reduction in peak KV-cache memory while preserving 98.2% of the original model's performance.
  • The method outperforms prior pruning approaches in both efficiency and accuracy, particularly under high compression ratios.
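The savings reported above follow from later layers processing fewer tokens. A rough back-of-envelope calculation makes this concrete; the 576 vision tokens match LLaVA-1.5, but the text-token count, 32-layer depth, and halving schedule below are hypothetical, and FLOPs are approximated as linear in sequence length, so the result illustrates the trend rather than reproducing the paper's exact 53.7%:

```python
def avg_tokens_per_layer(num_vision, num_text, num_layers, schedule):
    """Average sequence length across layers, given a {layer: keep_fraction} schedule."""
    total, vision = 0, num_vision
    for layer in range(num_layers):
        if layer in schedule:
            vision = int(vision * schedule[layer])  # downsample at this bottleneck
        total += vision + num_text
    return total / num_layers

# Hypothetical schedule: halve the vision tokens at layers 8, 16, and 24.
base = avg_tokens_per_layer(576, 64, 32, {})
compressed = avg_tokens_per_layer(576, 64, 32, {8: 0.5, 16: 0.5, 24: 0.5})
savings = 1 - compressed / base  # roughly 0.48 under these assumptions
```

Peak KV-cache memory scales with the number of cached tokens per layer, so the same per-layer token counts drive the memory reduction as well.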

Merits

Innovative Approach

ADSC treats the LLM itself as the guide for compression, relying on its attention mechanism rather than external scoring heuristics. This keeps the method compatible with FlashAttention and avoids the auxiliary modules that prior approaches require.

Broad Applicability

The method is broadly applicable across different LLM architectures and does not require modifications to the attention mechanism or auxiliary modules.

High Performance

ADSC maintains high performance levels while significantly reducing computational costs, making it a practical solution for efficient MLLMs.

Demerits

Unvalidated Generalization

While ADSC shows promise, its effectiveness may vary across different types of MLLMs and tasks, requiring further validation in diverse scenarios.

Implementation Complexity

Deploying ADSC may require careful tuning of the downsampling schedule, namely which layers to place bottlenecks at and how aggressively to reduce tokens, to work effectively across different LLM architectures and datasets.

Expert Commentary

The introduction of Attention-Driven Self-Compression (ADSC) represents a significant advancement in the field of model compression for Multimodal Large Language Models (MLLMs). By leveraging the LLM's attention mechanism to progressively downsample vision tokens, ADSC avoids the pitfalls of prior methods that rely on heuristics or auxiliary modules. The method's compatibility with FlashAttention and its broad applicability across different LLM architectures make it a versatile and practical solution. The substantial reductions in FLOPs and memory usage, coupled with the preservation of high performance, highlight the potential of ADSC to become a standard technique in the optimization of MLLMs. However, further research is needed to validate its effectiveness across diverse scenarios and ensure its robustness in various applications. The method's innovative approach and impressive results position it as a key contributor to the ongoing efforts to make large language models more efficient and accessible.

Recommendations

  • Further validation of ADSC across a wider range of MLLMs and tasks to ensure its generalizability and robustness.
  • Exploration of potential optimizations and tuning strategies to enhance the method's performance and compatibility with different LLM architectures.
