Academic

BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection

arXiv:2603.19635v1 Announce Type: new Abstract: The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at https://cslikai.cn/BEAVER/.

Executive Summary

The article proposes a novel training-free framework called BEAVER, which addresses the bottlenecks in inference latency and information utilization in large language models (LLMs) by shifting compression from linear token removal to structure-aware hierarchical selection. BEAVER employs dual-path pooling and a hybrid planner to maximize hardware parallelism and preserve discourse integrity. The framework achieves comparable performance to state-of-the-art methods on four long-context benchmarks and reduces latency by 26.4x on 128k contexts. The authors claim that BEAVER offers a scalable solution for high-throughput applications, and the code is available for public access. While the article demonstrates the efficacy of BEAVER, further research is necessary to explore its adaptability across various LLM architectures and applications.

Key Points

  • BEAVER is a training-free framework for hierarchical prompt compression.
  • BEAVER maps variable-length contexts into dense page-level tensors via dual-path pooling to maximize hardware parallelism, and uses structure-aware hierarchical selection to preserve discourse integrity.
  • BEAVER demonstrates comparable performance to state-of-the-art methods on four long-context benchmarks.
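To make the page-level mapping above concrete, here is a minimal sketch of what "dual-path pooling" into dense page tensors might look like. This is a hypothetical reading (the abstract does not specify the two paths); it assumes the paths are mean and max pooling over token embeddings within fixed-size pages, and the function name `pool_pages` is illustrative, not from the paper.

```python
import numpy as np

def pool_pages(token_embs: np.ndarray, page_size: int = 64) -> np.ndarray:
    """Map a variable-length token embedding sequence (T, d) into a dense
    page-level tensor (num_pages, 2*d) via dual-path (mean + max) pooling.

    NOTE: hypothetical interpretation of BEAVER's dual-path pooling;
    the paper's actual pooling paths may differ.
    """
    T, d = token_embs.shape
    num_pages = -(-T // page_size)  # ceil division
    pad = num_pages * page_size - T
    if pad:
        # Zero-pad so the sequence splits evenly into pages.
        token_embs = np.concatenate(
            [token_embs, np.zeros((pad, d), dtype=token_embs.dtype)]
        )
    pages = token_embs.reshape(num_pages, page_size, d)
    mean_path = pages.mean(axis=1)  # smooth per-page summary
    max_path = pages.max(axis=1)    # salient-feature per-page summary
    return np.concatenate([mean_path, max_path], axis=1)

# Example: 150 tokens with 8-dim embeddings -> 3 pages of 16-dim features.
feats = pool_pages(np.random.rand(150, 8), page_size=64)
print(feats.shape)  # (3, 16)
```

Because every page becomes a fixed-size tensor regardless of the original context length, page scoring can run as one dense batched operation, which is where the hardware-parallelism claim plausibly comes from.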

Merits

Scalability

BEAVER's ability to reduce latency by 26.4x on 128k contexts makes it a scalable solution for high-throughput applications.

Preservation of Discourse Integrity

BEAVER's hybrid planner preserves discourse integrity through a combination of semantic and lexical dual-branch selection and sentence smoothing.
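A rough sketch of how such a dual-branch selector with smoothing could work is shown below. All names and scoring choices here are assumptions for illustration: the lexical branch is approximated with unigram overlap, the semantic branch with cosine similarity over precomputed embeddings, and "sentence smoothing" is approximated by also keeping each selected page's immediate neighbors; BEAVER's actual branches and smoothing operate differently in detail.

```python
import math
from collections import Counter

def lexical_score(query: str, page: str) -> float:
    """Lexical branch: unigram overlap (a stand-in for BM25-style matching)."""
    q, p = Counter(query.lower().split()), Counter(page.lower().split())
    return sum((q & p).values()) / max(1, len(query.split()))

def semantic_score(q_vec, p_vec) -> float:
    """Semantic branch: cosine similarity between precomputed embeddings."""
    dot = sum(a * b for a, b in zip(q_vec, p_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in p_vec))
    return dot / norm if norm else 0.0

def hybrid_select(query, pages, q_vec, p_vecs, top_k=2, alpha=0.5):
    """Blend both branches, keep the top-k pages, then smooth the selection
    by including each kept page's neighbors so discourse units spanning a
    page boundary survive (a rough analogue of sentence smoothing)."""
    scores = [
        alpha * semantic_score(q_vec, pv) + (1 - alpha) * lexical_score(query, pg)
        for pg, pv in zip(pages, p_vecs)
    ]
    picked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:top_k]
    smoothed = set()
    for i in picked:
        smoothed.update({max(0, i - 1), i, min(len(pages) - 1, i + 1)})
    return sorted(smoothed)

pages = ["the beaver builds dams", "rivers flow downhill", "dams slow rivers"]
idx = hybrid_select(
    "beaver dams", pages,
    q_vec=[1.0, 0.0],
    p_vecs=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    top_k=1,
)
print(idx)  # [0, 1]: the top page plus its neighbor, kept by smoothing
```

The point of the smoothing step is that hard page boundaries can split a sentence or argument in two; widening the selection around each chosen page is one simple way to keep such units intact.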

Training-Free

BEAVER eliminates the need for extensive training, making it a cost-effective solution for LLM compression.

Demerits

Limited Adaptability

Further research is necessary to explore BEAVER's adaptability across various LLM architectures and applications.

Complexity

BEAVER's structure-aware hierarchical selection and hybrid planner may introduce additional complexity in implementation and maintenance.

Expert Commentary

The article presents a novel and interesting approach to LLM compression, addressing the pressing issues of inference latency and information utilization in long-context settings. While BEAVER's performance is comparable to state-of-the-art methods, further research is needed on its adaptability across architectures and its implementation complexity. The authors' claims of scalability and efficiency are promising, but it remains to be seen whether BEAVER can be deployed successfully in real-world applications. Nevertheless, the article contributes to the ongoing discussion on efficient LLM inference and paves the way for future research in this area.

Recommendations

  • Future research should focus on exploring BEAVER's adaptability across various LLM architectures and applications.
  • The authors should provide more detailed information on the implementation and maintenance of BEAVER's structure-aware hierarchical selection and hybrid planner.

Sources

Original: arXiv - cs.CL