CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
arXiv:2602.20732v1 Announce Type: new

Abstract: Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by the KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose CHESS, an algorithm-system co-design KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding step. System-wise, coarse-granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines. Code is available at https://anonymous.4open.science/r/CHESS-9958/.
Executive Summary
The proposed CHESS system addresses the limitations of long-context LLM inference with a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context at each decoding step. This algorithm-system co-design enables accurate, low-latency inference: it surpasses Full-KV quality while using only 1% of the KV cache and delivers up to 4.56 times higher throughput than strong baselines. By selecting at coarse granularity, CHESS eliminates expensive data movement and turns theoretical sparsity into practical wall-clock acceleration.
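The hierarchical, coarse-granularity selection described above can be sketched at a high level: score fixed-size blocks of the KV cache with a cheap per-block representative, then attend only over the top-scoring blocks. The block size, budget, and block-mean scoring rule below are illustrative assumptions, not the paper's exact policy:

```python
import numpy as np

def select_blocks(query, keys, block_size=16, budget_blocks=4):
    """Score KV-cache blocks by a cheap representative (block-mean key)
    and keep only the top-scoring blocks. This is an assumed scoring
    rule for illustration, not CHESS's exact criterion."""
    n, d = keys.shape
    n_blocks = n // block_size
    # One representative key per block: the mean over the block.
    reps = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    scores = reps @ query                      # (n_blocks,)
    top = np.argsort(scores)[-budget_blocks:]  # indices of selected blocks
    return np.sort(top)

def sparse_attention(query, keys, values, block_size=16, budget_blocks=4):
    """Attend only over the selected contiguous blocks, so the system can
    read whole blocks instead of gathering scattered tokens."""
    blocks = select_blocks(query, keys, block_size, budget_blocks)
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in blocks]
    )
    k, v = keys[idx], values[idx]              # contiguous per-block slices
    logits = k @ query / np.sqrt(keys.shape[1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((256, 64))
V = rng.standard_normal((256, 64))
out = sparse_attention(q, K, V)
print(out.shape)  # (64,)
```

Because selection happens per decoding step against the current query, the attended context is rebuilt dynamically; because it happens at block granularity, the selected KV entries stay contiguous in memory.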
Key Points
- ▸ CHESS introduces a context-aware, hierarchical selection policy for KV-cache management
- ▸ The system achieves low-latency stable inference with up to 4.56 times higher throughput
- ▸ CHESS surpasses Full-KV quality using only 1% of the KV cache
Merits
Efficient KV-Cache Management
The CHESS system efficiently manages the KV cache, using coarse-granularity selection to reduce the overhead of data movement and token selection.
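A back-of-the-envelope sizing calculation shows why a 1% KV-cache budget matters at long context. The model dimensions below (a 7B-class configuration in fp16) are assumptions for illustration, not figures from the paper:

```python
# Assumed 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16.
layers, heads, head_dim = 32, 32, 128
bytes_per_elem = 2          # fp16
ctx = 128_000               # tokens of context

# K and V per token: 2 tensors * layers * heads * head_dim elements.
kv_bytes_per_token = 2 * layers * heads * head_dim * bytes_per_elem
full_gb = ctx * kv_bytes_per_token / 1e9
print(f"full KV cache: {full_gb:.1f} GB")      # ~67 GB at 128K tokens
print(f"1% budget:     {full_gb * 0.01:.2f} GB")
```

Under these assumptions the full cache would dwarf a single accelerator's memory, while a 1% budget fits comfortably and shrinks the per-step memory traffic accordingly.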
High-Quality Inference
CHESS achieves high-quality inference, surpassing Full-KV quality while using only 1% of the KV cache.
Demerits
Complexity of Implementation
The CHESS system may require significant modifications to existing LLM architectures, which could be complex and time-consuming to implement
Expert Commentary
The CHESS system represents a significant advance in LLM inference optimization. Its context-aware, hierarchical selection policy manages the KV cache efficiently and delivers high-quality inference at low latency. The implications are far-reaching, with potential applications across natural language processing and machine learning more broadly. However, implementation complexity may be a significant challenge, requiring careful modification of existing LLM architectures and serving systems.
Recommendations
- ✓ Further research is needed to explore the potential applications of the CHESS system in various fields
- ✓ The development of more efficient and scalable LLM architectures should be prioritized, with a focus on optimizing inference performance