CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis

arXiv:2602.20732v1. Abstract: Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose **CHESS**, an *algorithm-system co-design* KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding. System-wise, coarse granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only **1%** of the KV cache, delivers low-latency stable inference with up to **4.56×** higher throughput, and consistently outperforms other strong baselines. Code is available at https://anonymous.4open.science/r/CHESS-9958/.

Executive Summary

The proposed CHESS system addresses the limitations of long-context LLM inference with a context-aware, hierarchical KV-cache selection policy. This enables accurate inference at low latency, outperforming strong baselines while using only 1% of the KV cache and delivering up to 4.56× higher throughput. As an algorithm-system co-design, CHESS dynamically reconstructs a coherent context for the current decoding step, while its coarse-grained selection eliminates expensive data movement and turns theoretical sparsity into practical wall-clock acceleration.

Key Points

  • CHESS introduces a context-aware, hierarchical selection policy for KV-cache management
  • The system delivers low-latency, stable inference with up to 4.56× higher throughput
  • CHESS surpasses Full-KV quality using only 1% of the KV cache
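
The core idea behind the key points above, context-aware selection at coarse (block) granularity, can be sketched roughly as follows. This is a minimal illustration of block-level top-k KV selection in general, not the authors' actual algorithm; the function name, block size, and scoring rule are all illustrative assumptions.

```python
import numpy as np

def select_kv_blocks(query, keys, block_size=16, keep_ratio=0.01):
    """Score contiguous KV-cache blocks against the current query and keep
    only the top fraction, so selection operates on whole blocks rather
    than individual tokens. Illustrative sketch, not the CHESS kernel."""
    n_tokens, d = keys.shape
    n_blocks = n_tokens // block_size
    blocks = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    # One representative relevance score per block: the max query-key
    # dot product among the tokens inside that block.
    scores = np.einsum("d,bkd->bk", query, blocks).max(axis=1)
    n_keep = max(1, int(np.ceil(n_blocks * keep_ratio)))
    kept = np.sort(np.argsort(scores)[-n_keep:])  # keep positional order
    return kept  # indices of blocks whose KV entries stay in the working set

rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
blocks = select_kv_blocks(query, keys)
print(blocks)  # indices of roughly 1% of the 256 blocks
```

Because the kept set is recomputed against the current query at each decoding step, the retained context adapts to step-wise relevance rather than being fixed at prefill time.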

Merits

Efficient KV-Cache Management

The CHESS system manages the KV cache efficiently, reducing the overhead of both data movement and token selection.
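
As a hedged illustration of why coarse-grained selection is system-friendly (this demonstrates the general memory-access principle, not the paper's implementation): gathering scattered token rows forces a copy from irregular addresses, whereas reading one contiguous block is a zero-copy sequential access. The array names below are illustrative.

```python
import numpy as np

kv_cache = np.zeros((4096, 128), dtype=np.float32)

scattered_idx = np.array([3, 97, 511, 2048, 4000])
token_gather = kv_cache[scattered_idx]  # fancy indexing: allocates and copies
block_read = kv_cache[1024:1040]        # basic slice: zero-copy view

print(token_gather.base is None)    # True  -> a new buffer was allocated
print(block_read.base is kv_cache)  # True  -> no data was moved
```

The same distinction holds on GPUs, where scattered gathers waste memory bandwidth while contiguous block reads coalesce, which is one reason token-level pruning often fails to translate sparsity into wall-clock speedup.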

High-Quality Inference

CHESS achieves high-quality inference, surpassing Full-KV quality while using only 1% of the cache.

Demerits

Complexity of Implementation

The CHESS system may require significant modifications to existing LLM serving stacks, which could be complex and time-consuming to implement.

Expert Commentary

The CHESS system represents a significant advancement in the field of LLM inference optimization. By introducing a context-aware, hierarchical selection policy, the system is able to efficiently manage the KV cache and achieve high-quality inference at low latency. The implications of this research are far-reaching, with potential applications in a wide range of fields, including natural language processing and machine learning. However, the complexity of implementation may be a significant challenge, requiring careful consideration and modification of existing LLM architectures.

Recommendations

  • Further research is needed to explore the potential applications of the CHESS system in various fields
  • The development of more efficient and scalable LLM architectures should be prioritized, with a focus on optimizing inference performance
