CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
arXiv:2602.20732v1 Announce Type: new

Abstract: Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by the KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose CHESS, an algorithm-system co-design KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding step. System-wise, coarse-granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines. Code is available at https://anonymous.4open.science/r/CHESS-9958/.
Executive Summary
The proposed CHESS system addresses the limitations of long-context LLM inference with a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context at each decoding step. This algorithm-system co-design enables accurate, low-latency inference: it surpasses Full-KV quality while using only 1% of the KV cache and delivers up to 4.56 times higher throughput than strong baselines. By selecting at coarse granularity, CHESS eliminates expensive data movement and turns theoretical sparsity into practical wall-clock acceleration.
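The hierarchical, coarse-granularity selection described above can be sketched at a high level: score fixed-size blocks of the KV cache with a cheap per-block representative, then attend only over the top-scoring blocks. The block size, budget, and block-mean scoring rule below are illustrative assumptions, not the paper's exact policy:

```python
import numpy as np

def select_blocks(query, keys, block_size=16, budget_blocks=4):
    """Score KV-cache blocks by a cheap representative (block-mean key)
    and keep only the top-scoring blocks. This is an assumed scoring
    rule for illustration, not CHESS's exact criterion."""
    n, d = keys.shape
    n_blocks = n // block_size
    # One representative key per block: the mean over the block.
    reps = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    scores = reps @ query                      # (n_blocks,)
    top = np.argsort(scores)[-budget_blocks:]  # indices of selected blocks
    return np.sort(top)

def sparse_attention(query, keys, values, block_size=16, budget_blocks=4):
    """Attend only over the selected contiguous blocks, so the system can
    read whole blocks instead of gathering scattered tokens."""
    blocks = select_blocks(query, keys, block_size, budget_blocks)
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in blocks]
    )
    k, v = keys[idx], values[idx]              # contiguous per-block slices
    logits = k @ query / np.sqrt(keys.shape[1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((256, 64))
V = rng.standard_normal((256, 64))
out = sparse_attention(q, K, V)
print(out.shape)  # (64,)
```

Because selection happens per decoding step against the current query, the attended context is rebuilt dynamically; because it happens at block granularity, the selected KV entries stay contiguous in memory.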
Key Points
- ▸ CHESS introduces a context-aware, hierarchical selection policy for KV-cache management
- ▸ The system achieves low-latency stable inference with up to 4.56 times higher throughput
- ▸ CHESS surpasses Full-KV quality using only 1% of the KV cache
Merits
Efficient KV-Cache Management
The CHESS system efficiently manages the KV cache, using coarse-granularity selection to reduce the overhead of data movement and token selection.
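A back-of-the-envelope sizing calculation shows why a 1% KV-cache budget matters at long context. The model dimensions below (a 7B-class configuration in fp16) are assumptions for illustration, not figures from the paper:

```python
# Assumed 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16.
layers, heads, head_dim = 32, 32, 128
bytes_per_elem = 2          # fp16
ctx = 128_000               # tokens of context

# K and V per token: 2 tensors * layers * heads * head_dim elements.
kv_bytes_per_token = 2 * layers * heads * head_dim * bytes_per_elem
full_gb = ctx * kv_bytes_per_token / 1e9
print(f"full KV cache: {full_gb:.1f} GB")      # ~67 GB at 128K tokens
print(f"1% budget:     {full_gb * 0.01:.2f} GB")
```

Under these assumptions the full cache would dwarf a single accelerator's memory, while a 1% budget fits comfortably and shrinks the per-step memory traffic accordingly.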
High-Quality Inference
CHESS achieves high-quality inference, surpassing Full-KV quality while using only 1% of the KV cache.
Demerits
Complexity of Implementation
The CHESS system may require significant modifications to existing LLM architectures, which could be complex and time-consuming to implement
Expert Commentary
The CHESS system represents a significant advance in LLM inference optimization. Its context-aware, hierarchical selection policy manages the KV cache efficiently and delivers high-quality inference at low latency. The implications are far-reaching, with potential applications across natural language processing and machine learning more broadly. However, implementation complexity may be a significant challenge, requiring careful modification of existing LLM architectures and serving systems.
Recommendations
- ✓ Further research is needed to explore the potential applications of the CHESS system in various fields
- ✓ The development of more efficient and scalable LLM architectures should be prioritized, with a focus on optimizing inference performance