
Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction


Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi

arXiv:2604.05438v1

Abstract: Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.

Executive Summary

The paper tackles a key inference bottleneck in long-context generation: decode-time key-value (KV) cache read traffic, which dominates when the cache is offloaded beyond GPU memory. Query-aware methods such as Top-K retrieval reduce this traffic by loading only a subset of KV pairs, but renormalizing the softmax over that subset introduces bias whenever attention mass falls on unretrieved tokens. The authors propose a retrieval-completion attention module that preserves backbone weights and the KV-cache format: it computes exact attention over sink/tail anchor tokens and the retrieved Top-K tokens, estimates the mid-region contribution from a fixed-size feature-map summary precomputed at prefill time, and applies a single normalization over the combined unnormalized terms. This recovers the missing softmax mass without additional attention-side KV reads, and the method outperforms selection-only Top-K at matched token-equivalent read budgets on long-context benchmarks, with the largest gains in high-entropy attention heads.
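One plausible reading of the combination step described in the abstract can be written as follows; the feature map φ and the exact form of the summaries are assumptions based on standard linear-attention constructions, not details taken from the paper. Exact unnormalized softmax terms over the retrieved set ℛ (sinks, tail anchors, and Top-K tokens) are added to feature-map estimates over the mid region ℳ, and one normalization is applied:

```latex
o(q) \;\approx\;
\frac{\sum_{i \in \mathcal{R}} e^{q^\top k_i}\, v_i \;+\; \phi(q)^\top S_{\mathcal{M}}}
     {\sum_{i \in \mathcal{R}} e^{q^\top k_i} \;+\; \phi(q)^\top z_{\mathcal{M}}},
\qquad
S_{\mathcal{M}} = \sum_{j \in \mathcal{M}} \phi(k_j)\, v_j^\top,
\quad
z_{\mathcal{M}} = \sum_{j \in \mathcal{M}} \phi(k_j).
```

Because S_ℳ (a d×d matrix) and z_ℳ (a d-vector) have fixed size regardless of context length, they can be computed once at prefill and consulted at every decode step without re-reading the mid-region KV pairs.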

Key Points

  • The paper targets the KV-cache memory bottleneck in long-context generation, where decode-time KV cache traffic limits scalability.
  • Proposes a retrieval-completion attention module that computes exact attention over sink/tail anchors and Top-K tokens while estimating mid-region contributions via fixed-size feature-map summaries.
  • Demonstrates improved performance over Top-K retrieval alone in long-context benchmarks, with significant gains in high-entropy attention heads.
  • Preserves backbone weights and KV-cache format, avoiding structural modifications to the underlying model architecture.
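The mechanism sketched in the points above can be illustrated with a minimal NumPy prototype. This is a sketch under assumptions, not the authors' implementation: the feature map `phi`, the single-head/single-query setting, and the omission of softmax temperature and numerical stabilization are all simplifications for clarity.

```python
import numpy as np

def phi(x):
    # A simple positive feature map, one common choice in linear attention.
    # The paper's actual feature map is an assumption here.
    return np.maximum(x, 0.0) + 1e-6

def completion_attention(q, K, V, retrieved_idx, mid_idx):
    """Sketch: exact attention over retrieved tokens plus a linear-attention
    estimate of the mid region, combined before a single normalization.

    q: (d,) query; K, V: (n, d) key/value caches;
    retrieved_idx: indices attended exactly (sinks, tail, Top-K);
    mid_idx: indices summarized by the fixed-size feature-map state.
    """
    # Exact unnormalized softmax terms over the retrieved set.
    Kr, Vr = K[retrieved_idx], V[retrieved_idx]
    w = np.exp(Kr @ q)                 # unnormalized exp scores
    num_exact = w @ Vr                 # numerator contribution
    den_exact = w.sum()                # denominator contribution

    # Fixed-size summary of the mid region (computed once at prefill
    # in the real method): S = sum phi(k) v^T, z = sum phi(k).
    Km, Vm = K[mid_idx], V[mid_idx]
    S = phi(Km).T @ Vm                 # (d, d) summary matrix
    z = phi(Km).sum(axis=0)            # (d,) summary vector

    # Estimated mid-region numerator and denominator via the feature map.
    num_est = phi(q) @ S
    den_est = phi(q) @ z

    # Single normalization over the combined exact + estimated mass.
    return (num_exact + num_est) / (den_exact + den_est)
```

Note that when the mid region is empty, the summaries vanish and the function reduces to exact softmax attention over the retrieved set, which is a useful sanity check on the construction.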

Merits

Novel Attention Mechanism

Introduces a retrieval-completion attention module that estimates the softmax mass contributed by unretrieved tokens without additional attention-side KV reads, preserving computational efficiency.

Scalability for Long-Context Models

Addresses a critical bottleneck in long-context generation by reducing KV cache traffic, enabling more efficient inference for models with extended context windows.

Preservation of Model Architecture

Maintains backbone weights and KV-cache format, ensuring compatibility with existing model structures and avoiding disruptive architectural changes.

Demerits

Complexity in Estimation

Relies on fixed-size feature-map summaries for estimating mid-region contributions, which may introduce approximation errors in dynamic or highly variable contexts.

Dependence on Anchor Tokens

Performance may degrade if anchor tokens (sink/tail) are not representative or if the precomputed summaries fail to capture nuanced attention patterns.

Limited Benchmark Diversity

The evaluation is primarily based on long-context benchmarks, leaving questions about generalizability to other domains or tasks where KV cache traffic may not be the primary bottleneck.

Expert Commentary

This paper presents a sophisticated and timely solution to a pressing challenge in the deployment of long-context generative AI models. The authors’ retrieval-completion attention module elegantly balances the need for computational efficiency with the preservation of model accuracy, addressing a critical bottleneck that has hindered the scalability of transformer-based architectures. The method’s reliance on fixed-size feature-map summaries for estimating attention contributions is a particularly clever innovation, as it avoids the need for additional KV reads while still recovering missing softmax mass. However, the approach does introduce a degree of approximation, which may become more pronounced in highly dynamic or contextually complex scenarios. Future work could explore adaptive mechanisms for refining the feature-map summaries or validating the method across a broader range of tasks to ensure robustness. Overall, this work represents a significant advancement in the field of efficient attention mechanisms and holds substantial promise for enabling more scalable and practical long-context AI systems.

Recommendations

  • Further empirical validation is needed to assess the method’s robustness across diverse benchmarks and real-world applications, particularly in domains with highly variable attention patterns.
  • Explore hybrid approaches that combine retrieval-completion attention with other KV-cache optimization techniques, such as quantization or pruning, to further enhance efficiency.
  • Investigate the integration of this method into existing model architectures and frameworks to evaluate its practicality and adoption potential in production environments.

Sources

Original: arXiv - cs.LG