Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction
arXiv:2604.05438v1 Announce Type: new Abstract: Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.
Executive Summary
The paper addresses a critical bottleneck in long-context generation: decode-time key-value (KV) cache traffic, which dominates inference cost when the cache is offloaded beyond GPU memory. Methods such as Top-K retrieval reduce this traffic by loading only a subset of KV pairs, but renormalizing the softmax over that subset biases the output whenever attention mass falls on unretrieved tokens. The authors propose a retrieval-completion attention module that preserves backbone weights and the KV-cache format: it computes exact attention over sink/tail anchor tokens and the retrieved Top-K tokens, estimates mid-region contributions from fixed-size feature-map summaries computed at prefill time, and combines the exact and estimated terms under a single normalization. This recovers the missing softmax mass without additional attention-side KV reads. On long-context benchmarks the method outperforms selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy attention heads, and holds significant promise for scalable long-context models by mitigating memory and bandwidth overheads.
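The core mechanism can be sketched with a minimal NumPy example. This is an illustrative reconstruction, not the authors' implementation: the feature map `phi` (elu + 1, a common linear-attention choice), the rescaling by the exact terms' max-shift, and all tensor shapes are assumptions on my part.

```python
import numpy as np

def phi(x):
    # Assumed positive feature map phi(x) = elu(x) + 1, a common choice in
    # linear attention; the paper's actual feature map may differ.
    return np.where(x > 0, x + 1.0, np.exp(x))

def completed_attention(q, K_exact, V_exact, S_mid, z_mid):
    """Combine exact softmax terms over anchors + retrieved Top-K with a
    linear-attention estimate of the unretrieved mid region, then apply a
    single normalization over the combined mass.

    q        (d,)    query
    K_exact  (m, d)  keys of sink/tail anchors and retrieved Top-K tokens
    V_exact  (m, d)  matching values
    S_mid    (d, d)  prefill summary  sum_i phi(k_i) v_i^T  over mid tokens
    z_mid    (d,)    prefill summary  sum_i phi(k_i)        over mid tokens
    """
    d = q.shape[0]
    # Exact unnormalized numerator/denominator over the retrieved set.
    scores = K_exact @ q / np.sqrt(d)
    shift = scores.max()
    w = np.exp(scores - shift)             # stabilized exp weights
    num_exact = w @ V_exact
    den_exact = w.sum()
    # Estimated mid-region terms: phi(q)^T phi(k) stands in for the exp
    # kernel, rescaled by exp(-shift) to share the exact terms' normalizer.
    fq = phi(q)
    num_mid = np.exp(-shift) * (fq @ S_mid)
    den_mid = np.exp(-shift) * (fq @ z_mid)
    # Single normalization recovers the missing softmax mass.
    return (num_exact + num_mid) / (den_exact + den_mid)
```

With empty mid summaries the computation reduces exactly to softmax attention over the retrieved set, which makes the completion term a strict add-on rather than a replacement.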
Key Points
- ▸ The paper targets the KV-cache memory bottleneck in long-context generation, where decode-time KV cache traffic limits scalability.
- ▸ Proposes a retrieval-completion attention module that computes exact attention over sink/tail anchors and Top-K tokens while estimating mid-region contributions via fixed-size feature-map summaries.
- ▸ Demonstrates improved performance over Top-K retrieval alone in long-context benchmarks, with significant gains in high-entropy attention heads.
- ▸ Preserves backbone weights and KV-cache format, avoiding structural modifications to the underlying model architecture.
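A toy sketch (illustrative numbers, not from the paper) shows why selection-only Top-K loses the most mass in high-entropy heads, which is where the completion term has the most to recover:

```python
import numpy as np

def topk_mass(logits, k):
    """Fraction of softmax mass that the k highest-weight tokens retain;
    renormalizing over only those tokens silently discards the rest."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return np.sort(p)[-k:].sum()

rng = np.random.default_rng(1)
n, k = 1024, 64
peaked = topk_mass(rng.normal(size=n) * 5.0, k)   # low-entropy head
flat   = topk_mass(rng.normal(size=n) * 0.2, k)   # high-entropy head
# A peaked head concentrates nearly all mass in its Top-K; a flat head
# leaves most of it on unretrieved tokens.
```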
Merits
Novel Attention Mechanism
Introduces a retrieval-completion attention module that estimates the softmax mass of unretrieved tokens without additional attention-side KV reads, preserving computational efficiency.
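The "no additional KV reads" property rests on the summary being fixed-size and precomputable. A hypothetical construction (again assuming an elu + 1 feature map; the paper's exact choice is not given here) makes both properties concrete:

```python
import numpy as np

def phi(x):
    # Assumed feature map phi(x) = elu(x) + 1 (positive-valued); illustrative.
    return np.where(x > 0, x + 1.0, np.exp(x))

def prefill_mid_summary(K_mid, V_mid):
    """Compress an arbitrarily long mid region into O(d^2) state at prefill
    time: S = sum_i phi(k_i) v_i^T and z = sum_i phi(k_i). Decode-time
    completion reads only (S, z), never the mid-region KV entries."""
    F = phi(K_mid)            # (n, d) feature-mapped keys
    S = F.T @ V_mid           # (d, d) summary numerator
    z = F.sum(axis=0)         # (d,)   summary denominator
    return S, z
```

Because the summaries are sums over tokens, they are additive across chunks, so they can be accumulated in a streaming fashion during chunked prefill.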
Scalability for Long-Context Models
Addresses a critical bottleneck in long-context generation by reducing KV cache traffic, enabling more efficient inference for models with extended context windows.
Preservation of Model Architecture
Maintains backbone weights and KV-cache format, ensuring compatibility with existing model structures and avoiding disruptive architectural changes.
Demerits
Complexity in Estimation
Relies on fixed-size feature-map summaries to estimate mid-region contributions; because these are computed once at prefill, the approximation may degrade in dynamic or highly variable contexts where mid-region attention patterns shift across queries.
Dependence on Anchor Tokens
Performance may degrade if anchor tokens (sink/tail) are not representative or if the precomputed summaries fail to capture nuanced attention patterns.
Limited Benchmark Diversity
The evaluation is primarily based on long-context benchmarks, leaving questions about generalizability to other domains or tasks where KV cache traffic may not be the primary bottleneck.
Expert Commentary
This paper presents a sophisticated and timely solution to a pressing challenge in the deployment of long-context generative AI models. The authors’ retrieval-completion attention module elegantly balances the need for computational efficiency with the preservation of model accuracy, addressing a critical bottleneck that has hindered the scalability of transformer-based architectures. The method’s reliance on fixed-size feature-map summaries for estimating attention contributions is a particularly clever innovation, as it avoids the need for additional KV reads while still recovering missing softmax mass. However, the approach does introduce a degree of approximation, which may become more pronounced in highly dynamic or contextually complex scenarios. Future work could explore adaptive mechanisms for refining the feature-map summaries or validating the method across a broader range of tasks to ensure robustness. Overall, this work represents a significant advancement in the field of efficient attention mechanisms and holds substantial promise for enabling more scalable and practical long-context AI systems.
Recommendations
- ✓ Further empirical validation is needed to assess the method’s robustness across diverse benchmarks and real-world applications, particularly in domains with highly variable attention patterns.
- ✓ Explore hybrid approaches that combine retrieval-completion attention with other KV-cache optimization techniques, such as quantization or pruning, to further enhance efficiency.
- ✓ Investigate the integration of this method into existing model architectures and frameworks to evaluate its practicality and adoption potential in production environments.
Sources
Original: arXiv - cs.LG