IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

arXiv:2603.12201v1 Announce Type: new Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).

Executive Summary

The article introduces IndexCache, an optimization for sparse attention in large language models that targets cross-layer index redundancy. While DeepSeek Sparse Attention (DSA) reduces core attention complexity from O(L²) to O(Lk) via top-k selection, its lightning indexer still incurs O(L²) cost at every layer, even though the resulting top-k selections are highly similar across consecutive layers. IndexCache exploits this by partitioning the network into a small set of Full layers that retain independent indexers and a majority of Shared layers that reuse the top-k indices of the nearest Full layer, substantially reducing indexer overhead. Two complementary strategies are proposed to choose this partition: a training-free greedy search over a calibration set and a training-aware multi-layer distillation loss. Empirical results on a 30B DSA model show that 75% of indexer computations can be removed with negligible quality degradation, yielding up to 1.82× prefill and 1.48× decode speedups over standard DSA. Preliminary experiments on the production-scale GLM-5 model further support these findings.
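The Full/Shared mechanism described above can be illustrated in a few lines: only Full layers pay the O(L²) indexer cost, and every Shared layer looks up the cached top-k indices of the nearest preceding Full layer. The layer pattern, cache layout, and `plan_index_reuse` helper here are illustrative assumptions, not the paper's implementation; the sketch assumes layer 0 is a Full layer.

```python
import numpy as np

def topk_indices(scores, k):
    """Return indices of the k highest-scoring tokens per query row."""
    return np.argpartition(scores, -k, axis=-1)[:, -k:]

def plan_index_reuse(num_layers, full_layers):
    """Map every layer to the nearest preceding Full layer whose
    top-k indices it reuses (Full layers map to themselves)."""
    plan, current = {}, full_layers[0]
    for layer in range(num_layers):
        if layer in full_layers:
            current = layer
        plan[layer] = current
    return plan

rng = np.random.default_rng(0)
L, k, num_layers = 16, 4, 8
full_layers = [0, 4]                         # hypothetical interleaved pattern
plan = plan_index_reuse(num_layers, full_layers)

index_cache = {}
for layer in range(num_layers):
    src = plan[layer]
    if src == layer:                         # Full layer: run its own indexer
        scores = rng.standard_normal((L, L)) # stand-in for indexer logits
        index_cache[layer] = topk_indices(scores, k)
    indices = index_cache[src]               # Shared layers just hit the cache
```

With 2 Full layers out of 8, the indexer runs only twice instead of eight times, matching the spirit of the 75% reduction reported in the abstract.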

Key Points

  • IndexCache reduces indexer overhead by reusing cross-layer indices
  • Two complementary strategies (greedy and distillation) optimize layer partitioning
  • Empirical gains demonstrated on a 30B DSA model, with preliminary confirmation on GLM-5
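The training-free strategy in the list above is a greedy search that grows the set of Full (indexer-retaining) layers, at each step adding the layer that most reduces calibration loss. The sketch below uses a toy stand-in loss in place of actual language modeling loss on a calibration set; `greedy_select_full_layers` and `toy_loss` are hypothetical names for illustration only.

```python
def greedy_select_full_layers(num_layers, budget, calib_loss):
    """Greedily pick which layers keep their indexers: at each step,
    add the layer whose inclusion minimizes the calibration loss."""
    full = set()
    while len(full) < budget:
        best_layer, best_loss = None, float("inf")
        for layer in range(num_layers):
            if layer in full:
                continue
            loss = calib_loss(full | {layer})
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        full.add(best_layer)
    return sorted(full)

def toy_loss(full_layers, num_layers=8):
    """Toy proxy: penalize each layer by the squared distance to the
    Full layer it would reuse (real use: LM loss on calibration data)."""
    cost, last = 0.0, 0
    for layer in range(num_layers):
        if layer in full_layers:
            last = layer
        cost += (layer - last) ** 2
    return cost

chosen = greedy_select_full_layers(8, 2, toy_loss)
```

Under this toy objective the greedy search spreads the retained indexers across the depth of the model, which mirrors the intuition that reuse distance should stay small.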

Merits

Computational Efficiency

IndexCache removes 75% of indexer computations with negligible quality degradation, directly improving inference speed and serving cost.

Scalability

The hybrid architecture scales effectively across large models by leveraging redundancy patterns, making it suitable for production-scale deployments.

Demerits

Implementation Complexity

Configuring the hybrid layer partitioning requires careful tuning of Full/Shared layer boundaries, which may introduce overhead during model adaptation or deployment.

Generalizability Constraint

Results are limited to DSA-style indexers, evaluated on a 30B DSA model and, preliminarily, on GLM-5; applicability to other sparse attention mechanisms or transformer architectures remains unconfirmed.

Expert Commentary

IndexCache represents a sophisticated yet practical solution to a persistent inefficiency in sparse attention architectures. The insight that cross-layer index similarity can be exploited via shared caching is elegant and aligns with broader trends in computational efficiency research. The dual-strategy approach, combining training-free calibration with distillation-based training, demonstrates a nuanced understanding of both computational and learning dynamics. Notably, the ability to reach near-baseline accuracy while dropping most indexer computations suggests that similar reuse techniques could generalize beyond sparse attention to other repeated per-layer components in transformer architectures. The empirical validation on a 30B DSA model, with preliminary GLM-5 results, adds credibility and suggests IndexCache is not a one-off tweak but a potentially reusable design pattern for efficiency engineering in large-scale AI systems.
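The distillation-based training mentioned above trains each retained indexer against the averaged attention distributions of all layers it serves. One natural formulation, sketched here as an assumption since the paper's exact loss is not reproduced in the abstract, is a KL divergence between that averaged target and the indexer's own softmax distribution.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable row-wise softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multilayer_distill_loss(indexer_logits, served_attn_logits):
    """Mean per-query KL(target || indexer), where the target is the
    average attention distribution over all layers the indexer serves."""
    target = np.mean([softmax(a) for a in served_attn_logits], axis=0)
    pred = softmax(indexer_logits)
    kl = np.sum(target * (np.log(target) - np.log(pred)))
    return float(kl / target.shape[0])

rng = np.random.default_rng(1)
L = 8
served = [rng.standard_normal((L, L)) for _ in range(3)]  # layers served

# An indexer whose logits equal log of the averaged target gives zero loss.
aligned_logits = np.log(np.mean([softmax(a) for a in served], axis=0))
loss_aligned = multilayer_distill_loss(aligned_logits, served)
loss_random = multilayer_distill_loss(rng.standard_normal((L, L)), served)
```

Averaging the targets is what lets a single retained indexer serve several Shared layers at once, which is the property that makes even simple interleaved patterns viable.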

Recommendations

  • Researchers should evaluate IndexCache applicability across diverse sparse attention variants and transformer architectures.
  • Industry teams deploying sparse attention models should consider integrating IndexCache as a standard optimization layer in inference pipelines.
