EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models
arXiv:2603.18489v1

Abstract: Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
Executive Summary
This study presents EntropyCache, a training-free key-value (KV) caching method for diffusion language models. By using the maximum entropy of newly decoded token distributions as a constant-cost signal for skip-or-recompute decisions, EntropyCache achieves up to 26.4x speedup on standard benchmarks while maintaining competitive accuracy. The decision itself requires only O(V) computation per step, independent of context length and model scale. This work highlights the promise of entropy-based caching for accelerating dLLM inference, and the authors' empirical observations and design choices reflect a solid understanding of the mechanisms behind KV cache staleness in diffusion language models.
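The constant-cost signal can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: `max_decoded_entropy` is a hypothetical helper that computes the Shannon entropy of each newly decoded token's output distribution and returns the maximum, costing O(V) per decoded token regardless of context length or model depth.

```python
import numpy as np

def max_decoded_entropy(logits: np.ndarray) -> float:
    """Maximum Shannon entropy over the tokens decoded this step.

    logits: shape (num_decoded, V), one row per token unmasked
    at the current denoising step. Cost is O(num_decoded * V),
    independent of context length and model depth.
    """
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=-1)  # (num_decoded,)
    return float(entropy.max())
```

A uniform distribution over V tokens yields entropy ln(V), the maximum possible; a sharply peaked distribution yields entropy near zero, suggesting the cache is still fresh.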
Key Points
- ▸ EntropyCache is a training-free KV caching method for diffusion language models
- ▸ The method uses decoded token entropy as a constant-cost signal for deciding when to recompute
- ▸ Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate 15.2x-26.4x speedup on standard benchmarks with competitive accuracy
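The two empirical observations translate into a simple control policy: always refresh the k most recently decoded positions (observation 2, post-unmasking volatility), and trigger a full KV recompute only when the maximum decoded-token entropy exceeds a threshold (observation 1, entropy as a staleness proxy). A minimal sketch of that bookkeeping, with hypothetical names (`RecomputeScheduler`, `tau`, `k`) not taken from the paper:

```python
class RecomputeScheduler:
    """Toy skip-or-recompute bookkeeping; illustrative only."""

    def __init__(self, tau: float, k: int):
        self.tau = tau             # entropy threshold for a full KV refresh
        self.k = k                 # window of recent tokens to always recompute
        self.recent: list[int] = []

    def decide(self, decoded_positions: list[int], max_entropy: float):
        """Return (full_recompute, positions_to_refresh) for the next step."""
        self.recent.extend(decoded_positions)
        full_recompute = max_entropy > self.tau  # cheap staleness check
        refresh = self.recent[-self.k:]          # volatility persists ~k steps
        return full_recompute, refresh
```

Because the decision consumes only a scalar entropy value and a list of positions, its cost does not grow with context length or model depth, which is the key to the reported 0.5% decision overhead.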
Merits
Strength in Empirical Evaluation
The authors provide comprehensive experiments on LLaDA-8B-Instruct and Dream-7B-Instruct, showcasing the efficacy of EntropyCache in real-world scenarios.
Innovative Use of Entropy Signal
The proposal to utilize decoded token entropy as a constant-cost signal for caching decisions is a novel and effective approach, demonstrating a deep understanding of the underlying mechanisms.
Demerits
Potential Overfitting
The method relies on empirical observations, which may not generalize to other datasets or models. Further investigation into the robustness and transferability of EntropyCache is warranted.
Scalability and Complexity
While the decision cost is independent of context length and model depth, it still scales linearly with the vocabulary size V; for models with very large vocabularies, this per-step O(V) term could become non-negligible and deserves further analysis.
Expert Commentary
The study presents a well-designed and empirically validated approach to KV caching for diffusion language models. While potential limitations exist, the method delivers significant speedup with competitive accuracy, making it a valuable contribution to the field. The innovative use of the entropy signal and the comprehensive experiments showcase the authors' expertise. As the field evolves, efficient inference methods like EntropyCache will remain crucial for unlocking the full potential of diffusion-based large language models.
Recommendations
- ✓ Further investigation into the robustness and transferability of EntropyCache across different datasets and models is warranted.
- ✓ The authors should explore potential applications of EntropyCache in other areas of natural language processing and AI-related fields.