The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

arXiv:2603.19664v1 Announce Type: new Abstract: The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.
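The core claim, that keys and values are deterministic projections of the residual stream, can be illustrated with a minimal sketch. The matrices and dimensions below are toy stand-ins, not weights from any of the paper's models; the point is only that applying the same projection to the same checkpointed residual vector reproduces the cached tensors bit-identically:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 64, 16, 8

# Hypothetical per-layer key/value projection weights. The paper's claim
# concerns real transformer weights; random matrices suffice to show the
# determinism argument.
W_K = rng.standard_normal((d_model, d_head)).astype(np.float32)
W_V = rng.standard_normal((d_model, d_head)).astype(np.float32)

# Residual-stream vectors for each token -- the single per-token state
# that a KV-Direct-style scheme would checkpoint.
residual = rng.standard_normal((n_tokens, d_model)).astype(np.float32)

# What a standard KV cache would store.
K_cached = residual @ W_K
V_cached = residual @ W_V

# On-demand recomputation from the checkpointed residuals: the same
# deterministic projection, so the result is bit-identical, not approximate.
K_recomputed = residual @ W_K
V_recomputed = residual @ W_V

assert np.array_equal(K_cached, K_recomputed)  # exact, bitwise equality
assert np.array_equal(V_cached, V_recomputed)
```

Because floating-point matrix multiplication is deterministic for fixed inputs and operation order, the reconstruction error here is exactly zero, which is the sense in which the abstract says "not approximately, but bit-identically."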

Executive Summary

This article challenges the conventional wisdom that key-value (KV) caches are essential state in transformer inference. By demonstrating that the residual stream is sufficient to reconstruct KV entries bit-identically, the authors show that the cache is redundant. The study verifies this finding across six models from four architecture families and proposes KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors and recomputes KV entries on demand. The reported results are concrete: peak memory held at 42 MB versus 103 MB for a standard cache over 20 conversation turns, 100% token match where eviction baselines degrade to 5-28%, and recomputation up to 5x faster than reading cached tensors at moderate batch sizes. This work has important implications for efficient transformer inference and points toward novel inference strategies.
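The memory claim follows directly from the paper's per-token figures for Gemma 3-4B: 5 KB for a residual checkpoint versus 136 KB for full per-layer KV pairs. A back-of-envelope sketch (the token counts below are illustrative, not from the paper):

```python
# Per-token storage, in KB, as reported for Gemma 3-4B in the abstract.
per_token_residual_kb = 5    # one residual vector per token (KV-Direct)
per_token_kv_kb = 136        # full keys/values across all layers (standard cache)

for n_tokens in (1_000, 8_000, 32_000):
    residual_mb = n_tokens * per_token_residual_kb / 1024
    kv_mb = n_tokens * per_token_kv_kb / 1024
    print(f"{n_tokens:>6} tokens: residual checkpoints {residual_mb:7.1f} MB "
          f"vs KV cache {kv_mb:7.1f} MB "
          f"({per_token_kv_kb / per_token_residual_kb:.1f}x smaller)")
```

The roughly 27x per-token reduction is consistent in order of magnitude with the reported 42 MB versus 103 MB peak over 20 conversation turns, where fixed overheads narrow the gap.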

Key Points

  • The residual stream is sufficient to reconstruct KV entries, making the KV cache redundant.
  • KV-Direct, a new bounded-memory inference scheme, checkpoints residual vectors instead of full KV pairs and recomputes KV entries on demand, holding peak memory at 42 MB versus 103 MB for the standard cache over 20 conversation turns.
  • This work challenges conventional wisdom on the role of KV caches in transformer inference and highlights the potential for novel inference strategies.

Merits

Strength in Theory

The study provides a rigorous theoretical framework for understanding the relationship between the residual stream and KV entries, demonstrating that the former is sufficient to reconstruct the latter.

Practical Impact

The authors propose a novel inference scheme, KV-Direct, which offers significant memory and latency savings over traditional caching approaches, making it a practical and impactful contribution to the field.

Demerits

Limitation in Generalizability

The study is limited to six models from four architecture families (135M to 4B parameters) and may not generalize to other architectures, larger model scales, or different deployment settings.

Implementation Complexity

The proposed KV-Direct scheme may introduce significant implementation complexity and require additional compute to recompute KV entries on demand, particularly in regimes where memory bandwidth is not the bottleneck.

Expert Commentary

The study makes a significant contribution to the field of transformer inference by questioning a core assumption about the KV cache. The proposed KV-Direct scheme offers a promising alternative to eviction- and compression-based approaches, but its implementation complexity and its generalizability beyond the six models tested remain open questions. The findings motivate continued research into inference strategies that trade recomputation for memory.

Recommendations

  • Future research should focus on extending the study's findings to other transformer architectures and applications to better understand the generalizability of the proposed KV-Direct scheme.
  • Developers and practitioners should consider the potential benefits of adopting the proposed KV-Direct scheme in their applications, particularly in scenarios where memory and latency are critical constraints.

Sources

Original: arXiv - cs.LG