
Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection


Andrey Pustovit

arXiv:2604.03270v1. Abstract: RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to what a joint pass on F+q would produce; this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat template formatting causes 6-7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid-layer values (33-66%), independent directions are nearly orthogonal (cos~0) and compose, and both channels (knowledge and steering) run simultaneously at alpha<=0.7 without interference. No training, no weight modification.

Executive Summary

This article proposes Knowledge Packs: pre-computed KV caches that deliver knowledge at zero token cost. The approach rests on the causal attention mask, which guarantees that the KV cache computed over a fact text alone is identical to the cache a joint pass over the fact text plus a question would produce. The authors report zero answer divergences across 700 questions and up to 95% token savings on Qwen3-8B and Llama-3.1-8B. The KV interface additionally enables behavioral steering via contrastive deltas on cached values, something Retrieval-Augmented Generation (RAG) cannot do. The method requires no training or weight modification. The authors stress the importance of correct chat template formatting: getting it wrong costs 6-7 percentage points of accuracy and, they argue, explains prior claims that KV injection outperforms RAG.
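The prefix-equivalence claim can be checked on a toy model. The sketch below builds a two-layer causal self-attention stack in NumPy (random weights, single head, residual connections; all names are illustrative, not the paper's code) and verifies that the K/V cache entries at the prefix positions are bit-identical whether the question tokens are present or not, because the causal mask prevents any position from attending forward.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model width

def layer(x, Wq, Wk, Wv):
    # One causal self-attention layer; returns output and its (K, V) cache.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    s = q @ k.T / np.sqrt(d)
    s[np.triu(np.ones_like(s), 1).astype(bool)] = -np.inf  # causal mask
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return x + w @ v, (k, v)  # residual + attention

# Two layers of random projection weights.
weights = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(2)]

def run(x):
    caches = []
    for Ws in weights:
        x, kv = layer(x, *Ws)
        caches.append(kv)
    return caches

F = rng.normal(size=(5, d))      # "fact" prefix embeddings
q_tok = rng.normal(size=(2, d))  # question embeddings

cache_alone = run(F)                         # forward pass on F only
cache_joint = run(np.vstack([F, q_tok]))     # joint pass on F + q

# Prefix cache is identical in every layer: q never influenced it.
for (ka, va), (kj, vj) in zip(cache_alone, cache_joint):
    assert np.allclose(ka, kj[:5]) and np.allclose(va, vj[:5])
```

This is why a pack can be computed once and reused: the question contributes nothing to the cached prefix entries, only to positions after them.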

Key Points

  • Knowledge Packs deliver knowledge at zero token cost using pre-computed KV caches.
  • The method leverages the causal mask of causal transformers for efficient knowledge delivery.
  • The KV interface enables behavioral steering, a capability that plain-text RAG cannot provide.

Merits

Strength in Efficiency

Knowledge Packs cut prompt tokens by up to 95% while producing answers identical to those obtained by sending the text itself: zero divergences across 700 questions.
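The savings figure is simple arithmetic: the cached prefix no longer counts against the prompt, so only the question's tokens are sent. A sketch with hypothetical sizes (the 1900/100 split is illustrative, not from the paper):

```python
# Hypothetical prompt sizes: a 1900-token fact passage and a 100-token question.
prefix_tokens, question_tokens = 1900, 100

# Fraction of the prompt served from the pre-computed cache instead of resent.
savings = prefix_tokens / (prefix_tokens + question_tokens)
print(f"{savings:.0%} of prompt tokens are served from the cache")  # 95%
```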

Scalability

The method does not require training or weight modification, making it easily scalable for large models.

Flexibility

The KV interface enables behavioral steering via contrastive deltas on cached values, allowing fine-grained control over model behavior without touching the weights.
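The steering mechanism described in the abstract can be sketched as cache arithmetic: take the difference between values cached from a "positive" and a "negative" prompt, and add a scaled copy of it to the pack's mid-layer values only, leaving the keys (which carry RoPE's position rotations) untouched. Below is a minimal NumPy illustration with random arrays standing in for real per-layer value caches; shapes, the 33-66% band, and alpha<=0.7 follow the paper, everything else is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, seq, d = 12, 6, 8

# Hypothetical stand-ins for per-layer cached values of three prompts:
# the knowledge pack plus a contrastive positive/negative pair.
v_pack = rng.normal(size=(n_layers, seq, d))
v_pos = rng.normal(size=(n_layers, seq, d))
v_neg = rng.normal(size=(n_layers, seq, d))

alpha = 0.7  # paper reports both channels coexist at alpha <= 0.7
delta = (v_pos - v_neg).mean(axis=1, keepdims=True)  # one direction per layer

lo, hi = n_layers // 3, 2 * n_layers // 3  # mid-layer band (33-66%)
v_steered = v_pack.copy()
v_steered[lo:hi] += alpha * delta[lo:hi]
# Keys stay untouched: RoPE's position rotations live in K, not V,
# which is why value arithmetic nudges behavior while key arithmetic breaks it.
```

Because independent steering directions are reported as nearly orthogonal, several such deltas could be summed into the same band and composed.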

Demerits

Sensitivity to Formatting

Incorrect chat template formatting degrades accuracy by 6-7 percentage points, so packs must be built with exactly the template the target model expects.

Staleness and Model Specificity

Since the method involves no training, overfitting in the usual sense does not apply; the real risk is that a pre-computed KV cache is tied to one model's weights and to the packed text. It must be rebuilt after any model update, and queries falling outside the packed corpus gain nothing from it.

Expert Commentary

The article presents a compelling case for Knowledge Packs as an efficient alternative to resending retrieved text, with value-space steering as a capability RAG cannot match. However, the sensitivity to chat template formatting and the static, model-specific nature of the caches require careful attention in deployment. The method's lack of training requirements and its composable steering directions are significant strengths, with clear practical implications for serving long, repeatedly used contexts. Further research is needed to fully explore the potential of Knowledge Packs and address the limitations identified in this study.

Recommendations

  • Further investigation into formatting sensitivity and cache staleness is necessary to fully understand the method's limitations.
  • Tooling that applies each model's chat template automatically when building packs would mitigate the main failure mode identified here.

Sources

Original: arXiv - cs.CL