Fast KV Compaction via Attention Matching

Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim

arXiv:2602.16284v1 Announce Type: new Abstract: Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
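One way to read the abstract's core idea is that once compact keys are fixed, the compact values that best reproduce the full-cache attention outputs can be found in closed form as a least-squares problem. The toy sketch below illustrates that reading only; it is not the paper's method. The choice of compact keys (a strided subset of the full keys), the probe queries, and the single-head setup are all illustrative assumptions, and the paper's per-KV-head formulation and attention-mass preservation term are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 16, 256, 32  # head dim, full cache length, compact cache length

Q = rng.standard_normal((64, d))   # probe queries (assumed available)
K = rng.standard_normal((n, d))    # full keys
V = rng.standard_normal((n, d))    # full values

def attn(Q, K, V):
    """Single-head scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

O = attn(Q, K, V)  # target: full-cache attention outputs

# Hypothetical compact keys: a strided subset of the full keys.
K_c = K[:: n // m][:m]

# With K_c fixed, the attention weights over the compact cache are
# determined, so the compact values minimizing the output error solve
# an ordinary least-squares problem: min_Vc || w_c @ Vc - O ||.
scores_c = Q @ K_c.T / np.sqrt(d)
w_c = np.exp(scores_c - scores_c.max(axis=1, keepdims=True))
w_c /= w_c.sum(axis=1, keepdims=True)
V_c, *_ = np.linalg.lstsq(w_c, O, rcond=None)

O_c = attn(Q, K_c, V_c)
err = np.linalg.norm(O - O_c) / np.linalg.norm(O)
print(f"relative output error at {n // m}x compaction: {err:.3f}")
```

The sketch shows why decomposing the problem helps: holding the compact keys fixed turns the value subproblem into a convex fit with an exact solution, so no end-to-end gradient optimization is needed for that piece.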

Executive Summary

The article proposes Attention Matching, an approach for fast context compaction in latent space that constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. The method significantly pushes the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss. Unlike token-space summarization, which can be highly lossy and harm downstream performance, the approach compacts the KV cache directly in latent space; and unlike the end-to-end optimization used by Cartridges, it is fast, because the formulation decomposes into simple subproblems, some of which admit efficient closed-form solutions.

Key Points

  • Attention Matching for fast context compaction in latent space
  • Compact keys and values to reproduce attention outputs and preserve attention mass
  • Up to 50x compaction in seconds on some datasets with little quality loss

Merits

Efficient Compaction

The proposed method achieves significant compaction in a short amount of time, making it a promising solution for deploying language models in real-world applications.

Demerits

Limited Generalizability

The approach may not generalize well to all types of language models or datasets, and further research is needed to explore its applicability in different contexts.

Expert Commentary

The proposed Attention Matching approach is a meaningful advance in efficient long-context inference: it enables fast compaction of key-value caches in latent space, and its ability to preserve attention mass and reproduce attention outputs makes it a promising route to scaling language models to long contexts without the quality loss of token-space summarization. Its practical impact will depend on how well the method generalizes across model families, attention variants, and datasets, which remains to be established through further research.

Recommendations

  • Further research on the generalizability and applicability of the Attention Matching approach
  • Exploration of the method's potential applications in areas such as question answering and text summarization
