Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
arXiv:2603.04427v1 — Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = d_{\text{model}}$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights (selection), while values carry rich semantic representations (value transfer). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $O(\log N)$ dimensions to distinguish among $N$ relevant patterns. We validate this hypothesis across seven experiments: (1) positional selection tasks requiring just 1 dimension per head, (2) content-based retrieval requiring roughly $\log_2 N$ dimensions, (3-4) WikiText-2 and WikiText-103 language modeling where $d_{\text{select}} = d_{\text{model}}/4$ incurs only a 4.3% perplexity increase while reducing QK parameters by 75%, (5) post-training SVD compression of GPT-2, revealing keys to be far more compressible than queries, with lightweight QK fine-tuning recovering nearly all quality loss, (6) a 125M-parameter LLaMA model confirming identical degradation ratios across architectures, and (7) Mistral-7B (7.2B parameters), where SVD compression followed by QK fine-tuning achieves 75% key cache savings at just 2.0% residual quality cost. For existing models, SVD compression followed by QK fine-tuning (3 epochs on a small fraction of pretraining data) achieves 75% key cache savings at <2% residual quality cost. For a 7B-parameter model serving 128K context, asymmetric attention saves 25 GB of KV cache per user, enabling approximately 60% more concurrent users on the same GPU.
Executive Summary
This article proposes an approach to shrinking the key-value (KV) cache of transformer-based models by exploiting an asymmetry between the roles of the query (Q), key (K), and value (V) components. The authors argue that Q and K perform selection, which requires only about O(log N) dimensions, while V carries the rich semantic representations that are actually transferred between positions. This insight is validated across seven experiments, demonstrating large reductions in KV cache size at minimal quality loss. The approach has practical implications for large-scale language models, enabling up to 75% key cache savings at less than 2% residual quality cost and supporting roughly 60% more concurrent users on the same GPU. A minimal sketch of the resulting asymmetric attention head follows.
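To make the asymmetry concrete, here is a minimal PyTorch-style sketch of a single attention head in which queries and keys are projected to a narrow selection width while values keep the full model width. The class and parameter names (AsymmetricAttentionHead, d_select) are illustrative choices, not taken from the paper's code.

```python
import math
import torch
import torch.nn as nn


class AsymmetricAttentionHead(nn.Module):
    """One attention head with thin Q/K (selection) and full-width V (value transfer)."""

    def __init__(self, d_model: int, d_select: int):
        super().__init__()
        # Selection path: queries and keys live in a low-dimensional space.
        self.w_q = nn.Linear(d_model, d_select, bias=False)
        self.w_k = nn.Linear(d_model, d_select, bias=False)
        # Value path: values keep the full model width.
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_select)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = self.w_q(x)   # (batch, seq, d_select)
        k = self.w_k(x)   # (batch, seq, d_select) -- this is what would sit in the key cache
        v = self.w_v(x)   # (batch, seq, d_model)
        scores = q @ k.transpose(-2, -1) * self.scale   # (batch, seq, seq)
        attn = scores.softmax(dim=-1)
        return attn @ v   # (batch, seq, d_model)


# Example: d_select = d_model / 4 cuts QK projection parameters by 75%.
head = AsymmetricAttentionHead(d_model=512, d_select=128)
out = head(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

With d_select = d_model/4, the cached key per token is four times smaller than in a symmetric head, which is where the key cache savings in the abstract come from.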
Key Points
- ▸ The proposed approach exploits the asymmetry between Q, K, and V components in transformer models.
- ▸ Q and K perform selection and require only O(log N) dimensions, while V carries rich semantic representations.
- ▸ The approach achieves significant reductions in KV cache while incurring minimal quality loss.
Merits
Strength in Mathematical Foundation
The article is grounded in a solid mathematical foundation, leveraging insights from information theory and attention mechanisms to derive the proposed approach.
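As a hedged illustration of why selection can be logarithmic (a toy construction of my own, not an argument reproduced from the paper), assign each of N items a distinct binary sign code; a query that copies the target's code then wins the attention argmax by a fixed margin:

```latex
% Toy construction: d = \lceil \log_2 N \rceil dimensions suffice for exact selection.
% Give each of the N items a distinct sign code c_i \in \{-1,+1\}^d, possible since 2^d \ge N.
\begin{align*}
  q &= c_t \quad \text{(query targeting item } t\text{)} \\
  q \cdot c_t &= d, \\
  q \cdot c_j &\le d - 2 \quad \text{for all } j \ne t,
\end{align*}
% because c_j differs from c_t in at least one coordinate. The target therefore wins the
% argmax with a fixed margin, using only O(\log N) selection dimensions for N patterns.
```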
Experimental Validation
The authors provide comprehensive experimental validation across seven tasks, demonstrating the efficacy of the proposed approach in reducing KV cache without compromising model quality.
Scalability and Practicality
The approach is scalable and practical, enabling up to 75% key cache savings at less than 2% residual quality cost, making it a promising solution for large-scale language models.
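A back-of-the-envelope check, assuming a typical 7B configuration of 32 layers, model width 4096, and fp16 caching (these configuration numbers are assumptions, not values stated in the abstract), lands close to the quoted 25 GB figure:

```python
# Key-cache arithmetic for a hypothetical 7B model serving a 128K-token context.
layers = 32            # assumed layer count
d_model = 4096         # assumed width at which standard attention caches keys
context = 128 * 1024   # tokens
bytes_per = 2          # fp16

key_cache_gb = layers * d_model * context * bytes_per / 1024**3
saved_gb = 0.75 * key_cache_gb   # thin keys at d_model/4 drop 75% of the key cache

print(f"full key cache: {key_cache_gb:.0f} GB, saved: {saved_gb:.0f} GB per user")
# -> full key cache: 32 GB, saved: 24 GB per user (close to the ~25 GB quoted in the abstract)
```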
Demerits
Potential Overfitting
The authors' approach relies on fine-tuning QK parameters, which may lead to overfitting, particularly when the model is small and the training dataset is limited.
Dependence on Model Architecture
The approach may not generalize to all model architectures, particularly those that deviate significantly from the standard transformer design.
Computational Overhead
While the approach reduces KV cache, it may introduce additional computational overhead due to the need for QK fine-tuning and SVD compression.
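For intuition about that overhead, here is a rough NumPy sketch of one way the post-training step can be realized: a truncated SVD of a head's key projection yields a thin cached key, with the remaining factors absorbed into the query projection. This is my own illustration under that interpretation; the paper's exact compression and fine-tuning recipe may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, r = 768, 64, 16     # r = d_head / 4 -> cached keys shrink by 75%

# Stand-ins for one trained head's projection matrices.
W_Q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

# Truncated SVD of the key projection: W_K ~= U_r diag(S_r) V_r^T.
U, S, Vt = np.linalg.svd(W_K, full_matrices=False)
U_r, S_r, V_r = U[:, :r], S[:r], Vt[:r, :].T

X = rng.standard_normal((10, d_model))   # 10 token embeddings
k_thin = X @ U_r                         # (10, r): this is what goes into the key cache

# Absorb the remaining SVD factors into the query projection (a one-time weight rewrite).
W_Q_absorbed = W_Q @ V_r * S_r           # (d_model, r)
q_thin = X @ W_Q_absorbed                # (10, r)

scores_approx = q_thin @ k_thin.T
scores_exact = (X @ W_Q) @ (X @ W_K).T
# The gap is small only when W_K is effectively low-rank; the paper reports that trained
# key projections are, and that light QK fine-tuning recovers most of the remaining loss.
print(np.abs(scores_approx - scores_exact).max())
```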
Expert Commentary
The article presents a compelling approach to reducing the KV cache in transformer-based models by exploiting the asymmetry between the selection role of Q and K and the value-transfer role of V. While the approach is grounded in a solid mathematical foundation and validated across seven experiments, it is not without limitations. The potential for overfitting during QK fine-tuning and the dependence on model architecture are concerns that warrant further study. Nevertheless, the findings have far-reaching implications for the development of more efficient and scalable language models, making this a valuable contribution to the field.
Recommendations
- ✓ Further investigation into the potential for overfitting and its mitigation strategies is necessary to ensure the robustness of the proposed approach.
- ✓ The authors should explore the applicability of the approach to other model architectures and tasks to generalize its benefits.