Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
arXiv:2603.04427v1 — Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = d_{\text{model}}$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights (selection), while values carry rich semantic representations (value transfer). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $O(\log N)$ dimensions to distinguish among $N$ relevant patterns. We validate this hypothesis across seven experiments: (1) positional selection tasks requiring just 1 dimension per head, (2) content-based retrieval requiring roughly $\log_2 N$ dimensions, (3-4) WikiText-2 and WikiText-103 language modeling where $d_{\text{select}} = d_{\text{model}}/4$ incurs only a 4.3% perplexity increase while reducing QK parameters by 75%, (5) post-training SVD compression of GPT-2, revealing keys to be far more compressible than queries, with lightweight QK fine-tuning recovering nearly all quality loss, (6) a 125M-parameter LLaMA model confirming identical degradation ratios across architectures, and (7) Mistral-7B (7.2B parameters), where SVD compression followed by QK fine-tuning achieves 75% key cache savings at just 2.0% residual quality cost. For existing models, SVD compression followed by QK fine-tuning (3 epochs on a small fraction of pretraining data) achieves 75% key cache savings at <2% residual quality cost. For a 7B-parameter model serving 128K context, asymmetric attention saves 25 GB of KV cache per user, enabling approximately 60% more concurrent users on the same GPU.
Executive Summary
This article proposes an approach to shrinking the key-value (KV) cache of transformer-based models by exploiting an asymmetry between the roles of the query (Q), key (K), and value (V) components. The authors argue that Q and K perform selection, which requires only about O(log N) dimensions, while V carries the rich semantic representations that are actually transferred between positions. This insight is validated across seven experiments, demonstrating large reductions in KV cache size at minimal quality loss. The approach has practical implications for large-scale language models, enabling up to 75% key cache savings at less than 2% residual quality cost and supporting roughly 60% more concurrent users on the same GPU. A minimal sketch of the resulting asymmetric attention head follows.
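To make the asymmetry concrete, here is a minimal PyTorch-style sketch of a single attention head in which queries and keys are projected to a narrow selection width while values keep the full model width. The class and parameter names (AsymmetricAttentionHead, d_select) are illustrative choices, not taken from the paper's code.

```python
import math
import torch
import torch.nn as nn


class AsymmetricAttentionHead(nn.Module):
    """One attention head with thin Q/K (selection) and full-width V (value transfer)."""

    def __init__(self, d_model: int, d_select: int):
        super().__init__()
        # Selection path: queries and keys live in a low-dimensional space.
        self.w_q = nn.Linear(d_model, d_select, bias=False)
        self.w_k = nn.Linear(d_model, d_select, bias=False)
        # Value path: values keep the full model width.
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_select)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = self.w_q(x)   # (batch, seq, d_select)
        k = self.w_k(x)   # (batch, seq, d_select) -- this is what would sit in the key cache
        v = self.w_v(x)   # (batch, seq, d_model)
        scores = q @ k.transpose(-2, -1) * self.scale   # (batch, seq, seq)
        attn = scores.softmax(dim=-1)
        return attn @ v   # (batch, seq, d_model)


# Example: d_select = d_model / 4 cuts QK projection parameters by 75%.
head = AsymmetricAttentionHead(d_model=512, d_select=128)
out = head(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

With d_select = d_model/4, the cached key per token is four times smaller than in a symmetric head, which is where the key cache savings in the abstract come from.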
Key Points
- ▸ The proposed approach exploits the asymmetry between Q, K, and V components in transformer models.
- ▸ Q and K perform selection and require only O(log N) dimensions, while V carries rich semantic representations.
- ▸ The approach achieves significant reductions in KV cache while incurring minimal quality loss.
Merits
Strength in Mathematical Foundation
The article is grounded in a solid mathematical foundation, leveraging insights from information theory and attention mechanisms to derive the proposed approach.
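As a hedged illustration of why selection can be logarithmic (a toy construction of my own, not an argument reproduced from the paper), assign each of N items a distinct binary sign code; a query that copies the target's code then wins the attention argmax by a fixed margin:

```latex
% Toy construction: d = \lceil \log_2 N \rceil dimensions suffice for exact selection.
% Give each of the N items a distinct sign code c_i \in \{-1,+1\}^d, possible since 2^d \ge N.
\begin{align*}
  q &= c_t \quad \text{(query targeting item } t\text{)} \\
  q \cdot c_t &= d, \\
  q \cdot c_j &\le d - 2 \quad \text{for all } j \ne t,
\end{align*}
% because c_j differs from c_t in at least one coordinate. The target therefore wins the
% argmax with a fixed margin, using only O(\log N) selection dimensions for N patterns.
```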
Experimental Validation
The authors provide comprehensive experimental validation across seven tasks, demonstrating the efficacy of the proposed approach in reducing KV cache without compromising model quality.
Scalability and Practicality
The approach is scalable and practical, enabling up to 75% key cache savings at less than 2% residual quality cost, making it a promising solution for large-scale language models.
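A back-of-the-envelope check, assuming a typical 7B configuration of 32 layers, model width 4096, and fp16 caching (these configuration numbers are assumptions, not values stated in the abstract), lands close to the quoted 25 GB figure:

```python
# Key-cache arithmetic for a hypothetical 7B model serving a 128K-token context.
layers = 32            # assumed layer count
d_model = 4096         # assumed width at which standard attention caches keys
context = 128 * 1024   # tokens
bytes_per = 2          # fp16

key_cache_gb = layers * d_model * context * bytes_per / 1024**3
saved_gb = 0.75 * key_cache_gb   # thin keys at d_model/4 drop 75% of the key cache

print(f"full key cache: {key_cache_gb:.0f} GB, saved: {saved_gb:.0f} GB per user")
# -> full key cache: 32 GB, saved: 24 GB per user (close to the ~25 GB quoted in the abstract)
```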
Demerits
Potential Overfitting
The authors' approach relies on fine-tuning QK parameters, which may lead to overfitting, particularly when the model is small and the training dataset is limited.
Dependence on Model Architecture
The approach may not generalize to all model architectures, particularly those that deviate significantly from the standard transformer design.
Computational Overhead
While the approach reduces KV cache, it may introduce additional computational overhead due to the need for QK fine-tuning and SVD compression.
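For intuition about that overhead, here is a rough NumPy sketch of one way the post-training step can be realized: a truncated SVD of a head's key projection yields a thin cached key, with the remaining factors absorbed into the query projection. This is my own illustration under that interpretation; the paper's exact compression and fine-tuning recipe may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, r = 768, 64, 16     # r = d_head / 4 -> cached keys shrink by 75%

# Stand-ins for one trained head's projection matrices.
W_Q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

# Truncated SVD of the key projection: W_K ~= U_r diag(S_r) V_r^T.
U, S, Vt = np.linalg.svd(W_K, full_matrices=False)
U_r, S_r, V_r = U[:, :r], S[:r], Vt[:r, :].T

X = rng.standard_normal((10, d_model))   # 10 token embeddings
k_thin = X @ U_r                         # (10, r): this is what goes into the key cache

# Absorb the remaining SVD factors into the query projection (a one-time weight rewrite).
W_Q_absorbed = W_Q @ V_r * S_r           # (d_model, r)
q_thin = X @ W_Q_absorbed                # (10, r)

scores_approx = q_thin @ k_thin.T
scores_exact = (X @ W_Q) @ (X @ W_K).T
# The gap is small only when W_K is effectively low-rank; the paper reports that trained
# key projections are, and that light QK fine-tuning recovers most of the remaining loss.
print(np.abs(scores_approx - scores_exact).max())
```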
Expert Commentary
The article presents a compelling approach to reducing the KV cache in transformer-based models by exploiting the asymmetry between the selection role of Q and K and the value-transfer role of V. While the approach is grounded in a solid mathematical foundation and validated across seven experiments, it is not without limitations. The potential for overfitting during QK fine-tuning and the dependence on model architecture are concerns that warrant further study. Nevertheless, the findings have far-reaching implications for the development of more efficient and scalable language models, making this a valuable contribution to the field.
Recommendations
- ✓ Further investigation into the potential for overfitting and its mitigation strategies is necessary to ensure the robustness of the proposed approach.
- ✓ The authors should explore the applicability of the approach to other model architectures and tasks to generalize its benefits.