
Interleaved Head Attention

arXiv:2602.21371v1 Announce Type: new Abstract: Multi-Head Attention (MHA) is the core computational primitive underlying modern Large Language Models (LLMs). However, MHA suffers from a fundamental linear scaling limitation: $H$ attention heads produce exactly $H$ independent attention matrices, with no communication between heads during attention computation. This becomes problematic for multi-step reasoning, where correct answers depend on aggregating evidence from multiple parts of the context and composing latent token-to-token relations over a chain of intermediate inferences. To address this, we propose Interleaved Head Attention (IHA), which enables cross-head mixing by constructing $P$ pseudo-heads per head (typically $P=H$), where each pseudo query/key/value is a learned linear combination of all $H$ original queries, keys and values respectively. Interactions between pseudo-query and pseudo-key heads induce up to $P^2$ attention patterns per head with modest parameter overhead $\mathcal{O}(H^2P)$. We provide theory showing improved efficiency in terms of number of parameters on the synthetic Polynomial task (IHA uses $\Theta(\sqrt{k}n^2)$ parameters vs. $\Theta(kn^2)$ for MHA) and on the synthetic order-sensitive CPM-3 task (IHA uses $\lceil\sqrt{N_{\max}}\rceil$ heads vs. $N_{\max}$ for MHA). On real-world benchmarks, IHA improves Multi-Key retrieval on RULER by 10-20% (4k-16k) and, after fine-tuning for reasoning on OpenThoughts, improves GSM8K by 5.8% and MATH-500 by 2.8% (Majority Vote) over full attention.
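The cross-head mixing described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration of the general idea only: the mixing matrices `Aq`, `Ak`, `Av`, the single-layer shapes, and the choice to pair each pseudo-attention pattern with the pseudo-values of its key side are all assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def iha_pseudo_heads(Q, K, V, Aq, Ak, Av):
    """Sketch of Interleaved Head Attention's pseudo-head mixing.

    Q, K, V:      (H, T, d) per-head queries, keys, values.
    Aq, Ak, Av:   (P, H) learned mixing weights; each pseudo-head is a
                  linear combination of all H original heads.
    Returns the (P, P, T, T) attention tensor -- up to P^2 patterns --
    and the corresponding (P, P, T, d) outputs.
    """
    H, T, d = Q.shape
    # Build pseudo-queries/keys/values as linear combinations over heads.
    pQ = np.einsum('ph,htd->ptd', Aq, Q)
    pK = np.einsum('ph,htd->ptd', Ak, K)
    pV = np.einsum('ph,htd->ptd', Av, V)
    # Every pseudo-query head attends against every pseudo-key head.
    scores = np.einsum('ptd,qsd->pqts', pQ, pK) / np.sqrt(d)
    attn = softmax(scores, axis=-1)                  # (P, P, T, T)
    out = np.einsum('pqts,qsd->pqtd', attn, pV)      # (P, P, T, d)
    return attn, out
```

With `P = H`, the mixing matrices add only `(P, H)`-sized parameters per projection, while the number of distinct attention patterns grows quadratically in `P`.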

Executive Summary

This article proposes Interleaved Head Attention (IHA), a modification of Multi-Head Attention (MHA) that addresses its linear scaling limitation in Large Language Models (LLMs). IHA enables cross-head mixing by constructing P pseudo-heads per head, each a learned linear combination of the original queries, keys, and values, inducing up to P^2 attention patterns per head at modest parameter overhead. Theoretical analysis shows improved parameter efficiency on two synthetic tasks, and empirical results show gains on long-context retrieval (RULER) and, after reasoning fine-tuning, on GSM8K and MATH-500.

Key Points

  • IHA addresses the linear scaling limitation of MHA by enabling cross-head mixing through pseudo-heads.
  • IHA induces up to P^2 attention patterns per head with modest parameter overhead of O(H^2 P).
  • Theoretical analysis shows improved efficiency on synthetic tasks.
  • Empirical results demonstrate significant improvements on real-world benchmarks.

Merits

Addressing a fundamental limitation of MHA

IHA provides a novel solution to the linear scaling limitation of MHA, enabling more efficient and effective attention mechanisms in LLMs.

Improved efficiency on synthetic tasks

Theoretical analysis demonstrates that IHA is more parameter-efficient on synthetic tasks: on the Polynomial task it needs Θ(√k n²) parameters versus Θ(k n²) for MHA, and on the order-sensitive CPM-3 task it needs ⌈√N_max⌉ heads versus N_max for MHA.
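As a back-of-the-envelope illustration of these bounds (the values of k, n, and N_max below are arbitrary stand-ins, not figures from the paper):

```python
import math

# Assumed task sizes, chosen only to make the bounds concrete.
k, n, N_max = 64, 128, 100

# Polynomial task: parameters needed, up to constant factors.
mha_poly = k * n * n                     # Theta(k n^2) for MHA
iha_poly = math.isqrt(k) * n * n         # Theta(sqrt(k) n^2) for IHA

# CPM-3 task: number of heads needed.
mha_heads = N_max                        # N_max heads for MHA
iha_heads = math.ceil(math.sqrt(N_max))  # ceil(sqrt(N_max)) heads for IHA
```

For these assumed sizes the gap is 8x in parameters on the Polynomial task and 10x in head count on CPM-3; both gaps widen as k and N_max grow.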

Empirical results demonstrate real-world benefits

Empirical results show meaningful improvements on real-world benchmarks: 10-20% gains on Multi-Key retrieval in RULER at 4k-16k context lengths, and, after fine-tuning for reasoning on OpenThoughts, gains of 5.8% on GSM8K and 2.8% on MATH-500 (Majority Vote) over full attention.

Demerits

Parameter overhead

IHA requires a modest parameter overhead of O(H^2 P), which may become a concern at very large head counts or in resource-constrained deployments.
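To get a rough sense of scale for this overhead, one can compare it against the standard attention projections. All figures below are hypothetical (head count, pseudo-head count, and model width are assumptions, not values from the paper), and the overhead formula simply instantiates the stated O(H^2 P) bound with a constant of 3 for the Q/K/V mixers:

```python
# Assumed configuration, roughly typical of a mid-sized transformer.
H, P = 32, 32       # heads, and pseudo-heads per head (paper suggests P = H)
d_model = 4096      # model width

# O(H^2 P) mixing overhead: one (P x H) mixer per head for Q, K, and V.
mixing_params = 3 * H * H * P

# Standard MHA baseline: Q, K, V, and output projections.
projection_params = 4 * d_model ** 2

ratio = mixing_params / projection_params
```

Under these assumptions the mixing matrices add on the order of 10^5 parameters against roughly 6.7 * 10^7 for the projections, i.e. well under 1% per layer, which supports the abstract's characterization of the overhead as modest.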

Limited experimental comparison

The study primarily focuses on IHA's benefits, with limited comparison to other attention mechanisms or experiments evaluating IHA's robustness.

Expert Commentary

IHA is a promising approach to the linear scaling limitation of MHA, and its benefits are supported by both the theoretical analysis and the empirical results. However, further research is needed to evaluate IHA's robustness and its scalability to larger models, and the O(H^2 P) parameter overhead deserves scrutiny at large head counts or under tight resource budgets. Nevertheless, IHA's contribution toward more efficient and effective attention mechanisms is significant, particularly for long-context retrieval and multi-step reasoning workloads.

Recommendations

  • Future research should focus on evaluating IHA's robustness and scalability, as well as comparing it to other attention mechanisms.
  • Developers and practitioners should consider the parameter overhead of IHA when deploying it in real-world applications.
