A Residual-Aware Theory of Position Bias in Transformers
arXiv:2602.16837v1 Announce Type: new Abstract: Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout. By incorporating residual connections, we show that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
Executive Summary
A Residual-Aware Theory of Position Bias in Transformers presents a novel explanation for the position bias phenomenon in transformer models. The authors propose a residual-aware theory of cumulative attention rollout that accounts for how residual connections prevent attention collapse under realistic conditions, resolving the discrepancy between prior infinite-depth analyses (which predict collapse onto the first token) and observed behavior. At finite depth, the theory predicts a U-shaped position bias, with attention concentrating on early and late tokens, providing a principled architectural explanation for the Lost-in-the-Middle phenomenon. This work contributes to the understanding of transformer models and has implications for their practical application in natural language processing tasks.
Key Points
- ▸ Transformer models exhibit position bias, favoring certain token positions.
- ▸ Residual connections prevent attention collapse under realistic conditions.
- ▸ Causal transformers induce a U-shaped position bias at finite depth.
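The rollout mechanics summarized above can be illustrated with a toy simulation. The sketch below uses uniform causal attention and a rollout in which each layer's effective mixing map is a convex combination of the identity (modeling the residual branch) and the attention matrix. The mixing weight `alpha`, the uniform attention pattern, and the chosen depths are illustrative assumptions, not the paper's actual model or parameters.

```python
import numpy as np

def causal_uniform_attention(n):
    """Uniform causal attention: token i attends equally to positions 0..i."""
    A = np.tril(np.ones((n, n)))
    return A / A.sum(axis=1, keepdims=True)

def rollout(n, depth, alpha=0.5):
    """Cumulative attention rollout with a residual term.

    Each layer's effective mixing map is alpha*I + (1 - alpha)*A, where the
    identity models the residual branch. alpha is an illustrative mixing
    weight, not a value taken from the paper.
    """
    A = causal_uniform_attention(n)
    layer = alpha * np.eye(n) + (1.0 - alpha) * A
    R = np.eye(n)
    for _ in range(depth):
        R = layer @ R  # compose layers: cumulative rollout
    return R

# Without the residual term (alpha=0), deep rollout collapses onto token 0,
# matching the prior infinite-depth prediction.
collapsed = rollout(n=16, depth=64, alpha=0.0)
print(collapsed[-1, 0])   # close to 1.0: nearly all mass on the first token

# With the residual term at modest depth, the last token's rollout row is
# U-shaped: high at both ends, low in the middle.
R = rollout(n=16, depth=4, alpha=0.5)
print(R[-1])
```

In this toy setting the residual (identity) term keeps probability mass on each token's own position, so at finite depth the last row retains mass near the sequence end while the absorbing first position accumulates mass from every layer, producing the two arms of the U.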
Merits
Strength
Provides a principled architectural explanation for the Lost-in-the-Middle phenomenon, which has significant implications for transformer model design and application.
Strength
Introduces a residual-aware theory of cumulative attention rollout, which offers a comprehensive understanding of the role of residual connections in transformer models.
Demerits
Limitation
The theory builds on the attention-rollout abstraction, a simplified model of how attention composes across layers that may not fully capture the behavior of trained real-world transformer architectures.
Limitation
The abstract reports no empirical evidence for the theoretical predictions, leaving the theory's applicability and generalizability to trained models untested.
Expert Commentary
The article presents a significant contribution to the understanding of transformer models, shedding light on the architectural origins of position bias. The residual-aware theory of cumulative attention rollout provides a comprehensive explanation for the phenomenon, and its implications for transformer model design are substantial. However, the work's reliance on simplified assumptions and the lack of empirical evidence to support theoretical predictions may limit its applicability and generalizability. Nevertheless, this research has the potential to shape the future of transformer model development and application in NLP tasks.
Recommendations
- ✓ Future research should focus on empirical validation of the residual-aware theory, using real-world transformer model architectures and datasets.
- ✓ Developers and researchers should consider the implications of position bias for transformer model design and application, adjusting architectures and training procedures as necessary to mitigate the phenomenon.