Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias
arXiv:2603.10123v1 Announce Type: new Abstract: The ``Lost in the Middle'' phenomenon -- a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle -- is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: \emph{the U-shape is already present at initialization, before any training or positional encoding takes effect.} It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Ces\`{a}ro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated $\mathcal{O}(1)$ anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order $\mathcal{O}(1/(H{-}1)!)$, where $H$ is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step~0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.
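The abstract's Cesàro-matrix model can be sketched numerically. The toy code below is an illustrative reconstruction, not the paper's implementation; the 0.5/0.5 residual mixing and the depth H = 4 are assumptions. It builds the row-stochastic lower-triangular Cesàro matrix (uniform causal attention), averages it with the identity to model a residual branch, and raises the result to the H-th power. The last row then gives each token position's influence on the final output, and it already shows the U-shape: elevated influence at the first token, an isolated spike at the last, and a valley in between.

```python
import numpy as np

def cesaro(n: int) -> np.ndarray:
    """Row-stochastic lower-triangular matrix: row i attends uniformly to tokens 0..i."""
    return np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]

n, H = 128, 4                    # context length and depth (illustrative values)
C = cesaro(n)
M = 0.5 * (np.eye(n) + C)        # residual mixing: half skip connection, half attention
influence = np.linalg.matrix_power(M, H)[-1]  # influence of each token on the final position

first, mid, last = influence[0], influence[n // 2], influence[-1]
print(f"first={first:.4f}  middle={mid:.5f}  last={last:.4f}")
```

In this toy parameterization the last-token spike traces back to the pure identity (skip) path, which contributes a term of order (1/2)^H at the final position, mirroring the abstract's Recency Delta.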
Executive Summary
This article presents a theory that the 'Lost in the Middle' phenomenon in large language models (LLMs) is an inherent property of the causal decoder with residual connections, present from initialization. The authors model multi-layer causal attention as iterated powers of the Cesàro matrix and derive an exact closed-form influence density in the continuous limit. The U-shaped performance curve follows from a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail) and an isolated O(1) anchor at the final token (the Recency Delta), with a factorial dead zone of order O(1/(H-1)!) between them, where H is the network depth. This challenges the prevailing view that the phenomenon is caused by learned Softmax artifacts or positional encodings such as RoPE. Experiments comparing initialized and pretrained networks show that the U-shape is present at Step 0 and persists as an architectural baseline under standard pretraining objectives, giving future efforts to overcome the bias a precise target.
Key Points
- ▸ The 'Lost in the Middle' phenomenon is an inherent property of the causal decoder with residual connections, present from initialization.
- ▸ The U-shaped performance curve arises from a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail) and an isolated O(1) anchor at the final token (the Recency Delta), separated by a factorial dead zone in the middle.
- ▸ The theory challenges the prevailing view that the phenomenon is caused by learned Softmax artifacts or positional encoding.
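The factorial dead-zone claim implies that the valley should deepen rapidly with depth H. In the same toy setting (a uniform-attention Cesàro matrix averaged with the identity, an assumed simplification rather than the paper's exact model), the middle token's influence on the final position, measured relative to the first token, shrinks quickly as H grows:

```python
import numpy as np

# Toy model: uniform causal attention (Cesaro matrix) mixed with a skip connection.
n = 128
C = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
M = 0.5 * (np.eye(n) + C)

# Middle-token influence on the final position, relative to the first token, vs. depth H.
ratios = []
for H in (2, 4, 8):
    inf = np.linalg.matrix_power(M, H)[-1]
    ratios.append(inf[n // 2] / inf[0])
print([f"{r:.4f}" for r in ratios])
```

The ratio drops by a growing factor at each doubling of depth, consistent with a middle-context dead zone that deepens much faster than the edges decay.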
Merits
Strength
Provides a precise and mathematically rigorous explanation for the 'Lost in the Middle' phenomenon, challenging the prevailing view and offering a new direction for research.
Empirical validity
Empirically validates the theory: untrained Qwen2 and GPT-2 architectures exhibit the U-shape at Step 0, identically with or without RoPE, and a comparison with pretrained networks shows that the U-shape persists as an architectural baseline under standard pretraining objectives.
Demerits
Limitation
The theory may be complex and difficult to understand for non-experts in the field, potentially limiting its accessibility and adoption.
Assumptions
The theory assumes a specific type of network architecture (causal decoder with residual connections) and may not generalize to other architectures.
Expert Commentary
This article makes a significant contribution to natural language processing and deep learning by giving the 'Lost in the Middle' phenomenon a precise, mathematically rigorous origin: the geometry of the causal decoder itself, rather than learned Softmax artifacts or positional encodings. The Step-0 experiments and the comparison of initialized and pretrained networks support the claim that the U-shape is an architectural baseline that standard pretraining does not overcome. This has direct implications for the design of pretraining objectives and training methods that could counteract the bias, potentially improving long-context retrieval. Two caveats remain: the mathematical machinery may limit accessibility for non-experts, and the analysis is specific to causal decoders with residual connections, so its conclusions may not transfer to other architectures.
Recommendations
- ✓ Recommendation 1: LLM developers should account for this initialization-time position bias when designing models and evaluating long-context retrieval, rather than attributing the U-shape solely to positional encodings.
- ✓ Recommendation 2: Future research should develop pretraining objectives and training methods that explicitly counteract the structural bias of the causal decoder with residual connections.