Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias
arXiv:2603.10123v1 Announce Type: new Abstract: The ``Lost in the Middle'' phenomenon -- a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle -- is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: \emph{the U-shape is already present at initialization, before any training or positional encoding takes effect.} It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Ces\`{a}ro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated $\mathcal{O}(1)$ anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order $\mathcal{O}(1/(H{-}1)!)$, where $H$ is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step~0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.
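The abstract's Cesàro-matrix model can be sketched numerically. The toy code below is an illustrative reconstruction, not the paper's implementation; the 0.5/0.5 residual mixing and the depth H = 4 are assumptions. It builds the row-stochastic lower-triangular Cesàro matrix (uniform causal attention), averages it with the identity to model a residual branch, and raises the result to the H-th power. The last row then gives each token position's influence on the final output, and it already shows the U-shape: elevated influence at the first token, an isolated spike at the last, and a valley in between.

```python
import numpy as np

def cesaro(n: int) -> np.ndarray:
    """Row-stochastic lower-triangular matrix: row i attends uniformly to tokens 0..i."""
    return np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]

n, H = 128, 4                    # context length and depth (illustrative values)
C = cesaro(n)
M = 0.5 * (np.eye(n) + C)        # residual mixing: half skip connection, half attention
influence = np.linalg.matrix_power(M, H)[-1]  # influence of each token on the final position

first, mid, last = influence[0], influence[n // 2], influence[-1]
print(f"first={first:.4f}  middle={mid:.5f}  last={last:.4f}")
```

In this toy parameterization the last-token spike traces back to the pure identity (skip) path, which contributes a term of order (1/2)^H at the final position, mirroring the abstract's Recency Delta.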
Executive Summary
This article presents a theory that the 'Lost in the Middle' phenomenon in large language models (LLMs) is an inherent property of the causal decoder with residual connections, present from initialization. The authors model multi-layer causal attention as iterated powers of the Cesàro matrix and derive an exact closed-form influence density in the continuous limit. The U-shaped performance curve follows from a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail) and an isolated O(1) anchor at the final token (the Recency Delta), with a factorial dead zone of order O(1/(H-1)!) between them, where H is the network depth. This challenges the prevailing view that the phenomenon is caused by learned Softmax artifacts or positional encodings such as RoPE. Experiments comparing initialized and pretrained networks show that the U-shape is present at Step 0 and persists as an architectural baseline under standard pretraining objectives, giving future efforts to overcome the bias a precise target.
Key Points
- ▸ The 'Lost in the Middle' phenomenon is an inherent property of the causal decoder with residual connections, present from initialization.
- ▸ The U-shaped performance curve arises from a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail) and an isolated O(1) anchor at the final token (the Recency Delta), separated by a factorial dead zone in the middle.
- ▸ The theory challenges the prevailing view that the phenomenon is caused by learned Softmax artifacts or positional encoding.
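The factorial dead-zone claim implies that the valley should deepen rapidly with depth H. In the same toy setting (a uniform-attention Cesàro matrix averaged with the identity, an assumed simplification rather than the paper's exact model), the middle token's influence on the final position, measured relative to the first token, shrinks quickly as H grows:

```python
import numpy as np

# Toy model: uniform causal attention (Cesaro matrix) mixed with a skip connection.
n = 128
C = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
M = 0.5 * (np.eye(n) + C)

# Middle-token influence on the final position, relative to the first token, vs. depth H.
ratios = []
for H in (2, 4, 8):
    inf = np.linalg.matrix_power(M, H)[-1]
    ratios.append(inf[n // 2] / inf[0])
print([f"{r:.4f}" for r in ratios])
```

The ratio drops by a growing factor at each doubling of depth, consistent with a middle-context dead zone that deepens much faster than the edges decay.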
Merits
Strength
Provides a precise and mathematically rigorous explanation for the 'Lost in the Middle' phenomenon, challenging the prevailing view and offering a new direction for research.
Empirical validity
Empirically validates the theory: untrained Qwen2 and GPT-2 architectures exhibit the U-shape at Step 0, identically with or without RoPE, and a comparison with pretrained networks shows that the U-shape persists as an architectural baseline under standard pretraining objectives.
Demerits
Limitation
The theory may be complex and difficult to understand for non-experts in the field, potentially limiting its accessibility and adoption.
Assumptions
The theory assumes a specific type of network architecture (causal decoder with residual connections) and may not generalize to other architectures.
Expert Commentary
This article makes a significant contribution to natural language processing and deep learning by giving the 'Lost in the Middle' phenomenon a precise, mathematically rigorous origin: the geometry of the causal decoder itself, rather than learned Softmax artifacts or positional encodings. The Step-0 experiments and the comparison of initialized and pretrained networks support the claim that the U-shape is an architectural baseline that standard pretraining does not overcome. This has direct implications for the design of pretraining objectives and training methods that could counteract the bias, potentially improving long-context retrieval. Two caveats remain: the mathematical machinery may limit accessibility for non-experts, and the analysis is specific to causal decoders with residual connections, so its conclusions may not transfer to other architectures.
Recommendations
- ✓ Recommendation 1: LLM developers should account for this initialization-time position bias when designing models and evaluating long-context retrieval, rather than attributing the U-shape solely to positional encodings.
- ✓ Recommendation 2: Future research should develop pretraining objectives and training methods that explicitly counteract the structural bias of the causal decoder with residual connections.