Bayesian Optimality of In-Context Learning with Selective State Spaces
arXiv:2602.17744v1 Announce Type: cross Abstract: We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent, constructing tasks with temporally correlated noise where the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be seen as performing implicit ERM, this demonstrates selective SSMs achieve lower asymptotic risk due to superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers. This reframes ICL from "implicit optimization" to "optimal inference," explaining the efficiency of selective SSMs and offering a principled basis for architecture design.
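For LG-SSM tasks, the Bayes-optimal sequential predictor the abstract refers to is the posterior predictive mean computed by the Kalman filter. As an illustration of that baseline (this is standard Kalman filtering, not the paper's code; all variable names here are illustrative), the one-step-ahead predictor can be sketched as:

```python
import numpy as np

def kalman_predictive_means(ys, A, C, Q, R, m0, P0):
    """One-step-ahead posterior predictive means E[y_{t+1} | y_{1:t}]
    for the LG-SSM  x_{t+1} = A x_t + w_t,  y_t = C x_t + v_t,
    with w_t ~ N(0, Q) and v_t ~ N(0, R).  Under squared loss this
    is the Bayes-optimal sequential predictor."""
    m, P = m0, P0
    preds = []
    for y in ys:
        # Measurement update: condition the state belief on y_t.
        S = C @ P @ C.T + R               # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)    # Kalman gain
        m = m + K @ (y - C @ m)
        P = P - K @ C @ P
        # Time update: propagate the belief to step t+1.
        m = A @ m
        P = A @ P @ A.T + Q
        preds.append(C @ m)               # predictive mean for y_{t+1}
    return np.array(preds)
```

The paper's claim is that a meta-trained selective SSM asymptotically implements this recursion implicitly, without being given A, C, Q, or R.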
Executive Summary
This article presents a novel approach to understanding in-context learning (ICL) by framing it as Bayesian optimal sequential prediction. Rather than viewing Transformers as implicit gradient-descent learners, the authors formalize ICL as meta-learning over latent sequence tasks and prove that a meta-trained selective State Space Model (SSM) asymptotically implements the Bayes-optimal predictor. They further construct tasks with temporally correlated noise on which the Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator, implying that selective SSMs achieve lower asymptotic risk than Transformers through superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark support these claims. Overall, the work reframes ICL from "implicit optimization" to "optimal inference" and offers a principled basis for architecture design.
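To make the "selective" mechanism concrete: in selective SSMs (e.g., Mamba-style architectures), the recurrence parameters depend on the current input, letting the model decide per token how much of the latent state to retain versus overwrite. A minimal sketch, assuming a diagonal transition and a softplus-parameterized step size (all weight names here are hypothetical, not the paper's notation):

```python
import numpy as np

def selective_ssm_scan(xs, A, W_delta, W_B, W_C):
    """Minimal selective-SSM recurrence.  A is the (negative) diagonal
    of the transition matrix, stored as a vector.  The step size delta
    is input-dependent, so the effective decay exp(delta * A) is gated
    by the current input -- the 'selection' mechanism."""
    h = np.zeros(A.shape[0])
    out = []
    for x in xs:
        delta = np.log1p(np.exp(W_delta @ x))  # softplus: positive step size
        A_bar = np.exp(delta * A)              # input-dependent decay in (0, 1)
        h = A_bar * h + delta * (W_B @ x)      # selective state update
        out.append(W_C @ h)                    # linear readout
    return np.array(out)
```

The input-dependent gating is what distinguishes this from a time-invariant linear SSM, and it is the ingredient the paper credits for tracking latent states more robustly than linear Transformers.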
Key Points
- ▸ The authors propose Bayesian optimal sequential prediction as a new principle for understanding ICL.
- ▸ Selective SSMs are shown to achieve lower asymptotic risk due to superior statistical efficiency.
- ▸ The results demonstrate a statistical separation from gradient descent methods.
Merits
Strength in theoretical foundation
The article provides a rigorous theoretical framework for understanding ICL, which is a significant contribution to the field.
Empirical evidence of superiority
The experiments demonstrate the effectiveness of selective SSMs in various tasks, providing empirical evidence of their superiority over gradient descent methods.
Demerits
Limited scope of experiments
The article primarily focuses on synthetic LG-SSM tasks and a character-level Markov benchmark, which may limit the generalizability of the findings.
Complexity of selective SSMs
The selective SSMs analyzed in the article may be more complex to implement and more computationally expensive than simpler baselines such as linear Transformers trained by standard gradient-based ERM, which could limit their practical applicability.
Expert Commentary
The article offers a novel and rigorous lens on ICL, and the proof that meta-trained selective SSMs converge to the Bayes-optimal predictor is a genuine contribution. The empirical results back the theory: selective SSMs converge faster to Bayes-optimal risk, show better sample efficiency under structured noise, and track latent states more robustly than linear Transformers. That said, the limitations noted above deserve attention in future work: the benchmarks are synthetic and narrow, and the practical cost of selectivity is not fully characterized. Overall, the reframing of ICL from "implicit optimization" to "optimal inference" is a valuable perspective with real implications for the design of more efficient and effective ICL models.
Recommendations
- ✓ Future research should focus on expanding the scope of experiments to include more real-world tasks and datasets.
- ✓ The authors should investigate methods to simplify the complexity of selective SSMs and make them more computationally efficient.