When to Think Fast and Slow? AMOR: Entropy-Based Metacognitive Gate for Dynamic SSM-Attention Switching
arXiv:2602.13215v1 Announce Type: new Abstract: Transformers allocate uniform computation to every position, regardless of difficulty. State Space Models (SSMs) offer efficient alternatives but struggle with precise information retrieval over a long horizon. Inspired by dual-process theories of cognition (Kahneman, 2011), we propose AMOR (Adaptive Metacognitive Output Router), a hybrid architecture that dynamically engages sparse attention only when an SSM backbone is "uncertain"--as measured by prediction entropy. Compared to standard transformers, AMOR gains efficiency by projecting keys and values from SSM hidden states (Ghost KV), reusing the SSM's O(n) computation rather than requiring O(n^2) attention at every layer. On small-scale synthetic retrieval tasks, AMOR outperforms both SSM-only and transformer-only baselines, achieving perfect retrieval accuracy while engaging attention on only 22% of positions. We validate that prediction entropy reliably signals retrieval need, with a gap of 1.09 nats (nearly half the entropy range) between retrieval and local positions. Additionally, our approach provides interpretable adaptive computation, where routing decisions can be understood in information-theoretic terms.
Executive Summary
The article introduces AMOR (Adaptive Metacognitive Output Router), a hybrid architecture that dynamically switches between a State Space Model (SSM) backbone and sparse attention based on prediction entropy. Inspired by dual-process theories of cognition, AMOR engages attention only when the SSM's prediction is uncertain, improving efficiency without sacrificing retrieval accuracy. On small-scale synthetic retrieval tasks, AMOR outperforms both SSM-only and transformer-only baselines, and its routing decisions remain interpretable in information-theoretic terms, making it a promising advance in efficient adaptive computation.
Key Points
- AMOR dynamically switches between an SSM backbone and sparse attention based on prediction entropy.
- AMOR achieves perfect retrieval accuracy on small-scale synthetic tasks while engaging attention on only 22% of positions.
- Prediction entropy reliably signals retrieval need, with a 1.09-nat gap (nearly half the entropy range) between retrieval and local positions.
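The gating mechanism above can be sketched in a few lines. The paper does not publish its exact implementation, so this is a minimal illustration under stated assumptions: the gate computes the Shannon entropy (in nats) of the SSM's next-token softmax distribution and engages attention wherever that entropy exceeds a threshold; `threshold=1.0` is an arbitrary illustrative value, not the paper's.

```python
import numpy as np

def prediction_entropy(logits):
    """Shannon entropy of the softmax distribution, in nats."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_gate(logits, threshold):
    """Boolean mask over positions: True where attention should engage."""
    return prediction_entropy(logits) > threshold

# A near-uniform distribution (uncertain SSM) trips the gate; a sharply
# peaked one (confident SSM prediction) stays on the cheap O(n) path.
uncertain = np.zeros(10)                   # uniform over 10 tokens: ln(10) ~ 2.30 nats
confident = np.array([10.0] + [0.0] * 9)  # peaked: near 0 nats
logits = np.stack([uncertain, confident])
mask = entropy_gate(logits, threshold=1.0)
print(mask)  # [ True False]
```

Because entropy is computed from quantities the model already produces, the gate adds negligible overhead relative to the attention it avoids.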
Merits
Efficiency
AMOR significantly reduces computational overhead by engaging attention only when necessary, leveraging the O(n) computation of SSMs instead of the O(n^2) attention required by transformers.
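The abstract states that Ghost KV projects keys and values from the SSM's hidden states, reusing the O(n) computation already performed. A minimal sketch of that idea follows; the projection matrices `W_q`, `W_k`, `W_v`, the causal masking, and the specific gated positions are illustrative assumptions (random weights here), not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16                       # sequence length, hidden size
H = rng.standard_normal((n, d))    # SSM hidden states, already computed in O(n)

# "Ghost KV": derive keys/values from the SSM states via linear maps,
# instead of running a separate attention stack at every layer.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
K, V = H @ W_k, H @ W_v

def attend(q, K, V):
    """Single-query softmax attention over the Ghost KV cache."""
    scores = (K @ q) / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Sparse attention runs only at gated (high-entropy) positions,
# e.g. positions 2 and 5 here; everything else passes through untouched.
gated = [2, 5]
out = H.copy()
for t in gated:
    out[t] = attend(H[t] @ W_q, K[:t + 1], V[:t + 1])  # causal: keys up to t
```

The key design point is that K and V cost one matrix multiply over states the SSM produced anyway, so the quadratic term is paid only at the small fraction of gated positions (22% in the paper's experiments) rather than everywhere.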
Accuracy
AMOR achieves perfect retrieval accuracy on synthetic tasks, demonstrating its effectiveness in information retrieval.
Interpretability
The adaptive computation in AMOR is interpretable in information-theoretic terms, providing a clear understanding of routing decisions.
Demerits
Limited Scope
The study is limited to small-scale synthetic retrieval tasks, and its performance on larger, more complex datasets remains untested.
Complexity
The hybrid architecture of AMOR introduces additional complexity, which may pose challenges in implementation and scalability.
Entropy Measurement
Prediction entropy captures only the uncertainty the model itself expresses: a model that is confidently wrong has low entropy, so an entropy-only gate could fail to engage attention precisely when retrieval is most needed.
Expert Commentary
AMOR offers a genuinely novel take on adaptive computation: rather than allocating uniform work to every position, the model consults its own prediction entropy and escalates to sparse attention only when its SSM backbone is uncertain. The headline result, perfect retrieval accuracy on synthetic tasks while attending at only 22% of positions, is promising, and the reported 1.09-nat entropy gap between retrieval and local positions suggests the gating signal is well separated. The main caveats are scope and robustness: evaluation is confined to small-scale synthetic retrieval tasks, the hybrid architecture adds implementation complexity, and prediction entropy as a sole gating metric deserves further scrutiny. Even so, AMOR represents a meaningful step toward efficient, interpretable adaptive computation, with potential applications wherever accurate long-context retrieval at low cost matters.
Recommendations
- Future research should validate AMOR's performance on larger, more complex datasets to assess its scalability and robustness.
- Further studies should explore metrics beyond prediction entropy to better capture the nuances of uncertainty in information retrieval tasks.