Think, But Don't Overthink: Reproducing Recursive Language Models
arXiv:2603.02615v1 Announce Type: new Abstract: This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction
Executive Summary
This study reproduces and extends the Recursive Language Models (RLMs) framework by investigating how scaling recursion depth affects Large Language Models (LLMs). While RLMs with a recursion depth of 1 boost accuracy on complex reasoning tasks, deeper recursion (depth=2) paradoxically degrades performance and sharply inflates execution time (in one case from 3.6s to 344.5s) and token costs. The findings suggest that LLMs may "overthink" when given excessive recursion, implying that recursion depth should be treated as a hyperparameter tuned to task complexity rather than scaled up by default.
Key Points
- ▸ RLMs with a recursion depth of 1 improve accuracy on complex reasoning tasks
- ▸ Deeper recursion (depth=2) degrades performance and inflates execution time and token costs
- ▸ LLMs may "overthink" when given excessive recursion
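The depth-scaling behavior above can be sketched as a depth-capped map-reduce over the context. This is a minimal illustration, not the paper's actual REPL-based implementation: `call_llm` is a hypothetical stand-in for a real model API (e.g., DeepSeek v3.2 or Kimi K2), stubbed here so the control flow runs offline. The sketch shows why each added level of depth multiplies the number of model calls, and hence runtime and token cost.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call. A real RLM would query an actual model;
    this stub just echoes the tail of the prompt as a mock answer."""
    return f"ANSWER: {prompt[-20:]}"


def rlm_query(context: str, question: str, depth: int, chunk_size: int = 1000) -> str:
    """Answer `question` over `context`, recursing at most `depth` times.

    depth == 0: pass a (truncated) view of the context to the model directly.
    depth >= 1: split the context into chunks, answer each sub-context
    recursively at depth - 1, then synthesize the partial answers.
    """
    if depth == 0 or len(context) <= chunk_size:
        return call_llm(f"{context[:chunk_size]}\n\nQ: {question}")

    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    # Each extra level of depth multiplies the number of model calls,
    # which is why depth=2 inflates execution time and token costs.
    partials = [rlm_query(c, question, depth - 1, chunk_size) for c in chunks]
    return call_llm("Synthesize: " + " | ".join(partials) + f"\n\nQ: {question}")
```

With a 3,000-character context and `chunk_size=1000`, depth=1 issues 4 model calls (3 chunk queries plus 1 synthesis), while depth=2 issues one call per chunk at the bottom level plus a synthesis call per parent, so the call count grows roughly geometrically with depth.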
Merits
Strength in Replication
The study's ability to replicate and extend the RLM framework by Zhang et al. (2026) demonstrates a high level of scientific rigor and attention to detail.
Insights into LLM Behavior
The findings provide valuable insights into the behavior of LLMs under different recursion depths, shedding light on the potential risks of over-recursion.
Demerits
Limited Generalizability
The study's findings may not generalize to all LLM architectures and tasks, highlighting the need for further research to explore the robustness of the RLM framework.
Lack of Theoretical Foundations
The study's focus on empirical evaluation leaves open questions regarding the theoretical foundations of RLMs and their optimal recursion depth.
Expert Commentary
The study's findings have significant implications for the development and deployment of LLMs. While RLMs with a recursion depth of 1 show promise for improving accuracy on complex reasoning tasks, the risks of over-recursion highlight the need for careful tuning of RLM parameters. The results also raise open questions about the theoretical foundations of RLMs and how an optimal recursion depth should be chosen; further research is needed to test the robustness of the RLM framework and to make its recursive behavior more explainable and transparent.
Recommendations
- ✓ Future studies should investigate the robustness of the RLM framework across different LLM architectures and tasks.
- ✓ Researchers should explore alternative techniques to mitigate the risks of over-recursion in LLMs.