ResearchGym: Evaluating Language Model Agents on Real-World AI Research
arXiv:2602.15112v1. Abstract: We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap. The agent improves over the provided baselines from the repository in just 1 of 15 evaluations (6.7%) by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds including Claude Code (Opus-4.5) and Codex (GPT-5.2) which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
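The abstract describes how each environment is constructed: the source repository's datasets, evaluation harness, and baseline implementations are preserved, while the paper's proposed method is withheld. The minimal Python sketch below illustrates one way such a task specification could be represented; the class names, fields, and example values are assumptions made for illustration, not the benchmark's actual interface.

```python
# Illustrative sketch of a ResearchGym-style task specification.
# All names and values here are hypothetical, not the benchmark's real API.
from dataclasses import dataclass, field


@dataclass
class SubTask:
    name: str
    metric: str            # metric defined by the source paper's evaluation harness
    baseline_score: float  # score achieved by the paper's provided baseline


@dataclass
class TaskEnvironment:
    paper_id: str                # source paper (oral/spotlight at ICML, ICLR, or ACL)
    preserved_assets: list[str]  # what the environment keeps from the repository
    withheld: str                # what is removed before the agent starts
    sub_tasks: list[SubTask] = field(default_factory=list)


# Example instantiation with made-up values.
env = TaskEnvironment(
    paper_id="example-icml-spotlight",
    preserved_assets=["datasets", "evaluation_harness", "baseline_implementations"],
    withheld="proposed_method",
    sub_tasks=[SubTask(name="subtask_1", metric="accuracy", baseline_score=0.78)],
)
print(f"{env.paper_id}: {len(env.sub_tasks)} sub-task(s); withheld: {env.withheld}")
```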
Executive Summary
This article presents ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. By repurposing five oral and spotlight papers from ICML, ICLR, and ACL, the authors build containerized task environments in which agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines. In a controlled evaluation, a GPT-5-powered agent improves on the repository baselines in only 1 of 15 evaluations (6.7%) and completes 26.5% of sub-tasks on average, revealing a sharp capability-reliability gap and recurring long-horizon failure modes. The study shows that frontier agents can occasionally reach state-of-the-art performance but do so unreliably, underscoring the need for systematic evaluation. ResearchGym provides the infrastructure for developing and analyzing autonomous agents in closed-loop research.
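To make the headline reliability numbers concrete, here is a minimal sketch of how the two statistics above could be computed from per-run records. The run outcomes are placeholders chosen only to reproduce the reported arithmetic (1 of 15 ≈ 6.7%; mean completion 26.5%), not the paper's actual data.

```python
# Placeholder per-run records, invented solely to reproduce the reported figures.
beat_baseline = [False] * 14 + [True]   # agent improved on the repo baseline in 1 of 15 evaluations
completion_rates = [0.20, 0.33]         # hypothetical per-run sub-task completion rates (mean 0.265)

improvement_rate = sum(beat_baseline) / len(beat_baseline)
mean_completion = sum(completion_rates) / len(completion_rates)

print(f"baseline improvement rate: {improvement_rate:.1%}")  # 6.7%
print(f"mean sub-task completion:  {mean_completion:.1%}")   # 26.5%
```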
Key Points
- ▸ ResearchGym is a benchmark and execution environment for evaluating AI agents on end-to-end research
- ▸ The evaluation of a GPT-5 powered agent reveals a sharp capability-reliability gap
- ▸ Recurring long-horizon failure modes include impatience, poor time and resource management, and overconfidence in weak hypotheses
Merits
Comprehensive Evaluation Framework
ResearchGym provides a thorough and systematic evaluation framework for AI agents, allowing for the identification of strengths and weaknesses in autonomous research.
Real-World Application
The use of real-world research papers and datasets in ResearchGym ensures that the evaluation environment is relevant and applicable to actual research scenarios.
Demerits
Limited Generalizability
The evaluation covers only five repurposed papers and a small set of agent scaffolds, so the results may not generalize to other AI agents or research domains.
Dependence on Human Baselines
The evaluation of AI agents relies on strong human baselines, which may not always be available or accurate, potentially affecting the reliability of the results.
Expert Commentary
The study is a significant contribution to the field: it highlights the need for reliable, systematic evaluation of AI agents, and ResearchGym supplies the benchmark and execution infrastructure needed to develop and analyze autonomous agents in closed-loop research. Its limitations, including the dependence on human baselines and limited generalizability, should be addressed in future work. The findings have practical implications for how AI agents are developed and deployed in research settings, and policy implications for guidelines and regulations governing autonomous AI research.
Recommendations
- ✓ Future research should focus on developing explainable AI agents that can provide insights into their decision-making processes.
- ✓ Establishing standards for evaluation and transparency in AI research is crucial for ensuring fairness and accountability in AI development and deployment.