A Rubric-Supervised Critic from Sparse Real-World Outcomes
arXiv:2603.03800v1 Announce Type: new Abstract: Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used both as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics imp
arXiv:2603.03800v1. Abstract: Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process for learning a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for RL-based training or for inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
Executive Summary
This article proposes a novel approach to learning a 'critic' model from sparse and noisy interaction data, bridging the gap between academic benchmarks and real-world coding agents. By introducing Critic Rubrics, a rubric-based supervision framework with 24 trace-observable behavioral features, the authors demonstrate gains in best-of-N reranking, early stopping, and training-data curation. The results point to promising applications of semi-supervised learning to real-world coding agent evaluation. However, the article could benefit from further discussion of how well Critic Rubrics generalize to other domains and of the trade-offs between rubric-based supervision and alternative reward-modeling methods.
Key Points
- ▸ The proposal of Critic Rubrics, a rubric-based supervision framework for learning a critic model.
- ▸ The use of semi-supervised learning to jointly predict rubrics and sparse human feedback.
- ▸ The demonstration of improved performance in reranking, early stopping, and data curation.
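The two inference-time uses listed above can be sketched concretely. The following is a hypothetical illustration, not the paper's implementation: `critic_score` stands in for the learned critic, which in the paper would be a model over trace-observable rubric features.

```python
# Hypothetical sketch of critic-guided best-of-N reranking and early
# stopping. `critic_score` is a placeholder for the learned critic.

def critic_score(trajectory: list[str]) -> float:
    """Placeholder critic: map a trajectory to a success estimate in [0, 1].

    In the paper's setting this would be a learned model over rubric
    features; here we fake it with trajectory length for illustration.
    """
    return min(1.0, len(trajectory) / 10)

def best_of_n(trajectories: list[list[str]]) -> list[str]:
    """Rerank N sampled trajectories and return the critic's top pick."""
    return max(trajectories, key=critic_score)

def early_stop(sample_next, max_attempts: int = 8, threshold: float = 0.8):
    """Sample attempts until one clears the critic threshold.

    Returns the best trajectory seen and the number of attempts used,
    so confident early successes skip the remaining sampling budget.
    """
    best, best_score, used = None, float("-inf"), 0
    for _ in range(max_attempts):
        trajectory = sample_next()
        used += 1
        score = critic_score(trajectory)
        if score > best_score:
            best, best_score = trajectory, score
        if score >= threshold:
            break  # confident enough; stop sampling
    return best, used
```

The early-stopping loop is what yields the "fewer attempts" savings reported in the abstract: a well-calibrated critic lets the agent stop as soon as one attempt looks good, rather than always exhausting the full budget of N samples.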
Merits
Strength in Semi-Supervised Learning
The use of semi-supervised learning enables the joint prediction of rubrics and sparse human feedback, leveraging the strengths of both labeled and unlabeled data.
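A minimal sketch of such a joint objective follows. All names, shapes, and the loss form are illustrative assumptions rather than the paper's actual implementation: every trace supervises the 24 rubric predictions, while the outcome head is trained only on the subset of traces where human feedback was observed.

```python
# Illustrative semi-supervised loss: dense rubric supervision plus an
# outcome term computed only over traces with observed human feedback.
import math

def bce(logit: float, target: float) -> float:
    """Binary cross-entropy with logits for a single prediction."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def joint_loss(rubric_logits, rubric_targets,
               outcome_logits, outcome_targets, alpha: float = 1.0):
    """Combine rubric and sparse-outcome supervision.

    rubric_logits / rubric_targets: per-trace lists of rubric
        predictions and labels (24 per trace in the paper's framework).
    outcome_logits / outcome_targets: per-trace scalars; a target of
        None marks a trace with no human feedback (the sparse case).
    alpha: weight on the outcome term (assumed hyperparameter).
    """
    # Dense term: every rubric of every trace contributes.
    rubric_loss = sum(
        bce(l, t)
        for logits, targets in zip(rubric_logits, rubric_targets)
        for l, t in zip(logits, targets)
    ) / sum(len(targets) for targets in rubric_targets)

    # Sparse term: only traces with observed feedback contribute.
    labeled = [(l, t) for l, t in zip(outcome_logits, outcome_targets)
               if t is not None]
    outcome_loss = (sum(bce(l, t) for l, t in labeled) / len(labeled)
                    if labeled else 0.0)
    return rubric_loss + alpha * outcome_loss
```

Masking the outcome term this way is what lets the model learn from every trace while still anchoring its predictions to real human feedback wherever that feedback exists.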
Demerits
Limited Generalizability
The article focuses on a single domain, coding agents, and its rubric features may not transfer to other agentic settings, highlighting the need for further exploration and validation.
Expert Commentary
This article presents a promising approach to learning a critic model from sparse and noisy interaction data, leveraging the strengths of rubric-based supervision and semi-supervised learning. While the results are encouraging, further research is needed to address the limitations of generalizability and explore potential trade-offs with other methods. The implications of this work extend beyond the coding agent domain, contributing to the broader research on real-world agent evaluation and the development of more robust and reliable evaluation methods.
Recommendations
- ✓ Future research should apply Critic Rubrics to diverse domains and explore potential trade-offs with other evaluation methods.
- ✓ The proposed approach should be integrated with existing coding agent evaluation frameworks to improve the efficiency and reliability of real-world coding agent development.