A Rubric-Supervised Critic from Sparse Real-World Outcomes

Xingyao Wang, Valerie Chen, Heng Ji, Graham Neubig

arXiv:2603.03800v1 Announce Type: new Abstract: Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.

Executive Summary

This article proposes a novel approach to learning a "critic" model from sparse and noisy interaction data, bridging the gap between academic benchmarks and real-world coding agents. By introducing Critic Rubrics, a rubric-based supervision framework built on 24 trace-observable behavioral features, the authors demonstrate concrete gains: improved best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8), early stopping with 83% fewer attempts, and critic-driven curation of training data. The results show promising applications in semi-supervised learning and real-world coding agent evaluation. However, the article could benefit from further discussion of the generalizability of Critic Rubrics to diverse domains and the potential trade-offs between rubric-based supervision and other methods.
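To make the inference-time uses concrete, here is a minimal, hypothetical sketch of best-of-N reranking and early stopping driven by a critic score. The `critic_score` stub, the trajectory dictionaries, and the confidence `threshold` are all illustrative assumptions; the paper's actual critic is a learned model over interaction traces, not shown here.

```python
# Hypothetical sketch: using a critic for best-of-N reranking and early stopping.
# `critic_score` is a stub standing in for the learned critic model.

def critic_score(trajectory):
    """Stub: in the paper this would be a learned critic scoring a
    human-agent interaction trace. Here it just reads a stored value."""
    return trajectory["score"]

def rerank_best_of_n(trajectories):
    """Best-of-N reranking: return the trajectory the critic ranks highest."""
    return max(trajectories, key=critic_score)

def early_stop(sample_fn, n_max, threshold):
    """Draw samples one at a time and stop as soon as the critic's best
    score clears `threshold`, saving attempts relative to always drawing
    n_max samples. Returns the chosen trajectory and attempts used."""
    best = None
    for i in range(n_max):
        traj = sample_fn(i)
        if best is None or critic_score(traj) > critic_score(best):
            best = traj
        if critic_score(best) >= threshold:
            break  # early stop: fewer attempts than n_max
    return best, i + 1

# Toy usage with stand-in trajectories (scores are made up).
trajs = [{"id": k, "score": s} for k, s in enumerate([0.2, 0.9, 0.5])]
best = rerank_best_of_n(trajs)                                  # picks id 1
chosen, attempts = early_stop(lambda i: trajs[i], n_max=3, threshold=0.8)
```

In the toy run, early stopping settles on the same trajectory as full reranking while drawing only two of the three samples, which is the mechanism behind the paper's reported attempt savings.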

Key Points

  • The proposal of Critic Rubrics, a rubric-based supervision framework for learning a critic model.
  • The use of semi-supervised learning to jointly predict rubrics and sparse human feedback.
  • The demonstration of improved performance in reranking, early stopping, and data curation.

Merits

Strength in Semi-Supervised Learning

The use of semi-supervised learning enables the joint prediction of rubrics and sparse human feedback, leveraging the strengths of both labeled and unlabeled data.
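The exact objective is not reproduced in this summary, but the idea can be sketched as a loss that always supervises on the 24 trace-observable rubric features and adds a sparse-feedback term only when a human label exists. The squared-error and cross-entropy choices below are illustrative assumptions, not the paper's stated formulation.

```python
import math

def joint_loss(rubric_preds, rubric_labels, outcome_pred, outcome_label, lam=1.0):
    """Hypothetical semi-supervised objective: rubric supervision is always
    available from the trace; the human-feedback term is masked out when
    no label is present (outcome_label is None)."""
    # Rubric term: mean squared error over the behavioral features.
    rubric_term = sum((p - y) ** 2
                      for p, y in zip(rubric_preds, rubric_labels)) / len(rubric_labels)
    if outcome_label is None:
        return rubric_term  # unlabeled case: rubrics alone supervise the critic
    # Labeled case: add binary cross-entropy on the sparse human feedback.
    eps = 1e-9
    bce = -(outcome_label * math.log(outcome_pred + eps)
            + (1 - outcome_label) * math.log(1 - outcome_pred + eps))
    return rubric_term + lam * bce
```

The masking is what lets every trajectory contribute gradient signal through the rubric term, while the rarer human-feedback labels refine the same critic when they appear.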

Demerits

Limited Generalizability

The approach is developed and evaluated in the coding-agent domain (e.g., SWE-bench) and may not generalize to other domains, highlighting the need for further exploration and validation.

Expert Commentary

This article presents a promising approach to learning a critic model from sparse and noisy interaction data, leveraging the strengths of rubric-based supervision and semi-supervised learning. While the results are encouraging, further research is needed to address the limitations of generalizability and explore potential trade-offs with other methods. The implications of this work extend beyond the coding agent domain, contributing to the broader research on real-world agent evaluation and the development of more robust and reliable evaluation methods.

Recommendations

  • Future research should investigate the application of Critic Rubrics to diverse domains and explore potential trade-offs with other evaluation methods.
  • The proposed approach should be integrated with existing coding agent evaluation frameworks to enhance the efficiency and effectiveness of real-world coding agent development.
