When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
arXiv:2603.05659v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals, and even in subjective domains, by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices, implicit score emission and group calibration, are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% for others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC, lower is better: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics.
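The core recipe the abstract describes, severity-weighted error counting turned into group-calibrated rewards, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the severity weights, error axes, and the exact calibration used by IEC are assumptions here, and `group_calibrated_rewards` is a hypothetical helper.

```python
from statistics import mean, pstdev

# Hypothetical severity weights per error class; the paper's actual
# weights and task-relevant axes are not specified in the abstract.
SEVERITY = {"minor": 1.0, "moderate": 2.0, "severe": 4.0}

def raw_penalty(errors):
    """Sum severity-weighted error counts for one response on one axis."""
    return sum(SEVERITY[sev] for sev in errors)

def group_calibrated_rewards(error_lists):
    """Turn per-response error penalties into rewards calibrated within
    the sampled group (higher reward = fewer / less severe errors)."""
    penalties = [raw_penalty(e) for e in error_lists]
    mu = mean(penalties)
    sigma = pstdev(penalties) or 1.0  # guard against all-tied groups
    # Negate and standardize: a response is rewarded relative to its
    # group, which damps the noise of absolute error counts.
    return [(mu - p) / sigma for p in penalties]
```

For a group of three responses with errors `[["minor"], ["severe", "minor"], []]`, the error-free response gets the highest reward and the severely flawed one the lowest, with the group mean reward at zero by construction.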
Executive Summary
The article proposes Implicit Error Counting (IEC) as a post-training method for reference-free reinforcement learning, where ideal answers are unavailable. IEC enumerates errors in responses and converts them into calibrated rewards, outperforming traditional rubric-based methods in virtual try-on tasks. The approach is validated through a case study and benchmarking, demonstrating its effectiveness in domains with multiple valid outputs and subtle errors.
Key Points
- ▸ IEC is a novel approach to reference-free reinforcement learning
- ▸ Error enumeration is used to generate rewards instead of traditional rubric-based methods
- ▸ IEC outperforms traditional methods in virtual try-on tasks with multiple valid outputs
Merits
Effective in reference-free settings: IEC can handle tasks with multiple valid outputs and no single ideal answer.
Improved performance: IEC outperforms traditional rubric-based methods on virtual try-on tasks.
Demerits
Limited to specific domains: IEC may not be applicable everywhere, particularly in domains with clear correctness signals where verifiable rewards already suffice.
Requires careful calibration: IEC depends on careful calibration of error counting and reward generation.
Expert Commentary
The article presents a significant contribution to reinforcement learning post-training, particularly in reference-free settings. The proposed IEC approach demonstrates improved performance on virtual try-on tasks, highlighting its potential for domains with multiple valid outputs. However, further research is needed to establish how well IEC transfers beyond this case study. The article's thorough evaluation and benchmarking provide a solid foundation for future work in this area.
Recommendations
- ✓ Further research is needed to explore the applicability of IEC to other domains
- ✓ Careful calibration of error counting and reward generation is crucial for effective implementation of IEC