IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

arXiv:2603.04738v1 Announce Type: new Abstract: Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our codes and data are available at https://github.com/thu-coai/IF-RewardBench.
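The preference-graph idea from the abstract can be illustrated with a small sketch. The judge interface, response names, and scoring rule below are illustrative assumptions, not the paper's implementation: we collect every pairwise preference among a set of responses and derive a listwise ranking by Copeland scoring (wins minus losses).

```python
from itertools import combinations

def rank_from_preferences(responses, prefers):
    """Derive a listwise ranking from all pairwise preferences.

    `prefers(a, b)` returns True if response `a` is judged to follow
    the instruction better than `b` (hypothetical judge interface).
    Ranking rule here is Copeland scoring: wins minus losses.
    """
    score = {r: 0 for r in responses}
    for a, b in combinations(responses, 2):
        if prefers(a, b):
            score[a] += 1
            score[b] -= 1
        else:
            score[b] += 1
            score[a] -= 1
    return sorted(responses, key=lambda r: score[r], reverse=True)

# Toy judge: prefer the response satisfying more (made-up) constraints.
quality = {"resp_A": 3, "resp_B": 1, "resp_C": 2}
ranking = rank_from_preferences(list(quality), lambda a, b: quality[a] > quality[b])
print(ranking)  # ['resp_A', 'resp_C', 'resp_B']
```

Any consistent aggregation rule would do here; the point is that a complete pairwise preference graph determines an ordering over all responses, which is what a listwise evaluation can then score.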

Executive Summary

This paper introduces IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following in large language models (LLMs). For each instruction, the benchmark builds a preference graph over multiple responses and evaluates judge models on their ability to rank those responses by instruction-following quality, addressing the narrow data coverage and pairwise-only paradigms of existing benchmarks. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and show that the benchmark correlates more strongly with downstream task performance than existing alternatives.

Key Points

  • IF-RewardBench is a comprehensive meta-evaluation benchmark for instruction-following
  • The benchmark assesses judge models' ability to rank multiple responses based on instruction-following quality
  • Existing benchmarks are limited by insufficient data coverage and oversimplified evaluation paradigms

Merits

Strength in evaluation paradigm

IF-RewardBench's listwise evaluation paradigm, built on per-instruction preference graphs, more closely mirrors model optimization scenarios, where a judge must rank many candidate responses rather than compare a single pair.
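One plausible way to score a judge in a listwise setting (an illustration, not necessarily the paper's metric) is rank correlation between the judge's ranking and a gold ranking, e.g. Kendall's tau computed from pairwise order agreements:

```python
from itertools import combinations

def kendall_tau(gold, judged):
    """Kendall rank correlation between two rankings of the same items.

    Both arguments list the same items from best to worst.
    Returns a value in [-1, 1]: 1 = identical order, -1 = fully reversed.
    """
    pos_g = {r: i for i, r in enumerate(gold)}
    pos_j = {r: i for i, r in enumerate(judged)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):
        # Pair is concordant if both rankings order a and b the same way.
        if (pos_g[a] - pos_g[b]) * (pos_j[a] - pos_j[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(round(kendall_tau(["A", "B", "C"], ["A", "C", "B"]), 3))  # 0.333
```

Unlike pairwise accuracy, a metric of this kind penalizes a judge for every inverted pair in its full ranking, which is closer to how a reward signal is consumed during alignment.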

Comprehensive data coverage

The benchmark covers diverse instruction and constraint types, providing a more robust assessment of judge models.

Demerits

Limited generalizability

Although the benchmark covers diverse instruction and constraint types, results on it may not generalize to the messier instructions encountered in real-world deployments.

Computational intensity

Constructing a complete preference graph for each instruction requires a judgment for every pair of responses, which can be computationally intensive at scale.

Expert Commentary

The authors make a significant contribution to the field of large language model evaluation by proposing a comprehensive meta-evaluation benchmark that addresses the shortcomings of existing benchmarks. The design of IF-RewardBench, with its listwise evaluation paradigm and diverse instruction and constraint types, provides a more accurate assessment of judge models' capabilities. However, the computational intensity of constructing preference graphs and the limited generalizability of the benchmark are notable limitations. Nevertheless, the implications of IF-RewardBench are far-reaching, with potential applications in model alignment, LLM evaluation, and policy-making.

Recommendations

  • Future research should focus on developing more efficient algorithms for constructing preference graphs and improving the generalizability of IF-RewardBench.
  • The development of more comprehensive evaluation frameworks for LLMs, informed by IF-RewardBench, is essential for advancing the field of natural language processing.
