A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
arXiv:2603.06594v1 Announce Type: new Abstract: Automated \enquote{LLM-as-a-Judge} frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting jud
arXiv:2603.06594v1 Announce Type: new Abstract: Automated \enquote{LLM-as-a-Judge} frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.
Executive Summary
This article presents a thorough critique of the reliability of Large Language Model (LLM) judges in evaluating adversarial robustness. Through a comprehensive audit, the authors reveal that LLM judges often fail to reliably measure harmfulness, leading to degraded performance to near random chance. The study highlights the limitations of relying on LLM judges for safety evaluation, particularly in the presence of diverse victim models, distorted output patterns, and semantic ambiguity. To address these shortcomings, the authors propose ReliableBench, a benchmark for more consistently judgeable behaviors, and JudgeStressTest, a dataset designed to expose judge failures. The findings have significant implications for the development and deployment of LLMs in safety-critical applications.
Key Points
- ▸ LLM judges often fail to reliably measure adversarial robustness
- ▸ Existing validation protocols fail to account for distribution shifts inherent to red-teaming
- ▸ Judge performance degrades to near random chance in the presence of diverse victim models and semantic ambiguity
Merits
Methodological rigor
The study employs a comprehensive audit with 6642 human-verified labels, providing a robust evaluation of LLM judge performance.
Practical significance
The findings have significant implications for the development and deployment of LLMs in safety-critical applications.
Demerits
Limited scope
The study focuses on the reliability of LLM judges in evaluating adversarial robustness, which may not be directly applicable to other AI applications.
Dependence on human labels
The study relies on human-verified labels, which may introduce biases and inaccuracies in the evaluation process.
Expert Commentary
The study provides a thorough critique of the reliability of LLM judges in evaluating adversarial robustness, highlighting the limitations of existing validation protocols and the need for more robust and transparent benchmarks. The findings have significant implications for the development and deployment of LLMs in safety-critical applications. However, the study's reliance on human labels and limited scope may introduce biases and inaccuracies in the evaluation process. Nevertheless, the proposed benchmarks and datasets, such as ReliableBench and JudgeStressTest, offer valuable insights for researchers and developers looking to improve the reliability and transparency of LLMs. As AI continues to play an increasingly important role in safety-critical applications, it is essential to prioritize robustness, transparency, and explainability in AI model development and deployment.
Recommendations
- ✓ Develop and deploy LLMs using more robust and transparent benchmarks to ensure reliability in safety-critical applications.
- ✓ Establish regulatory standards for the development and deployment of LLMs in safety-critical applications, including requirements for robustness and transparency.