Academic

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

arXiv:2603.06594v1 Announce Type: new Abstract: Automated \enquote{LLM-as-a-Judge} frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting jud

Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan G\"unnemann · March 10, 2026 · 1 min read · 82 views

#cs.CL #cs.AI

Executive Summary

This article presents a thorough critique of the reliability of Large Language Model (LLM) judges in evaluating adversarial robustness. Through a comprehensive audit, the authors reveal that LLM judges often fail to reliably measure harmfulness, leading to degraded performance to near random chance. The study highlights the limitations of relying on LLM judges for safety evaluation, particularly in the presence of diverse victim models, distorted output patterns, and semantic ambiguity. To address these shortcomings, the authors propose ReliableBench, a benchmark for more consistently judgeable behaviors, and JudgeStressTest, a dataset designed to expose judge failures. The findings have significant implications for the development and deployment of LLMs in safety-critical applications.

Key Points

▸ LLM judges often fail to reliably measure adversarial robustness
▸ Existing validation protocols fail to account for distribution shifts inherent to red-teaming
▸ Judge performance degrades to near random chance in the presence of diverse victim models and semantic ambiguity

Merits

Methodological rigor

The study employs a comprehensive audit with 6642 human-verified labels, providing a robust evaluation of LLM judge performance.

Practical significance

The findings have significant implications for the development and deployment of LLMs in safety-critical applications.

Demerits

Limited scope

The study focuses on the reliability of LLM judges in evaluating adversarial robustness, which may not be directly applicable to other AI applications.

Dependence on human labels

The study relies on human-verified labels, which may introduce biases and inaccuracies in the evaluation process.

Expert Commentary

The study provides a thorough critique of the reliability of LLM judges in evaluating adversarial robustness, highlighting the limitations of existing validation protocols and the need for more robust and transparent benchmarks. The findings have significant implications for the development and deployment of LLMs in safety-critical applications. However, the study's reliance on human labels and limited scope may introduce biases and inaccuracies in the evaluation process. Nevertheless, the proposed benchmarks and datasets, such as ReliableBench and JudgeStressTest, offer valuable insights for researchers and developers looking to improve the reliability and transparency of LLMs. As AI continues to play an increasingly important role in safety-critical applications, it is essential to prioritize robustness, transparency, and explainability in AI model development and deployment.

Recommendations

✓ Develop and deploy LLMs using more robust and transparent benchmarks to ensure reliability in safety-critical applications.
✓ Establish regulatory standards for the development and deployment of LLMs in safety-critical applications, including requirements for robustness and transparency.

Sources

arXiv - cs.CL

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

AI Commentary

Executive Summary

Key Points

Merits

Methodological rigor

Practical significance

Demerits

Limited scope

Dependence on human labels

Expert Commentary

Recommendations

Sources

Related Articles

ConstitutionGPT: An AI-Powered Multilingual Legal Assistance System for Indian Citizens

AI Copyright Infringement: Navigating the Legal Risks of AI-Generated Content

The Rhetoric of Machine Learning

Busemann energy-based attention for emotion analysis in Poincar\'e discs

JCG, PC

HSOLLC Co., Ltd.

Related Articles

ConstitutionGPT: An AI-Powered Multilingual Legal Assistance System for Indian Citizens

AI Copyright Infringement: Navigating the Legal Risks of AI-Generated Content

The Rhetoric of Machine Learning

Busemann energy-based attention for emotion analysis in Poincar\'e discs