
Automated Concept Discovery for LLM-as-a-Judge Preference Analysis


James Wedgwood, Chhavi Yadav, Virginia Smith

arXiv:2603.03319v1 (announce type: cross)

Abstract: Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.

Executive Summary

This article introduces an automated concept discovery framework for identifying previously unknown drivers of Large Language Model (LLM) judgments in LLM-as-a-judge settings. Rather than relying on predefined bias taxonomies, the authors apply embedding-level concept extraction methods, particularly sparse autoencoder-based approaches, to analyze LLM judge behavior across more than 27k paired responses drawn from multiple human preference datasets. The method uncovers interpretable preference features that predict LLM decisions with competitive accuracy, revealing new biases: a preference for concreteness and empathy in responses to new situations, for detail and formality in academic advice, and against legal guidance that promotes active steps (e.g., calling police, filing lawsuits). Importantly, the approach validates prior findings (e.g., LLMs preferring refusals of sensitive requests at higher rates than humans) while extending bias detection beyond manually hypothesized taxonomies. This represents a significant methodological advance in LLM evaluation.

Key Points

  • Automated concept discovery replaces predefined bias taxonomies with algorithmic extraction of preference features.
  • Sparse autoencoder-based methods recover substantially more interpretable features than alternatives while remaining competitive in predicting LLM decisions.
  • Newly identified biases include domain-specific preferences for empathy, concreteness, detail, and formality, and a bias against legal guidance that promotes active steps such as calling police or filing lawsuits.
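The sparse autoencoder idea behind the concept extraction can be illustrated with a minimal sketch. All data and dimensions here are synthetic assumptions for illustration, not the paper's actual setup: an overcomplete ReLU autoencoder with an L1 sparsity penalty is trained to reconstruct response embeddings, and its sparse activations serve as candidate "concepts".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for response embeddings: each row is one candidate
# response embedded in d dimensions (synthetic data for illustration).
n, d, k = 512, 32, 64          # samples, embedding dim, dictionary size
X = rng.normal(size=(n, d))

# Encoder/decoder weights of an overcomplete sparse autoencoder.
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))

lr, l1 = 1e-2, 1e-3            # learning rate and sparsity penalty

for step in range(200):
    # Forward pass: the ReLU activations are the candidate "concepts".
    Z = np.maximum(X @ W_enc + b_enc, 0.0)
    X_hat = Z @ W_dec
    R = X_hat - X              # reconstruction residual

    # Gradients of 0.5*||X_hat - X||^2 / n + l1 * sum(|Z|) / n.
    dXhat = R / n
    dW_dec = Z.T @ dXhat
    dZ = dXhat @ W_dec.T + l1 * np.sign(Z) / n
    dZ[Z <= 0] = 0.0           # ReLU gate: no gradient where inactive
    dW_enc = X.T @ dZ
    db_enc = dZ.sum(axis=0)

    W_enc -= lr * dW_enc
    b_enc -= lr * db_enc
    W_dec -= lr * dW_dec

Z = np.maximum(X @ W_enc + b_enc, 0.0)
sparsity = (Z > 0).mean()      # fraction of active concepts per response
print(f"active fraction: {sparsity:.2f}")
```

In practice, each learned dictionary direction is then labeled by inspecting the responses that activate it most strongly, which is what makes the discovered features human-interpretable.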

Merits

Interpretability

The sparse autoencoder approach yields more interpretable features than alternative embedding-based methods, enhancing transparency in bias identification.

Scalability

Application to over 27k paired responses demonstrates feasibility at scale, enabling automated, systematic bias analysis without manual hypothesis generation.
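One common way to connect discovered concepts to judge decisions (sketched here on synthetic data; the paper's actual predictor may differ) is a logistic probe on the difference in concept activations between the two responses in each pair: large-magnitude fitted weights flag concepts that drive the judge's preference.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical concept activations for 1,000 response pairs over 64
# discovered concepts (synthetic data; shapes and indices illustrative).
n_pairs, k = 1000, 64
Z_a = rng.exponential(scale=1.0, size=(n_pairs, k))   # response A concepts
Z_b = rng.exponential(scale=1.0, size=(n_pairs, k))   # response B concepts

# Simulate a judge that favors concept 7 (say, "concreteness") and
# disfavors concept 12 (say, "actionable legal advice").
true_w = np.zeros(k)
true_w[7], true_w[12] = 2.0, -2.0
logits = (Z_a - Z_b) @ true_w
y = (rng.uniform(size=n_pairs) < 1 / (1 + np.exp(-logits))).astype(float)

# Logistic regression on activation differences: fitted weights rank
# concepts by how strongly they drive the judge's preference.
X = Z_a - Z_b
w = np.zeros(k)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))        # predicted P(judge picks A)
    w -= 0.1 * (X.T @ (p - y)) / n_pairs  # gradient step on log-loss

top = np.argsort(-np.abs(w))[:2]
print("strongest preference drivers:", sorted(top.tolist()))
```

The recovered top-weight concepts match the planted ones, which is the sense in which the probe "explains" the judge: predictiveness is achieved through a small set of named features rather than an opaque classifier.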

Demerits

Generalizability Constraint

Results are derived from specific datasets and LLMs; applicability to broader or divergent LLM architectures or use cases remains unvalidated.

Black-Box Risk

While features are interpretable, the underlying autoencoder architecture introduces potential opacity in the causal link between extracted concepts and LLM decision patterns.

Expert Commentary

This paper marks a pivotal step in the evaluation of LLMs as adjudicators. Bias identification in LLM judgments has traditionally been constrained by human-defined bias taxonomies, which are inherently incomplete and subject to confirmation bias. The authors' shift toward algorithmic, concept-level discovery via sparse autoencoders demonstrates that machine learning can surface latent preference structures in human-AI comparative data that elude human intuition. Moreover, the combination of interpretability and predictive power is rare in the bias detection literature; this balance is critical for adoption in real-world systems. The identification of domain-specific biases, such as the preference for detail and formality in academic advice and the bias against legal guidance recommending active steps, highlights the nuanced nature of LLM decision-making beyond surface-level patterns. If validated across diverse LLM models and domains, this methodology could become a standard tool in AI governance. It also raises important questions about the ethics of automated bias discovery: Who owns the discovered concepts? Can they be contested? How do we ensure these discoveries are not weaponized? The paper opens a vital dialogue on the next generation of AI evaluation frameworks.

Recommendations

  • Develop open-source tools for sparse autoencoder-based concept extraction to facilitate replication and scaling across LLM platforms.
  • Establish interdisciplinary working groups to validate discovered biases against legal, ethical, and sociocultural benchmarks.
