CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

arXiv:2603.00039v1. Abstract: LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. Code is released at https://github.com/SprocketLab/CARE.

Executive Summary

The paper introduces CARE, a confounder-aware aggregation framework for evaluating Large Language Models (LLMs). CARE addresses a core limitation of standard aggregation mechanisms by modeling LLM judge scores as a combination of latent true-quality signals and shared confounding factors, rather than assuming judges err independently. The framework provides theoretical guarantees for identifiability and finite-sample recovery, and demonstrates improved aggregation accuracy across 12 public benchmarks, reducing error by up to 26.8%. The authors release the code for CARE, making it accessible for further research and applications.

Key Points

  • CARE is a confounder-aware aggregation framework for LLM evaluation
  • The framework models LLM judge scores as a combination of latent true-quality signals and shared confounding factors
  • CARE provides theoretical guarantees for identifiability and finite-sample recovery
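The core problem the key points describe can be seen in a toy simulation. The sketch below is an illustrative assumption about the generative model (additive quality plus a shared confounder such as verbosity), not CARE's actual estimator: it shows why naive averaging fails to cancel a confounder common to all judges, and how regressing out a hypothetical observable proxy for the confounder recovers the quality signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model (an illustrative assumption, not CARE's method):
# judge j scores item i as  s_ij = q_i + b_j * c_i + noise,
# where q_i is latent true quality and c_i is a shared confounder
# (e.g. verbosity) that judge j weighs with sensitivity b_j.
n_items, n_judges = 5000, 5
q = rng.normal(size=n_items)                         # latent true quality
c = rng.normal(size=n_items)                         # shared confounder
b = np.array([0.8, 0.9, 0.7, 0.85, 0.75])            # judges' confounder loadings
noise = rng.normal(scale=0.3, size=(n_items, n_judges))
scores = q[:, None] + c[:, None] * b[None, :] + noise

# Naive averaging: independent noise shrinks with more judges,
# but the confounder term does NOT, because it is shared by all of them.
naive = scores.mean(axis=1)

# Deconfounding sketch: if a noisy observable proxy for the confounder
# exists (hypothetical here, e.g. measured response length), regressing
# it out of the averaged score removes the shared bias.
proxy = c + rng.normal(scale=0.1, size=n_items)
beta = np.cov(naive, proxy)[0, 1] / np.var(proxy)
deconfounded = naive - beta * (proxy - proxy.mean())

corr_naive = np.corrcoef(naive, q)[0, 1]
corr_adj = np.corrcoef(deconfounded, q)[0, 1]
print(f"corr(naive average, quality) = {corr_naive:.3f}")
print(f"corr(deconfounded,  quality) = {corr_adj:.3f}")
```

In this simulation the correlation of the naive average with true quality is capped by the confounder's variance, while the deconfounded score tracks quality much more closely. The label-free setting in the paper is harder precisely because no such proxy is given; CARE instead relies on identifiability results under the shared-confounder structure.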

Merits

Improved Aggregation Accuracy

CARE demonstrates improved aggregation accuracy across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, reducing error by up to 26.8%.

Theoretical Guarantees

The framework provides theoretical guarantees for identifiability and finite-sample recovery under shared confounders, giving it formal grounding that heuristic judge re-weighting lacks.

Demerits

No Ground-Truth Validation

Because CARE separates quality from confounders without access to ground-truth labels, its guarantees rest on modeling assumptions (such as the shared-confounder structure) that cannot be directly verified in label-free settings.

Expert Commentary

The introduction of CARE marks a significant advancement in LLM evaluation: it acknowledges that standard aggregation mechanisms wrongly assume independent judge errors and offers a more principled model of judge scores. By explicitly accounting for shared confounding factors such as verbosity and stylistic preferences, CARE can mitigate systematic biases and improve the reliability of LLM-as-a-judge pipelines. However, further research is needed to explore how well CARE transfers across diverse scenarios and to probe its sensitivity to the modeling assumptions it relies on in the absence of ground-truth labels.

Recommendations

  • Further research should investigate the applicability of CARE in various domains and scenarios
  • The development of CARE should be accompanied by efforts to establish clear evaluation protocols and standards for LLMs
