CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

arXiv:2603.00039v1. Abstract: LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. Code is released at https://github.com/SprocketLab/CARE.

Executive Summary

The paper introduces CARE, a confounder-aware aggregation framework for evaluating Large Language Models (LLMs). CARE addresses a core limitation of standard aggregation mechanisms by modeling LLM judge scores as a combination of latent true-quality signals and shared confounding factors, rather than assuming judges err independently. The framework provides theoretical guarantees for identifiability and finite-sample recovery, and demonstrates improved aggregation accuracy across 12 public benchmarks, reducing error by up to 26.8%. The authors release the code for CARE, making it accessible for further research and applications.

Key Points

  • CARE is a confounder-aware aggregation framework for LLM evaluation
  • The framework models LLM judge scores as a combination of latent true-quality signals and shared confounding factors
  • CARE provides theoretical guarantees for identifiability and finite-sample recovery
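The core problem the key points describe can be seen in a toy simulation. The sketch below is an illustrative assumption about the generative model (additive quality plus a shared confounder such as verbosity), not CARE's actual estimator: it shows why naive averaging fails to cancel a confounder common to all judges, and how regressing out a hypothetical observable proxy for the confounder recovers the quality signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model (an illustrative assumption, not CARE's method):
# judge j scores item i as  s_ij = q_i + b_j * c_i + noise,
# where q_i is latent true quality and c_i is a shared confounder
# (e.g. verbosity) that judge j weighs with sensitivity b_j.
n_items, n_judges = 5000, 5
q = rng.normal(size=n_items)                         # latent true quality
c = rng.normal(size=n_items)                         # shared confounder
b = np.array([0.8, 0.9, 0.7, 0.85, 0.75])            # judges' confounder loadings
noise = rng.normal(scale=0.3, size=(n_items, n_judges))
scores = q[:, None] + c[:, None] * b[None, :] + noise

# Naive averaging: independent noise shrinks with more judges,
# but the confounder term does NOT, because it is shared by all of them.
naive = scores.mean(axis=1)

# Deconfounding sketch: if a noisy observable proxy for the confounder
# exists (hypothetical here, e.g. measured response length), regressing
# it out of the averaged score removes the shared bias.
proxy = c + rng.normal(scale=0.1, size=n_items)
beta = np.cov(naive, proxy)[0, 1] / np.var(proxy)
deconfounded = naive - beta * (proxy - proxy.mean())

corr_naive = np.corrcoef(naive, q)[0, 1]
corr_adj = np.corrcoef(deconfounded, q)[0, 1]
print(f"corr(naive average, quality) = {corr_naive:.3f}")
print(f"corr(deconfounded,  quality) = {corr_adj:.3f}")
```

In this simulation the correlation of the naive average with true quality is capped by the confounder's variance, while the deconfounded score tracks quality much more closely. The label-free setting in the paper is harder precisely because no such proxy is given; CARE instead relies on identifiability results under the shared-confounder structure.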

Merits

Improved Aggregation Accuracy

CARE demonstrates improved aggregation accuracy across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, reducing error by up to 26.8%.

Theoretical Guarantees

The framework provides theoretical guarantees for identifiability and finite-sample recovery under shared confounders, giving it formal grounding that heuristic judge re-weighting lacks.

Demerits

No Ground-Truth Validation

Because CARE separates quality from confounders without access to ground-truth labels, its guarantees rest on modeling assumptions (such as the shared-confounder structure) that cannot be directly verified in label-free settings.

Expert Commentary

The introduction of CARE marks a significant advancement in LLM evaluation: it acknowledges that standard aggregation mechanisms wrongly assume independent judge errors and offers a more principled model of judge scores. By explicitly accounting for shared confounding factors such as verbosity and stylistic preferences, CARE can mitigate systematic biases and improve the reliability of LLM-as-a-judge pipelines. However, further research is needed to explore how well CARE transfers across diverse scenarios and to probe its sensitivity to the modeling assumptions it relies on in the absence of ground-truth labels.

Recommendations

  • Further research should investigate the applicability of CARE in various domains and scenarios
  • The development of CARE should be accompanied by efforts to establish clear evaluation protocols and standards for LLMs
