Academic

Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

arXiv:2604.03742v1. Abstract: Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose DualJudge, a hybrid framework inspired by Dual-Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency-aware weighting. DualJudge achieves state-of-the-art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty-aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at https://github.com/hreyulog/AHP_llm_judge.

Executive Summary

This work addresses the persistent challenge of inconsistent and opaque evaluations of large language models (LLMs) by introducing a structured, confidence-aware framework that integrates the Analytic Hierarchy Process (AHP) with fuzzy logic. The authors propose a dual innovation: first, a Fuzzy AHP (FAHP) extension that models epistemic uncertainty using triangular fuzzy numbers modulated by LLM-generated confidence scores, and second, a hybrid DualJudge framework that adaptively fuses direct scoring with structured AHP outputs via consistency-aware weighting. Validated on JudgeBench, their approach demonstrates better calibration, stability, and accuracy than direct scoring across model scales and dataset splits. The methodological rigor and empirical validation underscore the potential of uncertainty-aware structured reasoning to improve the reliability of LLM assessments, offering a principled alternative to conventional direct scoring.

Key Points

  • Integration of AHP with fuzzy logic to model epistemic uncertainty in LLM evaluations, enhancing calibration and transparency.
  • Proposal of DualJudge, a hybrid framework that combines intuitive direct scoring with deliberative structured reasoning via consistency-aware weighting.
  • Empirical validation on JudgeBench shows that both crisp and fuzzy AHP consistently outperform direct scoring, with FAHP exhibiting superior stability in uncertain comparison scenarios.
  • Methodological innovation lies in the adaptive fusion of confidence-aware fuzzy aggregation with hierarchical decomposition of evaluation criteria.
  • Open-source implementation (GitHub) facilitates reproducibility and adoption in broader LLM evaluation ecosystems.
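The confidence-modulated triangular fuzzy numbers at the heart of FAHP can be illustrated with a small sketch. The paper's exact construction is not reproduced here; the assumption below is that a crisp judgment score is widened into a triangular fuzzy number (TFN) whose spread shrinks as the LLM-reported confidence grows, and that aggregation ends with centroid defuzzification.

```python
# Hypothetical sketch: a crisp score becomes a TFN (l, m, u) whose
# spread is inversely tied to the LLM's confidence c in [0, 1].
# This is an illustration of the general technique, not the paper's code.

def to_tfn(score: float, confidence: float, max_spread: float = 1.0):
    """Return (l, m, u): a triangular fuzzy number centered at `score`."""
    spread = max_spread * (1.0 - confidence)  # low confidence -> wide TFN
    return (score - spread, score, score + spread)

def defuzzify(tfn):
    """Centroid defuzzification of a TFN: (l + m + u) / 3."""
    l, m, u = tfn
    return (l + m + u) / 3.0

# A score of 4.0 held with middling confidence becomes a wide interval;
# the same score held confidently stays nearly crisp.
uncertain = to_tfn(score=4.0, confidence=0.5)   # (3.5, 4.0, 4.5)
confident = to_tfn(score=4.0, confidence=0.95)  # narrow TFN around 4.0
print(uncertain, defuzzify(uncertain))
```

Because the TFN is symmetric here, defuzzification recovers the original score; the fuzzy spread matters when multiple criteria-level TFNs with differing confidences are aggregated before defuzzification.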

Merits

Methodological Rigor

The authors demonstrate a sophisticated fusion of multi-criteria decision analysis (MCDA) techniques with LLM-generated confidence metrics, producing a structured yet flexible evaluation framework that addresses longstanding issues of opacity and inconsistency in LLM assessments.
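For readers unfamiliar with the MCDA machinery, the crisp AHP step can be sketched as follows. This is a standard textbook construction, not the authors' implementation: criterion weights are derived from a reciprocal pairwise-comparison matrix via the geometric-mean approximation, and Saaty's consistency ratio flags incoherent comparisons.

```python
import math

# Standard AHP illustration (not the paper's code): weights from a
# reciprocal pairwise-comparison matrix, plus a consistency check.

def ahp_weights(M):
    """Geometric-mean approximation of the principal eigenvector."""
    gm = [math.prod(row) ** (1.0 / len(row)) for row in M]
    total = sum(gm)
    return [g / total for g in gm]

def consistency_ratio(M, w):
    """Saaty's CR; values below ~0.1 indicate acceptable consistency."""
    n = len(M)
    # lambda_max approximated from weighted row sums
    lam = sum(sum(M[i][j] * w[j] for j in range(n)) / w[i]
              for i in range(n)) / n
    ci = (lam - n) / (n - 1)
    ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # Saaty's random index
    return ci / ri

# Three hypothetical criteria, e.g. correctness, clarity, completeness:
# correctness is moderately preferred to clarity, strongly to completeness.
M = [[1,   3,   5],
     [1/3, 1,   3],
     [1/5, 1/3, 1]]
w = ahp_weights(M)
print([round(x, 3) for x in w])        # correctness gets the most weight
print(consistency_ratio(M, w) < 0.1)   # this matrix is consistent enough
```

The judge model's role in an AHP-style pipeline is to supply the pairwise preference entries of M per criterion; the aggregation itself stays deterministic and auditable.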

Empirical Robustness

Extensive experiments across model scales and dataset splits validate the superiority of both crisp and fuzzy AHP over direct scoring, with FAHP particularly excelling in uncertain comparison scenarios, highlighting its practical reliability.

Theoretical Contribution

The introduction of DualJudge, inspired by Dual-Process Theory, bridges intuitive and deliberative evaluation paradigms, offering a novel lens for LLM assessment that aligns with cognitive science principles.
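The fusion idea can be made concrete with a minimal sketch. The paper's actual weighting function is not given here; the assumption below is that the structured AHP score earns more weight when its pairwise comparisons are more internally consistent (lower consistency ratio), falling back to the holistic direct score otherwise.

```python
# Hypothetical consistency-aware fusion in the spirit of DualJudge.
# The trust schedule (linear in CR, clamped at a threshold) is an
# illustrative assumption, not the authors' formula.

def fuse(direct_score: float, ahp_score: float,
         consistency_ratio: float, cr_threshold: float = 0.1) -> float:
    """Blend holistic and structured scores by AHP consistency."""
    # Map CR in [0, threshold] to a trust weight in [1, 0]; clamp beyond.
    trust = max(0.0, 1.0 - consistency_ratio / cr_threshold)
    return trust * ahp_score + (1.0 - trust) * direct_score

# Highly consistent AHP judgment -> structured score dominates.
print(fuse(direct_score=3.0, ahp_score=4.0, consistency_ratio=0.02))
# Incoherent AHP judgment (CR above threshold) -> pure direct score.
print(fuse(direct_score=3.0, ahp_score=4.0, consistency_ratio=0.5))
```

The appeal of this shape of rule is that it degrades gracefully: when the deliberative path contradicts itself, the intuitive path takes over, mirroring the Dual-Process framing.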

Reproducibility and Openness

The release of code on GitHub ensures transparency and fosters community engagement, accelerating the adoption and refinement of the proposed methodologies.

Demerits

Dependence on Confidence Metrics

The FAHP framework's performance hinges on the quality of LLM-generated confidence scores, which may introduce bias or noise if the underlying model's calibration is suboptimal, potentially undermining the framework's reliability.

Scope of Validation

While validated on JudgeBench, the generalizability of the approach to other evaluation benchmarks or domains (e.g., multimodal LLMs, non-English languages) remains an open question, warranting further empirical testing.

Computational Overhead

The structured and fuzzy nature of the proposed methods may introduce significant computational overhead compared to direct scoring, posing challenges for real-time or large-scale evaluations where efficiency is critical.

Subjectivity in Criteria Weighting

Although AHP reduces subjectivity through pairwise comparisons, the initial assignment of criteria weights may still reflect human biases, particularly in domains requiring domain-specific expertise.

Expert Commentary

The authors present a compelling and timely contribution to the field of LLM evaluation, addressing a critical bottleneck in the development and deployment of these systems. By integrating AHP with fuzzy logic and confidence-aware metrics, they offer a structured approach that not only improves the calibration of evaluations but also introduces a novel hybrid framework—DualJudge—that bridges intuitive and deliberative paradigms. The empirical validation on JudgeBench is robust and persuasive, demonstrating clear advantages over direct scoring methods. However, the framework's reliance on LLM-generated confidence scores introduces a potential vulnerability, as inaccuracies in these scores could propagate through the evaluation process. Furthermore, while the theoretical foundations are sound, the practical implications of computational overhead and the subjectivity in criteria weighting merit careful consideration. The interdisciplinary nature of this work, drawing from MCDA and cognitive science, sets a high standard for future research in AI evaluation. This paper is likely to become a foundational reference for scholars and practitioners seeking to enhance the reliability and transparency of LLM assessments.

Recommendations

  • Future research should explore the robustness of the FAHP and DualJudge frameworks across diverse benchmarks and domains, including multimodal LLMs and non-English languages, to validate their generalizability.
  • Investigate methods to optimize the computational efficiency of the proposed approaches, particularly for real-time or large-scale evaluations, without compromising the integrity of the evaluation process.
  • Develop standardized protocols for selecting and weighting evaluation criteria in AHP-based frameworks to minimize subjectivity and enhance reproducibility, potentially leveraging consensus-based methods or expert elicitation techniques.
  • Explore the integration of additional uncertainty quantification techniques, such as Bayesian methods or evidential reasoning, to further enhance the calibration and interpretability of LLM evaluations.

Sources

Original: arXiv - cs.AI