KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
arXiv:2602.19643v1
Abstract: Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.
Executive Summary
The article introduces KGHaluBench, a novel Knowledge Graph-based benchmark designed to evaluate the truthfulness of Large Language Models (LLMs) by assessing their breadth and depth of knowledge. The benchmark dynamically constructs challenging questions and employs an automated verification pipeline to detect and categorize hallucinations, providing a more comprehensive and fair evaluation of LLM performance. The study evaluates 25 frontier models using new accuracy and hallucination metrics, offering insights into the factors causing hallucinations across different model sizes. KGHaluBench is made publicly available to support future research in hallucination mitigation.
Key Points
- ▸ KGHaluBench is a Knowledge Graph-based benchmark for evaluating LLM hallucinations.
- ▸ The framework dynamically constructs challenging questions and estimates their difficulty to address popularity bias.
- ▸ An automated verification pipeline detects and verifies hallucinations at conceptual and correctness levels.
- ▸ The study evaluates 25 frontier models using novel accuracy and hallucination metrics.
- ▸ KGHaluBench is publicly available to support future developments in hallucination mitigation.
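The abstract does not spell out the paper's novel metrics, but the key ingredient they enable is clear: once abstentions are detected, accuracy and hallucination can be scored separately. As an illustrative sketch only (the label names and formulas here are assumptions, not the paper's definitions), accuracy can be computed over all questions while the hallucination rate is computed only over attempted ones:

```python
from collections import Counter

def score(labels):
    """labels: one outcome per question, e.g. 'correct', 'abstention',
    or a hallucination type such as 'factual_hallucination'.

    Accuracy counts correct answers over all questions; the hallucination
    rate counts hallucinations only over attempted (non-abstained) questions,
    so a model is not penalised for honestly declining to answer.
    """
    counts = Counter(labels)
    total = len(labels)
    attempted = total - counts["abstention"]
    accuracy = counts["correct"] / total if total else 0.0
    hallucinated = sum(v for k, v in counts.items()
                       if k not in ("correct", "abstention"))
    hallucination_rate = hallucinated / attempted if attempted else 0.0
    return accuracy, hallucination_rate

acc, hr = score(["correct", "abstention", "factual_hallucination", "correct"])
```

Separating the two quantities is what makes abstention detection matter: a model that abstains on hard questions lowers its accuracy but not its hallucination rate.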
Merits
Comprehensive Evaluation
KGHaluBench provides a more thorough and fair evaluation of LLM truthfulness by assessing both the breadth and depth of their knowledge, addressing the limitations of static and narrow questions in existing benchmarks.
Dynamic Question Construction
The framework's ability to dynamically construct challenging, multifaceted questions from the Knowledge Graph makes the evaluation more robust: because questions are generated from the graph rather than drawn from a fixed pool, the benchmark can probe both common and long-tail knowledge and surface the areas where a model is prone to hallucinate.
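The abstract describes this process only at a high level, so the following is a minimal sketch under stated assumptions: facts are simple subject-relation-object triples, a "multifaceted" question bundles several relations of one subject, and entity popularity (here approximated by hypothetical link counts in the KG) feeds a simple inverse-popularity difficulty estimate to counter popularity bias. None of these specifics come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

def compose_question(triples):
    """Join several facts about one subject into a single multifaceted question."""
    subject = triples[0].subject
    facets = " and ".join(t.relation.replace("_", " ") for t in triples)
    return f"For {subject}, state its {facets}."

def estimate_difficulty(triples, popularity):
    """Score questions about long-tail (less popular) entities as harder,
    offsetting the popularity bias of fixed benchmarks."""
    scores = [1.0 / (1.0 + popularity.get(t.obj, 0)) for t in triples]
    return sum(scores) / len(scores)

facts = [
    Triple("Marie Curie", "field_of_work", "physics"),
    Triple("Marie Curie", "place_of_birth", "Warsaw"),
]
popularity = {"physics": 900, "Warsaw": 40}  # assumed proxy: KG link counts
question = compose_question(facts)
difficulty = estimate_difficulty(facts, popularity)
```

Because questions are assembled on the fly from graph neighbourhoods, the pool scales with the KG rather than with manual annotation effort.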
Automated Verification Pipeline
The automated verification pipeline enhances the accuracy of hallucination detection by verifying responses at both conceptual and correctness levels, providing a more nuanced understanding of the types of hallucinations that occur.
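The paper's actual pipeline is not detailed in the abstract, but its three stages (abstention detection, conceptual check, correctness check) can be sketched as follows. The abstention patterns, string-matching checks, and label names below are all illustrative assumptions; a real implementation would use more robust entity linking and answer matching.

```python
import re

# Assumed abstention phrasings; the paper's detector is not specified.
ABSTAIN_PATTERNS = [
    r"\bI (?:don't|do not) know\b",
    r"\bnot sure\b",
    r"\bcannot (?:answer|determine)\b",
]

def is_abstention(response: str) -> bool:
    return any(re.search(p, response, re.IGNORECASE) for p in ABSTAIN_PATTERNS)

def verify(response: str, gold_entity: str, gold_answer: str) -> str:
    """Classify a response as 'abstention', 'conceptual_hallucination',
    'factual_hallucination', or 'correct'."""
    if is_abstention(response):
        return "abstention"
    # Conceptual level: is the response even about the queried entity?
    if gold_entity.lower() not in response.lower():
        return "conceptual_hallucination"
    # Correctness level: does the response contain the expected answer?
    if gold_answer.lower() not in response.lower():
        return "factual_hallucination"
    return "correct"
```

For example, under these assumptions `verify("Einstein was born in Ulm.", "Marie Curie", "Warsaw")` is flagged at the conceptual level, while `verify("Marie Curie was born in Paris.", "Marie Curie", "Warsaw")` passes the conceptual check but fails the correctness check, illustrating why the two levels identify different hallucination types.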
Demerits
Potential Bias in Knowledge Graph
The effectiveness of KGHaluBench is contingent on the comprehensiveness and accuracy of the underlying Knowledge Graph. Biases or gaps in the Knowledge Graph could lead to incomplete or misleading evaluations of LLM performance.
Scalability and Resource Intensity
The dynamic question construction and automated verification pipeline may require significant computational resources, potentially limiting the scalability of the benchmark for widespread use or real-time applications.
Generalizability of Findings
While the study evaluates 25 frontier models, the generalizability of the findings to other LLMs or specific domains may be limited, necessitating further validation across diverse models and applications.
Expert Commentary
KGHaluBench represents a significant advance in evaluating LLM truthfulness. By leveraging a Knowledge Graph to construct challenging questions dynamically and verifying responses automatically, the benchmark addresses critical limitations of existing evaluation frameworks. The assessment of 25 frontier models yields valuable insight into the knowledge factors that drive hallucinations, which is essential for building more reliable LLMs. As noted above, however, the benchmark's effectiveness depends on the quality and coverage of the underlying Knowledge Graph, and further validation across diverse models and domains is still needed. The practical implications are substantial: the benchmark directly supports ongoing efforts to mitigate hallucinations, and its insights could inform regulatory frameworks governing the responsible deployment of LLMs in critical domains.
Recommendations
- ✓ Further validation of KGHaluBench across a broader range of LLMs and specific domains to ensure the generalizability of the findings.
- ✓ Exploration of methods to enhance the scalability and efficiency of the dynamic question construction and automated verification pipeline to facilitate widespread adoption.