KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
arXiv:2602.19643v1
Abstract: Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.
Executive Summary
The article introduces KGHaluBench, a novel Knowledge Graph-based benchmark designed to evaluate the truthfulness of Large Language Models (LLMs) by assessing their breadth and depth of knowledge. The benchmark dynamically constructs challenging questions and employs an automated verification pipeline to detect and categorize hallucinations, providing a more comprehensive and fair evaluation of LLM performance. The study evaluates 25 frontier models using new accuracy and hallucination metrics, offering insights into the factors causing hallucinations across different model sizes. KGHaluBench is made publicly available to support future research in hallucination mitigation.
Key Points
- ▸ KGHaluBench is a Knowledge Graph-based benchmark for evaluating LLM hallucinations.
- ▸ The framework dynamically constructs challenging questions and estimates their difficulty to address popularity bias.
- ▸ An automated verification pipeline detects and verifies hallucinations at conceptual and correctness levels.
- ▸ The study evaluates 25 frontier models using novel accuracy and hallucination metrics.
- ▸ KGHaluBench is publicly available to support future developments in hallucination mitigation.
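The abstract does not spell out the paper's novel metrics, but the key ingredient they enable is clear: once abstentions are detected, accuracy and hallucination can be scored separately. As an illustrative sketch only (the label names and formulas here are assumptions, not the paper's definitions), accuracy can be computed over all questions while the hallucination rate is computed only over attempted ones:

```python
from collections import Counter

def score(labels):
    """labels: one outcome per question, e.g. 'correct', 'abstention',
    or a hallucination type such as 'factual_hallucination'.

    Accuracy counts correct answers over all questions; the hallucination
    rate counts hallucinations only over attempted (non-abstained) questions,
    so a model is not penalised for honestly declining to answer.
    """
    counts = Counter(labels)
    total = len(labels)
    attempted = total - counts["abstention"]
    accuracy = counts["correct"] / total if total else 0.0
    hallucinated = sum(v for k, v in counts.items()
                       if k not in ("correct", "abstention"))
    hallucination_rate = hallucinated / attempted if attempted else 0.0
    return accuracy, hallucination_rate

acc, hr = score(["correct", "abstention", "factual_hallucination", "correct"])
```

Separating the two quantities is what makes abstention detection matter: a model that abstains on hard questions lowers its accuracy but not its hallucination rate.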
Merits
Comprehensive Evaluation
KGHaluBench provides a more thorough and fair evaluation of LLM truthfulness by assessing both the breadth and depth of their knowledge, addressing the limitations of static and narrow questions in existing benchmarks.
Dynamic Question Construction
The framework's ability to dynamically construct challenging, multifaceted questions from the Knowledge Graph makes the evaluation more robust: because questions are generated from the graph rather than drawn from a fixed pool, the benchmark can probe both common and long-tail knowledge and surface the areas where a model is prone to hallucinate.
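The abstract describes this process only at a high level, so the following is a minimal sketch under stated assumptions: facts are simple subject-relation-object triples, a "multifaceted" question bundles several relations of one subject, and entity popularity (here approximated by hypothetical link counts in the KG) feeds a simple inverse-popularity difficulty estimate to counter popularity bias. None of these specifics come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

def compose_question(triples):
    """Join several facts about one subject into a single multifaceted question."""
    subject = triples[0].subject
    facets = " and ".join(t.relation.replace("_", " ") for t in triples)
    return f"For {subject}, state its {facets}."

def estimate_difficulty(triples, popularity):
    """Score questions about long-tail (less popular) entities as harder,
    offsetting the popularity bias of fixed benchmarks."""
    scores = [1.0 / (1.0 + popularity.get(t.obj, 0)) for t in triples]
    return sum(scores) / len(scores)

facts = [
    Triple("Marie Curie", "field_of_work", "physics"),
    Triple("Marie Curie", "place_of_birth", "Warsaw"),
]
popularity = {"physics": 900, "Warsaw": 40}  # assumed proxy: KG link counts
question = compose_question(facts)
difficulty = estimate_difficulty(facts, popularity)
```

Because questions are assembled on the fly from graph neighbourhoods, the pool scales with the KG rather than with manual annotation effort.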
Automated Verification Pipeline
The automated verification pipeline enhances the accuracy of hallucination detection by verifying responses at both conceptual and correctness levels, providing a more nuanced understanding of the types of hallucinations that occur.
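The paper's actual pipeline is not detailed in the abstract, but its three stages (abstention detection, conceptual check, correctness check) can be sketched as follows. The abstention patterns, string-matching checks, and label names below are all illustrative assumptions; a real implementation would use more robust entity linking and answer matching.

```python
import re

# Assumed abstention phrasings; the paper's detector is not specified.
ABSTAIN_PATTERNS = [
    r"\bI (?:don't|do not) know\b",
    r"\bnot sure\b",
    r"\bcannot (?:answer|determine)\b",
]

def is_abstention(response: str) -> bool:
    return any(re.search(p, response, re.IGNORECASE) for p in ABSTAIN_PATTERNS)

def verify(response: str, gold_entity: str, gold_answer: str) -> str:
    """Classify a response as 'abstention', 'conceptual_hallucination',
    'factual_hallucination', or 'correct'."""
    if is_abstention(response):
        return "abstention"
    # Conceptual level: is the response even about the queried entity?
    if gold_entity.lower() not in response.lower():
        return "conceptual_hallucination"
    # Correctness level: does the response contain the expected answer?
    if gold_answer.lower() not in response.lower():
        return "factual_hallucination"
    return "correct"
```

For example, under these assumptions `verify("Einstein was born in Ulm.", "Marie Curie", "Warsaw")` is flagged at the conceptual level, while `verify("Marie Curie was born in Paris.", "Marie Curie", "Warsaw")` passes the conceptual check but fails the correctness check, illustrating why the two levels identify different hallucination types.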
Demerits
Potential Bias in Knowledge Graph
The effectiveness of KGHaluBench is contingent on the comprehensiveness and accuracy of the underlying Knowledge Graph. Biases or gaps in the Knowledge Graph could lead to incomplete or misleading evaluations of LLM performance.
Scalability and Resource Intensity
The dynamic question construction and automated verification pipeline may require significant computational resources, potentially limiting the scalability of the benchmark for widespread use or real-time applications.
Generalizability of Findings
While the study evaluates 25 frontier models, the generalizability of the findings to other LLMs or specific domains may be limited, necessitating further validation across diverse models and applications.
Expert Commentary
KGHaluBench represents a significant advance in evaluating LLM truthfulness. By leveraging a Knowledge Graph to construct challenging questions dynamically and verifying responses automatically, the benchmark addresses critical limitations of existing evaluation frameworks. The assessment of 25 frontier models yields valuable insight into the knowledge factors that drive hallucinations, which is essential for building more reliable LLMs. As noted above, however, the benchmark's effectiveness depends on the quality and coverage of the underlying Knowledge Graph, and further validation across diverse models and domains is still needed. The practical implications are substantial: the benchmark directly supports ongoing efforts to mitigate hallucinations, and its insights could inform regulatory frameworks governing the responsible deployment of LLMs in critical domains.
Recommendations
- ✓ Further validation of KGHaluBench across a broader range of LLMs and specific domains to ensure the generalizability of the findings.
- ✓ Exploration of methods to enhance the scalability and efficiency of the dynamic question construction and automated verification pipeline to facilitate widespread adoption.