Attention Head Entropy of LLMs Predicts Answer Correctness

arXiv:2602.13699v1 Announce Type: new Abstract: Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations using model internals, focusing on the localization of the attention mass, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out of domain? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically measuring the spread of the attention mass. Using sparse logistic regression on per-head 2-Rényi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better out of domain, outperforming the closest baseline by +8.5% AUROC on average. We further show that attention patterns over the question/context alone, before answer generation, already carry predictive signal: in this setting Head Entropy gains on average +17.7% AUROC over the closest baseline. We evaluate across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.
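The feature pipeline the abstract describes — one 2-Rényi entropy per attention head, fed to a sparse logistic regression — can be sketched roughly as follows. This is a minimal NumPy sketch, not the paper's implementation: the tensor shape and the averaging over query positions are assumptions, since the exact pooling step is not specified in the abstract.

```python
import numpy as np

def renyi2_entropy_per_head(attn):
    """Per-head 2-Rényi entropy of attention distributions.

    attn: array of shape (n_heads, n_queries, n_keys) holding attention
    probabilities (each key row sums to 1). Averaging over query
    positions is an assumption about the pooling step.
    """
    collision = np.sum(attn ** 2, axis=-1)   # sum_k p_k^2, shape (n_heads, n_queries)
    h2 = -np.log(collision + 1e-12)          # 2-Rényi entropy per query position
    return h2.mean(axis=-1)                  # one scalar feature per head

# A spread-out (uniform) head has higher 2-Rényi entropy than a peaked one.
attn = np.array([
    [[0.25, 0.25, 0.25, 0.25]],   # head 0: uniform over 4 keys
    [[0.97, 0.01, 0.01, 0.01]],   # head 1: sharply peaked
])
features = renyi2_entropy_per_head(attn)
# features[0] is close to log(4) and exceeds features[1]
```

In practice one such feature would be extracted for every (layer, head) pair of the model, giving a feature vector of size n_layers × n_heads per example; an L1-regularized logistic regression (e.g. scikit-learn's `LogisticRegression(penalty="l1")`) is one off-the-shelf way to fit the sparse classifier over those features.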

Executive Summary

The article 'Attention Head Entropy of LLMs Predicts Answer Correctness' introduces Head Entropy, a method for predicting the correctness of answers generated by large language models (LLMs). By analyzing attention entropy patterns, specifically the spread of the attention mass, the study shows that Head Entropy matches or exceeds existing baselines in-distribution and beats the closest baseline out of domain by +8.5% AUROC on average. The method is evaluated across five instruction-tuned LLMs and three question-answering datasets spanning general knowledge, multi-hop reasoning, and medicine. The study also shows that attention patterns over the question and context alone, before answer generation, already carry predictive signal, further enhancing the method's utility.

Key Points

  • Introduction of Head Entropy as a method to predict answer correctness in LLMs.
  • Head Entropy matches or exceeds baselines in-distribution and outperforms them out of domain by +8.5% AUROC on average.
  • Attention patterns over the question and context alone, before answer generation, already carry predictive signal (+17.7% AUROC over the closest baseline).
  • Evaluation across five LLMs and three QA datasets demonstrates robustness and generalizability.

Merits

Innovative Methodology

The introduction of Head Entropy as a method to predict answer correctness is a significant advancement in the field of LLM evaluation. It provides a novel approach to understanding and improving the reliability of LLM outputs.

Strong Empirical Evidence

The study provides robust empirical evidence supporting the effectiveness of Head Entropy. The method's performance is thoroughly evaluated across multiple datasets and models, demonstrating its generalizability and robustness.

Practical Applications

The findings have practical applications in safety-critical settings such as medicine, where the accuracy of LLM outputs is paramount. The ability to predict answer correctness can enhance the reliability and safety of LLM applications.

Demerits

Limited Scope of Evaluation

While the study evaluates Head Entropy across five instruction-tuned LLMs and three QA datasets, this remains a narrow slice of possible models and tasks. Further research is needed to assess its performance on other architectures, languages, and domains before its generalizability can be taken for granted.

Potential Computational Overhead

The method requires storing or recomputing attention weights for every layer and head in order to compute the entropy features, which adds memory and compute overhead at inference time. The practical feasibility of deploying Head Entropy in latency-sensitive, real-world applications needs to be further explored.

Dependence on Model Internals

The method relies on white-box access to model internals (attention weights), which is infeasible for closed, API-only models and may be undesirable in some deployments. The study acknowledges this limitation and suggests that further research is needed to address it.

Expert Commentary

The article presents a significant advancement in the field of LLM evaluation by introducing Head Entropy as a method to predict answer correctness. The study's rigorous empirical evaluation across multiple datasets and models demonstrates the method's robustness and generalizability. The findings have important implications for safety-critical applications of LLMs, such as in medicine, where the accuracy of outputs is paramount. However, the study also acknowledges several limitations, including the potential computational overhead and dependence on model internals. These limitations highlight the need for further research to address these issues and ensure the practical feasibility of the method. Overall, the article makes a valuable contribution to the field and provides important insights for both researchers and practitioners.

Recommendations

  • Further research should explore the generalizability of Head Entropy across a broader range of datasets and models to ensure its robustness and reliability.
  • Future studies should investigate the computational overhead associated with Head Entropy and develop methods to mitigate any potential performance impacts.
