
Hypothesis Class Determines Explanation: Why Accurate Models Disagree on Feature Attribution


Thackshanaramana B

arXiv:2603.15821v1 (Announce Type: new)

Abstract: The assumption that prediction-equivalent models produce equivalent explanations underlies many practices in explainable AI, including model selection, auditing, and regulatory evaluation. In this work, we show that this assumption does not hold. Through a large-scale empirical study across 24 datasets and multiple model classes, we find that models with identical predictive behavior can produce substantially different feature attributions. This disagreement is highly structured: models within the same hypothesis class exhibit strong agreement, while cross-class pairs (e.g., tree-based vs. linear) trained on identical data splits show substantially reduced agreement, consistently near or below the lottery threshold. We identify hypothesis class as the structural driver of this phenomenon, which we term the Explanation Lottery. We theoretically show that the resulting Agreement Gap persists under interaction structure in the data-generating process. This structural finding motivates a post-hoc diagnostic, the Explanation Reliability Score R(x), which predicts when explanations are stable across architectures without additional training. Our results demonstrate that model selection is not explanation-neutral: the hypothesis class chosen for deployment can determine which features are attributed responsibility for a decision.

Executive Summary

This article challenges the conventional assumption in explainable AI that prediction-equivalent models produce equivalent explanations. Through a large-scale empirical study, the authors demonstrate that models within the same hypothesis class exhibit strong agreement in feature attributions, while models from different hypothesis classes, such as tree-based and linear, show substantially reduced agreement. The authors term this phenomenon the Explanation Lottery and identify hypothesis class as the structural driver. They also develop a post-hoc diagnostic, the Explanation Reliability Score R(x), to predict when explanations are stable across architectures. The study highlights the importance of considering hypothesis class in model selection and evaluation, as it can determine which features are attributed responsibility for a decision.

Key Points

  • The assumption that prediction-equivalent models produce equivalent explanations is challenged
  • Models within the same hypothesis class exhibit strong agreement in feature attributions
  • Hypothesis class is identified as the structural driver of the Explanation Lottery
  • A post-hoc diagnostic, the Explanation Reliability Score R(x), is developed to predict explanation stability
  • Model selection is therefore not explanation-neutral: the chosen hypothesis class can determine which features are attributed responsibility for a decision
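To make the cross-class disagreement concrete, here is a minimal sketch of how attribution agreement between two models might be measured. The paper does not specify its exact metric; top-k overlap is a common choice in the explanation-disagreement literature, and the attribution vectors below are invented for illustration.

```python
import numpy as np

def topk_agreement(attr_a, attr_b, k=3):
    """Fraction of overlap between the top-k most important features
    (by absolute attribution) of two attribution vectors."""
    top_a = set(np.argsort(-np.abs(attr_a))[:k])
    top_b = set(np.argsort(-np.abs(attr_b))[:k])
    return len(top_a & top_b) / k

# Hypothetical per-instance attributions from two model classes
# trained on the same data split (values are made up).
tree_attr   = np.array([0.40, 0.25, 0.10, 0.05, 0.02, 0.01, 0.00, 0.00])
linear_attr = np.array([0.05, 0.30, 0.35, 0.01, 0.00, 0.20, 0.02, 0.00])

print(round(topk_agreement(tree_attr, linear_attr, k=3), 2))
```

In this toy case the two models agree on only two of their three most important features, even though both could fit the data equally well, which is the kind of structured disagreement the paper reports for cross-class pairs.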

Merits

Strength of Empirical Study

The authors conduct a large-scale empirical study across 24 datasets and multiple model classes, providing robust evidence for the Explanation Lottery.

Theoretical Insights

The authors provide theoretical insights into the persistence of the Agreement Gap under interaction structure in the data-generating process.

Post-hoc Diagnostic Development

The authors develop a post-hoc diagnostic, the Explanation Reliability Score R(x), to predict when explanations are stable across architectures.
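The paper does not give the definition of R(x) in the portion summarized here. One plausible stand-in, purely for intuition, is the mean pairwise top-k agreement of an instance's attributions across several model classes: a high score means the explanation for x is stable regardless of which architecture was deployed. The function below is a hypothetical proxy, not the authors' formula.

```python
import numpy as np
from itertools import combinations

def reliability_proxy(attributions, k=2):
    """Hypothetical proxy for an explanation reliability score:
    mean pairwise top-k agreement of one instance's attribution
    vectors across several models. Not the paper's actual R(x)."""
    def topk(a):
        return set(np.argsort(-np.abs(a))[:k])
    pairs = list(combinations(attributions, 2))
    return sum(len(topk(a) & topk(b)) / k for a, b in pairs) / len(pairs)

# Attributions for one instance from three models (invented values):
# the first two agree on the top features, the third does not.
attrs = [
    np.array([0.50, 0.30, 0.10, 0.00]),
    np.array([0.40, 0.35, 0.05, 0.00]),
    np.array([0.00, 0.10, 0.50, 0.40]),
]

print(round(reliability_proxy(attrs, k=2), 2))
```

An instance scoring near 1.0 under such a measure would have architecture-stable explanations, while a low score flags exactly the cases where the Explanation Lottery matters.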

Demerits

Limited Generalizability

The study's focus on a specific set of datasets and model classes may limit the generalizability of the findings to other domains and applications.

Need for Further Investigation

The Explanation Lottery phenomenon requires further investigation to fully understand its implications and potential mitigation strategies.

Expert Commentary

The article presents a significant challenge to the conventional approach in explainable AI, highlighting the importance of considering hypothesis class in model selection and evaluation. The authors' empirical study and theoretical insights provide a robust foundation for understanding the Explanation Lottery phenomenon. However, the study's limitations and the need for further investigation underscore the complexity of this issue. The development of post-hoc diagnostics, such as the Explanation Reliability Score R(x), offers a promising direction for mitigating the effects of the Explanation Lottery. Ultimately, this study highlights the need for a more nuanced understanding of model interpretability and the potential consequences of hypothesis class selection.

Recommendations

  • Future research should investigate the Explanation Lottery phenomenon in different domains and applications to assess its generalizability.
  • Developing more advanced post-hoc diagnostics and mitigation strategies is essential to address the potential consequences of hypothesis class selection.
