Hypothesis Class Determines Explanation: Why Accurate Models Disagree on Feature Attribution
arXiv:2603.15821v1
Abstract: The assumption that prediction-equivalent models produce equivalent explanations underlies many practices in explainable AI, including model selection, auditing, and regulatory evaluation. In this work, we show that this assumption does not hold. Through a large-scale empirical study across 24 datasets and multiple model classes, we find that models with identical predictive behavior can produce substantially different feature attributions. This disagreement is highly structured: models within the same hypothesis class exhibit strong agreement, while cross-class pairs (e.g., tree-based vs. linear) trained on identical data splits show substantially reduced agreement, consistently near or below the lottery threshold. We identify hypothesis class as the structural driver of this phenomenon, which we term the Explanation Lottery. We theoretically show that the resulting Agreement Gap persists under interaction structure in the data-generating process. This structural finding motivates a post-hoc diagnostic, the Explanation Reliability Score R(x), which predicts when explanations are stable across architectures without additional training. Our results demonstrate that model selection is not explanation-neutral: the hypothesis class chosen for deployment can determine which features are attributed responsibility for a decision.
Executive Summary
This article challenges a conventional assumption in explainable AI: that prediction-equivalent models produce equivalent explanations. Through a large-scale empirical study spanning 24 datasets and multiple model classes, the authors show that models within the same hypothesis class exhibit strong agreement in feature attributions, while models from different hypothesis classes (e.g., tree-based vs. linear) trained on identical data splits show substantially reduced agreement. The authors term this phenomenon the Explanation Lottery and identify hypothesis class as its structural driver. They also develop a post-hoc diagnostic, the Explanation Reliability Score R(x), which predicts when explanations are stable across architectures without additional training. The upshot is that model selection is not explanation-neutral: the hypothesis class chosen for deployment can determine which features are attributed responsibility for a decision.
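To make the core experiment concrete, here is a minimal sketch. The paper's attribution method and agreement metric are not specified in the abstract, so permutation importance and Spearman rank correlation are used as stand-ins, on a synthetic dataset: train comparably accurate models from two hypothesis classes on identical splits, then compare their attributions.

```python
# Minimal sketch of a cross-class attribution-agreement experiment.
# Assumptions: permutation importance and Spearman rank correlation stand in
# for the paper's (unspecified) attribution method and agreement metric;
# the dataset is synthetic.
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two models from different hypothesis classes, trained on identical splits.
tree = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Check prediction comparability first; attribution disagreement is only
# interesting between models of similar accuracy.
print("accuracy (tree):  ", tree.score(X_te, y_te))
print("accuracy (linear):", linear.score(X_te, y_te))

# Global feature attributions via a model-agnostic method.
imp_tree = permutation_importance(tree, X_te, y_te, n_repeats=20,
                                  random_state=0).importances_mean
imp_lin = permutation_importance(linear, X_te, y_te, n_repeats=20,
                                 random_state=0).importances_mean

# Rank agreement between the two attribution vectors: low cross-class values
# for accurate model pairs are the pattern the paper calls the
# Explanation Lottery.
rho, _ = spearmanr(imp_tree, imp_lin)
print(f"cross-class attribution rank agreement: {rho:.2f}")
```

Permutation importance is chosen here only because it applies uniformly to both model classes; any model-agnostic attribution method would serve the same purpose in this sketch.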
Key Points
- ▸ The assumption that prediction-equivalent models produce equivalent explanations is shown empirically not to hold
- ▸ Models within the same hypothesis class agree strongly on feature attributions, while cross-class pairs trained on identical splits show substantially reduced agreement
- ▸ Hypothesis class is identified as the structural driver of the Explanation Lottery
- ▸ A post-hoc diagnostic, the Explanation Reliability Score R(x), is developed to predict explanation stability
- ▸ Model selection is not explanation-neutral: the deployed hypothesis class can determine which features are held responsible for a decision
Merits
Strength of Empirical Study
The authors conduct a large-scale empirical study across 24 datasets and multiple model classes, providing robust evidence for the Explanation Lottery.
Theoretical Insights
The authors theoretically show that the Agreement Gap persists under interaction structure in the data-generating process, indicating that the disagreement is structural rather than an artifact of training noise or finite samples.
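A minimal worked example (ours, not the paper's construction) of why interaction structure forces such a gap: consider a pure-interaction target that the linear class cannot represent at all.

```latex
% Illustrative example, not the paper's construction: a pure-interaction
% target on which the best linear model attributes nothing to any feature.
\[
  x_1, x_2 \sim \mathrm{Uniform}\{-1, +1\} \text{ independent}, \qquad y = x_1 x_2 .
\]
\[
  (\beta_0^\ast, \beta_1^\ast, \beta_2^\ast)
  = \arg\min_{\beta}\; \mathbb{E}\!\left[(x_1 x_2 - \beta_0 - \beta_1 x_1 - \beta_2 x_2)^2\right]
  = (0, 0, 0),
\]
% since E[x_1 x_2] = E[x_1^2 x_2] = E[x_1 x_2^2] = 0 by independence and
% zero means. The risk-minimizing linear model is identically zero and
% attributes zero importance to both features, while a depth-2 tree
% represents y exactly and attributes nonzero importance to x_1 and x_2.
% No amount of training data closes this gap.
```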
Post-hoc Diagnostic Development
The authors develop a post-hoc diagnostic, the Explanation Reliability Score R(x), to predict when explanations are stable across architectures.
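The abstract does not define R(x), so the following is a hypothetical stand-in only: it scores how consistently a set of already-trained reference models rank features for a single instance x, with high scores indicating the explanation is stable across architectures. The function name, inputs, and aggregation rule are all illustrative assumptions, not the paper's construction.

```python
# Hypothetical stand-in for the Explanation Reliability Score R(x); the
# abstract does not give its definition, so this sketch simply measures how
# consistently pre-trained reference models rank features at one instance x.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def reliability_score(attributions: list[np.ndarray]) -> float:
    """Mean pairwise Spearman correlation of per-instance attribution
    vectors, one vector per reference model. Values near 1.0 suggest the
    explanation at x is stable across architectures; low values flag
    lottery instances where the hypothesis class drives the attribution.
    """
    rhos = [spearmanr(a, b)[0] for a, b in combinations(attributions, 2)]
    return float(np.mean(rhos))


# Usage with made-up attribution vectors for a single instance x:
attr_tree = np.array([0.40, 0.05, 0.30, 0.25])
attr_linear = np.array([0.10, 0.45, 0.05, 0.40])
attr_mlp = np.array([0.35, 0.10, 0.30, 0.25])
print(f"R(x) = {reliability_score([attr_tree, attr_linear, attr_mlp]):.2f}")
```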
Demerits
Limited Generalizability
The study's focus on a specific set of datasets and model classes may limit the generalizability of the findings to other domains and applications.
Need for Further Investigation
The Explanation Lottery phenomenon requires further investigation to fully understand its implications and potential mitigation strategies.
Expert Commentary
The article mounts a significant challenge to a conventional assumption in explainable AI, showing that the hypothesis class, not just predictive performance, shapes what a model's explanations say. The empirical study and accompanying theory provide a solid foundation for understanding the Explanation Lottery, though the acknowledged limitations show that the phenomenon is not yet fully characterized. Post-hoc diagnostics such as the Explanation Reliability Score R(x) offer a promising direction for detecting, if not yet mitigating, unstable explanations. Ultimately, the work calls for a more nuanced treatment of model interpretability: choosing a hypothesis class is also, implicitly, choosing which features will be held responsible for a decision.
Recommendations
- ✓ Future research should investigate the Explanation Lottery phenomenon in different domains and applications to assess its generalizability.
- ✓ Developing more advanced post-hoc diagnostics and mitigation strategies is essential to address the potential consequences of hypothesis class selection.