Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation

Xuan Yang, Hsi-Wen Chen, Ming-Syan Chen, Jian Pei

arXiv:2603.03672v1

Abstract: The Shapley value provides a principled foundation for data valuation, but exact computation is #P-hard due to the exponential coalition space. Existing accelerations remain global and ignore a structural property of modern predictors: for a given test instance, only a small subset of training points influences the prediction. We formalize this model-induced locality through support sets defined by the model's computational pathway (e.g., neighbors in KNN, leaves in trees, receptive fields in GNNs), showing that Shapley computation can be projected onto these supports without loss when locality is exact. This reframes Shapley evaluation as a structured data processing problem over overlapping support-induced subset families rather than exhaustive coalition enumeration. We prove that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets, establishing an information-theoretic lower bound on retraining operations. Guided by this result, we propose LSMR (Local Shapley via Model Reuse), an optimal subset-centric algorithm that trains each influential subset exactly once via support mapping and pivot scheduling. For larger supports, we develop LSMR-A, a reuse-aware Monte Carlo estimator that remains unbiased with exponential concentration, with runtime determined by the number of distinct sampled subsets rather than total draws. Experiments across multiple model families demonstrate substantial retraining reductions and speedups while preserving high valuation fidelity.
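To make the "projection without loss" claim concrete, here is a minimal Python sketch. It is not the paper's code. It assumes a toy fixed-radius majority-vote classifier (chosen here because its locality is exact by construction), hypothetical 1-D data, and a 0/1 correctness utility. The names `shapley`, `utility`, `train`, and `support` are illustrative only. The sketch computes exact Shapley values once over all training points and once over the support set alone, so the two results can be compared.

```python
from itertools import combinations
from math import factorial


def shapley(players, utility):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    players = list(players)
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            # Standard Shapley weight for coalitions of size r.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for combo in combinations(others, r):
                s = frozenset(combo)
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


# Hypothetical toy data: training index -> (1-D feature, binary label).
train = {0: (0.1, 1), 1: (0.3, 1), 2: (0.4, 0), 3: (2.5, 0), 4: (3.0, 1), 5: (4.2, 0)}
x_test, y_test, radius = 0.25, 1, 0.5

# Support set (assumed definition): the only training points the
# fixed-radius vote can ever consult for this test instance.
support = frozenset(i for i, (x, _) in train.items() if abs(x - x_test) <= radius)


def utility(coalition):
    """1.0 if the majority vote of in-radius coalition members predicts y_test."""
    votes = [train[i][1] for i in coalition & support]
    if not votes:
        return 0.0  # no neighbor in range: count as a wrong prediction
    pred = 1 if 2 * sum(votes) > len(votes) else 0  # ties fall back to class 0
    return 1.0 if pred == y_test else 0.0


full = shapley(train.keys(), utility)  # enumerates coalitions of all 6 points
local = shapley(support, utility)      # enumerates coalitions of the support only
```

In this toy setting the points outside the support are null players, so their full-game values are zero and the support members receive the same values in both games (2/3, 2/3, and -1/3 for points 0, 1, and 2). Whether a given real model has exact locality in this sense is a property the paper formalizes; the sketch only illustrates the consequence.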

Executive Summary

The article introduces Local Shapley, an approach to data valuation that exploits model-induced locality: for a given test instance, only a small subset of training points influences the prediction. Shapley computation is projected onto support sets defined by the model's computational pathway, and the projection is lossless when locality is exact. The exact algorithm, LSMR, trains each influential subset exactly once. For larger supports, the Monte Carlo estimator LSMR-A remains unbiased with exponential concentration, and its runtime depends on the number of distinct sampled subsets rather than the total number of draws. Experiments across multiple model families report substantial retraining reductions and speedups while preserving high valuation fidelity.
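The point that runtime can track distinct sampled subsets rather than total draws can be sketched with ordinary permutation sampling plus a cache. This is the standard permutation-sampling Shapley estimator with memoization, not the LSMR-A estimator itself, whose details the abstract does not give. The function `mc_shapley_with_reuse`, the `weights` table, and `toy_utility` are hypothetical stand-ins, with `toy_utility` playing the role of an expensive retrain-and-evaluate step.

```python
import random


def mc_shapley_with_reuse(players, utility, num_permutations, seed=0):
    """Permutation-sampling Shapley estimate that evaluates each subset once."""
    rng = random.Random(seed)
    players = list(players)
    cache = {}   # frozenset of players -> utility value
    calls = 0    # total utility requests, including cache hits

    def cached_utility(coalition):
        nonlocal calls
        calls += 1
        if coalition not in cache:
            cache[coalition] = utility(coalition)  # the only "expensive" step
        return cache[coalition]

    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = players[:]
        rng.shuffle(order)
        prefix = frozenset()
        prev = cached_utility(prefix)
        for p in order:
            prefix = prefix | {p}
            cur = cached_utility(prefix)
            totals[p] += cur - prev  # marginal contribution of p in this order
            prev = cur
    estimates = {p: t / num_permutations for p, t in totals.items()}
    return estimates, calls, len(cache)


# Hypothetical support of four training points with a non-additive utility.
weights = {"a": 0.5, "b": 0.3, "c": 0.2, "d": 0.1}


def toy_utility(coalition):
    """Stand-in for retraining on `coalition` and scoring a test instance."""
    return sum(weights[p] for p in coalition) ** 0.5


estimates, calls, distinct = mc_shapley_with_reuse(weights, toy_utility, 200)
```

With 200 permutations over four players the loop issues 1,000 utility requests, but at most 16 distinct subsets exist, so at most 16 evaluations are ever performed. On a small support the cache saturates quickly; the benefit in a realistic setting depends on how often sampled subsets repeat.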

Key Points

  • Local Shapley leverages model-induced locality to efficiently compute Shapley values
  • Support sets are defined by the model's computational pathway, such as neighbors in KNN or leaves in trees
  • The proposed algorithm, LSMR, trains each influential subset exactly once via support mapping and pivot scheduling
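The "train each influential subset exactly once" idea can be illustrated with a small cache keyed by subset. The sketch below is only a memoization example under assumed inputs: it does not implement the paper's support mapping or pivot scheduling, and `supports`, `train_once`, and the placeholder "model" strings are hypothetical. It counts how many training requests arise when every subset of every test point's support is enumerated, against how many distinct subsets actually need a model.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


# Hypothetical supports: test-point id -> indices of influential training points.
supports = {"t1": {0, 1, 2}, "t2": {1, 2, 3}, "t3": {0, 1, 2}}

trained = {}               # frozenset of training indices -> fitted "model"
stats = {"requests": 0}    # how many times a model was asked for


def train_once(subset):
    """Return the model for `subset`, fitting it only on first request."""
    stats["requests"] += 1
    if subset not in trained:
        # Placeholder for a real fit; a real model would be trained here.
        trained[subset] = "model" + str(sorted(subset))
    return trained[subset]


for test_id, support in supports.items():
    for s in subsets(support):
        train_once(s)

requests, distinct = stats["requests"], len(trained)
```

Here the three test points generate 24 requests but only 12 distinct subsets, because `t1` and `t3` share a support and `t2` overlaps both on the subsets of {1, 2}. A model fitted on a subset does not depend on the test point, which is why it can be shared; only the per-test-point evaluation differs.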

Merits

Efficient Computation

Projecting Shapley computation onto support sets replaces exhaustive coalition enumeration over the whole training set with work governed by the number of distinct influential subsets.

High Valuation Fidelity

The reported experiments preserve high valuation fidelity despite substantial retraining reductions and speedups.

Demerits

Limited Applicability

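The projection is lossless only when locality is exact, and the abstract gives KNN, trees, and GNNs as examples of models with well-defined support sets. How well the approach carries over to other model families and datasets needs further study.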

Expert Commentary

The article makes a notable contribution to data valuation by showing that model-induced locality can make Shapley value computation far cheaper while preserving valuation fidelity. LSMR is presented as optimal in the sense of training each influential subset exactly once, matching the stated lower bound on retraining operations, and LSMR-A extends the approach to larger supports through unbiased Monte Carlo estimation. Further research is needed to establish how broadly Local Shapley applies and to explore its implications for explainability, transparency, and data valuation.

Recommendations

  • Future research should investigate the applicability of Local Shapley to various model families and datasets
  • Regulations and standards governing data valuation and pricing should take into account the finding that model-induced locality can make Shapley-based valuation cheaper to compute while preserving fidelity
