
All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting


Zeyu Zhang, Ryan Chen, Bradly C. Stadie

arXiv:2602.17234v1 Announce Type: new Abstract: To evaluate whether LLMs can accurately predict future events, we need the ability to backtest them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this temporal knowledge leakage. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies Shapley values to measure each claim's contribution to the prediction. This yields the Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose Time-Supervised Prediction with Extracted Claims (TimeSPEC), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination, producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.

Executive Summary

This article presents a novel approach to detecting and quantifying temporal knowledge leakage in large language models (LLMs) during backtesting. The authors propose a claim-level framework that decomposes model rationales into atomic claims and categorizes them by temporal verifiability. They introduce the Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR), an interpretable metric that captures the fraction of decision-driving reasoning derived from leaked information. The authors also propose Time-Supervised Prediction with Extracted Claims (TimeSPEC), a method that interleaves generation with claim verification and regeneration to proactively filter temporal contamination. Experiments on 350 instances across three tasks (U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking) demonstrate substantial leakage in standard prompting baselines and show that TimeSPEC reduces Shapley-DCLR while preserving task performance.
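The Shapley-weighted attribution at the heart of Shapley-DCLR can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the value function below is a toy stand-in for the model's prediction score under a subset of claims, and the leaked-claim mask is assumed to come from a separate temporal verifier.

```python
from itertools import combinations
from math import factorial

def shapley_values(claims, value_fn):
    """Exact Shapley value of each claim's contribution to value_fn,
    enumerating every coalition of the other claims (feasible for the
    handful of atomic claims a single rationale typically yields)."""
    n = len(claims)
    phi = [0.0] * n
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                members = [claims[j] for j in subset]
                # Marginal contribution of claim i to this coalition.
                phi[i] += weight * (value_fn(members + [claims[i]]) - value_fn(members))
    return phi

def shapley_dclr(claims, leaked, value_fn):
    """Fraction of total decision-driving attribution carried by claims
    flagged as leaked (`leaked` is a boolean mask, one entry per claim)."""
    phi = shapley_values(claims, value_fn)
    total = sum(abs(p) for p in phi)
    return 0.0 if total == 0 else sum(
        abs(p) for p, is_leak in zip(phi, leaked) if is_leak) / total
```

For an additive value function each claim's Shapley value is simply its own weight, which makes the sketch easy to sanity-check; in the paper the value function is the model's prediction under different claim subsets.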

Key Points

  • The authors propose a claim-level framework for detecting and quantifying temporal knowledge leakage in LLMs.
  • The framework introduces the Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR) for measuring leakage.
  • Time-Supervised Prediction with Extracted Claims (TimeSPEC) is proposed to proactively filter temporal contamination.
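The verify-and-regenerate loop behind TimeSPEC (last bullet above) can be sketched as follows. `generate`, `extract_claims`, and `verify_claim` are hypothetical stand-ins for the paper's LLM-backed components; the real system would call a model for each step.

```python
def timespec_predict(generate, extract_claims, verify_claim, cutoff, max_rounds=3):
    """Interleave generation with claim-level temporal verification:
    claims that cannot be traced to a pre-cutoff source are fed back
    as constraints and the rationale is regenerated."""
    flagged = []
    prediction, claims = None, []
    for _ in range(max_rounds):
        prediction, rationale = generate(flagged)
        claims = extract_claims(rationale)
        flagged = [c for c in claims if not verify_claim(c, cutoff)]
        if not flagged:
            break  # every supporting claim is pre-cutoff verifiable
    return prediction, claims
```

In a toy run, if the first rationale cites a post-cutoff fact, the second round drops it and the loop terminates with only pre-cutoff claims supporting the prediction.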

Merits

Improved interpretability

The claim-level framework and Shapley-DCLR provide a deeper understanding of the decision-making process in LLMs, enabling more accurate evaluation of their performance.

Enhanced reliability

TimeSPEC's proactive filtering of temporal contamination ensures that predictions are based on information available before the cutoff date, increasing the reliability of LLMs in backtesting.

Task performance preservation

The authors demonstrate that TimeSPEC can preserve task performance while reducing Shapley-DCLR, highlighting its potential for practical applications.

Demerits

Limited scope

The evaluation covers only three tasks (Supreme Court case prediction, NBA salary estimation, and stock return ranking) across 350 instances, so it remains unclear how well the framework generalizes to other domains and applications.

Computational complexity

Exact Shapley attribution requires evaluating many claim subsets per rationale, and TimeSPEC's verify-and-regenerate loop adds further inference cost, potentially hindering adoption in resource-constrained environments.

Dependence on verification sources

The effectiveness of TimeSPEC depends on claims being traceable to dated, pre-cutoff sources; where such sources are sparse, undated, or unavailable, claim verification may be unreliable in real-world scenarios.

Expert Commentary

While the article presents a novel and promising approach to detecting temporal knowledge leakage in LLMs, its limited task coverage and computational cost should not be overlooked. Future research should extend the claim-level framework and TimeSPEC to additional domains and explore ways to reduce their reliance on external verification sources and repeated model calls. The implications are significant wherever trustworthy retrospective evaluation is essential, and this work should contribute to ongoing discussions on explainability and accountability in AI.

Recommendations

  • Researchers should explore the application of the claim-level framework and TimeSPEC to other domains and tasks to broaden their impact.
  • Developers should prioritize the implementation of TimeSPEC in real-world scenarios to assess its practicality and limitations.
