All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting
arXiv:2602.17234v1 (new) — Abstract: To evaluate whether LLMs can accurately predict future events, we need the ability to *backtest* them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this *temporal knowledge leakage*. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies *Shapley values* to measure each claim's contribution to the prediction. This yields the **Shapley**-weighted **D**ecision-**C**ritical **L**eakage **R**ate (**Shapley-DCLR**), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose **Time**-**S**upervised **P**rediction with **E**xtracted **C**laims (**TimeSPEC**), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination -- producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.
Executive Summary
This article presents a novel approach to detecting and quantifying temporal knowledge leakage in large language models (LLMs) during backtesting. The authors propose a claim-level framework that decomposes model rationales into atomic claims and categorizes them by temporal verifiability. They introduce the Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR), an interpretable metric that captures the fraction of decision-driving reasoning derived from leaked information. The authors also propose Time-Supervised Prediction with Extracted Claims (TimeSPEC), a method that interleaves generation with claim verification and regeneration to proactively filter temporal contamination. Experiments on three tasks demonstrate substantial leakage in standard prompting baselines and the effectiveness of TimeSPEC in reducing Shapley-DCLR while preserving task performance.
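To make the metric concrete, the Shapley-weighted leakage rate can be sketched as follows. This is a plausible reading of Shapley-DCLR, not the paper's exact formula: `value_fn` is a hypothetical stand-in for re-scoring the model's prediction given only a subset of claims, and the metric is taken here as the share of total positive Shapley contribution coming from claims flagged as leaked. Exact Shapley enumeration is exponential in the number of claims, so this is only practical for small claim sets.

```python
from itertools import combinations
from math import factorial

def shapley_values(claims, value_fn):
    """Exact Shapley value of each claim's contribution to the prediction.

    value_fn(subset_of_claims) -> float is a hypothetical scorer standing in
    for re-querying the model with only those claims and scoring the result.
    """
    n = len(claims)
    phi = {c: 0.0 for c in claims}
    for c in claims:
        others = [x for x in claims if x != c]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Standard Shapley weight |S|! (n-|S|-1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[c] += weight * (value_fn(set(S) | {c}) - value_fn(set(S)))
    return phi

def shapley_dclr(claims, leaked, value_fn):
    """Fraction of total positive claim contribution attributable to claims
    flagged as temporally leaked (an assumed formulation, for illustration)."""
    phi = shapley_values(claims, value_fn)
    pos = {c: max(v, 0.0) for c, v in phi.items()}
    total = sum(pos.values())
    if total == 0.0:
        return 0.0
    return sum(pos[c] for c in claims if c in leaked) / total
```

For an additive scorer the Shapley value of each claim equals its individual weight, which gives a quick sanity check on the implementation.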
Key Points
- The authors propose a claim-level framework for detecting and quantifying temporal knowledge leakage in LLMs.
- The framework introduces the Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR) for measuring leakage.
- Time-Supervised Prediction with Extracted Claims (TimeSPEC) is proposed to proactively filter temporal contamination.
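The interleaved generate-verify-regenerate loop behind TimeSPEC can be sketched as below. This is a minimal illustration of the control flow described in the abstract, not the authors' implementation: `generate`, `extract_claims`, and `verify_before_cutoff` are hypothetical callables standing in for the LLM call, claim decomposition, and a retrieval check that a claim is attested by a pre-cutoff source.

```python
def timespec_predict(generate, extract_claims, verify_before_cutoff,
                     cutoff, max_rounds=3):
    """Generate-verify-regenerate loop in the spirit of TimeSPEC (sketch).

    Regenerates until every supporting claim verifies against a source
    available before `cutoff`, or until max_rounds is exhausted.
    """
    flagged = []
    prediction, rationale = None, None
    for _ in range(max_rounds):
        # Hypothetical LLM call; `flagged` tells it which claims to avoid.
        prediction, rationale = generate(flagged)
        claims = extract_claims(rationale)
        flagged = [c for c in claims if not verify_before_cutoff(c, cutoff)]
        if not flagged:
            # Every supporting claim traces to a pre-cutoff source.
            return prediction, rationale
    return prediction, rationale  # best effort after max_rounds
```

The key design point, per the abstract, is that filtering happens at the claim level during generation rather than through prompt-based temporal constraints alone.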
Merits
Improved interpretability
The claim-level framework and Shapley-DCLR provide a deeper understanding of the decision-making process in LLMs, enabling more accurate evaluation of their performance.
Enhanced reliability
TimeSPEC's proactive filtering of temporal contamination ensures that predictions are based on information available before the cutoff date, increasing the reliability of LLMs in backtesting.
Task performance preservation
The authors demonstrate that TimeSPEC can preserve task performance while reducing Shapley-DCLR, highlighting its potential for practical applications.
Demerits
Limited scope
The article focuses on three specific tasks and may not generalize to other domains or applications, limiting its broader impact.
Computational complexity
The claim-level framework and TimeSPEC may require significant computational resources, potentially hindering their adoption in resource-constrained environments.
Dependence on verification sources
The effectiveness of TimeSPEC relies on the quality and availability of datable pre-cutoff sources against which claims can be verified, which may be scarce or unreliable in real-world scenarios.
Expert Commentary
While the article presents a novel and promising approach to detecting temporal knowledge leakage in LLMs, its limited task coverage and computational cost should not be overlooked. Future research should extend the claim-level framework and TimeSPEC to more domains and applications, and explore ways to reduce reliance on external verification sources. The implications are significant wherever trustworthiness and reliability are essential, and this work should contribute to ongoing discussions on explainability and accountability in AI.
Recommendations
- Researchers should explore the application of the claim-level framework and TimeSPEC to other domains and tasks to broaden their impact.
- Developers should prioritize the implementation of TimeSPEC in real-world scenarios to assess its practicality and limitations.