From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models
arXiv:2603.04828v1 Announce Type: new Abstract: Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word-frequency bias in corpora, and the latter depend strongly on the similarity of the fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses show gradient feature distribution differences, enabling practical and scalable pre-training data detection.
Executive Summary
This article proposes a novel method, GDS, for detecting pre-training data in large language models (LLMs) by analyzing gradient deviations during training. The authors observe that familiar samples exhibit systematic differences in gradient behavior, including smaller update magnitudes, distinct update locations, and more sharply activated neurons. GDS represents each sample using gradient profiles, revealing consistent distinctions between member and non-member data. Experiments on five public datasets demonstrate state-of-the-art performance and improved cross-dataset transferability. The method's interpretability allows for practical and scalable pre-training data detection. This approach addresses concerns about copyright and benchmark contamination, offering a promising solution for the LLM community.
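The abstract describes gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules. The paper does not give exact formulas, so the sketch below is only an illustrative guess at what such features might look like: overall gradient norm for magnitude, the FFN share of squared-gradient mass for location, and normalized entropy of squared gradients for concentration. Function and module names (`gradient_profile`, `"ffn"`, `"attn"`) are hypothetical.

```python
import numpy as np

def gradient_profile(grads):
    """Build a per-sample gradient profile from {module_name: gradient array}.

    Feature names follow the paper's description (magnitude, location,
    concentration); the exact formulas here are illustrative assumptions,
    not the authors' method.
    """
    flat = {name: g.ravel() for name, g in grads.items()}
    all_g = np.concatenate(list(flat.values()))

    # Magnitude: overall update size (familiar/member samples are
    # reported to produce smaller updates).
    magnitude = float(np.linalg.norm(all_g))

    # Location: share of squared-gradient mass landing in FFN modules
    # versus Attention modules.
    ffn_mass = sum(np.sum(g ** 2) for n, g in flat.items() if "ffn" in n)
    attn_mass = sum(np.sum(g ** 2) for n, g in flat.items() if "attn" in n)
    location = float(ffn_mass / (ffn_mass + attn_mass + 1e-12))

    # Concentration: normalized entropy of the squared-gradient
    # distribution; lower entropy means the update mass is concentrated
    # on a few parameters ("more sharply activated").
    p = all_g ** 2 / (np.sum(all_g ** 2) + 1e-12)
    concentration = float(-np.sum(p * np.log(p + 1e-12)) / np.log(len(p)))

    # These three features would then feed a lightweight binary
    # membership classifier, as the abstract describes.
    return np.array([magnitude, location, concentration])
```

For example, a gradient concentrated on a single FFN parameter yields near-zero normalized entropy, while a uniform gradient across all parameters yields entropy near 1, giving the classifier a separable signal.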
Key Points
- ▸ GDS method detects pre-training data by analyzing gradient deviations during training
- ▸ Familiar samples exhibit systematic differences in gradient behavior
- ▸ GDS achieves state-of-the-art performance and improved cross-dataset transferability
Merits
Strength in Addressing Copyright Concerns
The proposed method effectively addresses concerns about copyright infringement by identifying pre-training data.
Improved Interpretability
GDS offers practical and scalable pre-training data detection through its interpretable features.
Demerits
Dependence on Gradient Profiles
The method's effectiveness relies on accurate gradient profiles, which require white-box access to model parameters for gradient computation; this may be costly for very large models and infeasible for API-only deployments.
Potential Overfitting
The reliance on lightweight classifiers for binary membership inference may lead to overfitting, especially with small datasets.
Expert Commentary
The article presents a novel and promising approach to pre-training data detection in LLMs. By analyzing gradient deviations, the authors demonstrate a more effective and interpretable method than existing techniques. However, the method's dependence on gradient profiles and potential overfitting issues require further investigation. The implications of this work are significant, particularly in addressing copyright concerns and improving the integrity of LLMs. As the field continues to evolve, it is essential to develop and refine methods that ensure the responsible use of pre-trained models.
Recommendations
- ✓ Further investigation into the method's robustness and scalability is necessary to ensure its widespread adoption.
- ✓ The development of more robust gradient profiles and classifiers can enhance the method's effectiveness and reduce the risk of overfitting.