
Soft Contamination Means Benchmarks Test Shallow Generalization

arXiv:2602.12413v1 Announce Type: cross Abstract: If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching which fail to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of

arXiv:2602.12413v1 Announce Type: cross Abstract: If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching which fail to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.
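The failure mode the abstract describes can be illustrated with a toy example. The sketch below is not the paper's method: a bag-of-words cosine serves as a crude stand-in for a learned sentence embedding, and the 8-gram window and 0.4 threshold are illustrative assumptions. A paraphrased problem statement shares no long word n-gram with the original, so an n-gram filter passes it, yet even this crude similarity score flags it.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Lowercase word tokens, punctuation stripped.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_contaminated(candidate, test_item, n=8):
    # Typical decontamination filter: flag only if some word n-gram is shared.
    def grams(text):
        toks = tokens(text)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return bool(grams(candidate) & grams(test_item))

def similarity(a, b):
    # Crude stand-in for a learned sentence embedding: bag-of-words cosine.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

test_item = ("Given an array of n integers, find the length of the "
             "longest strictly increasing subsequence.")
paraphrase = ("For a list containing n integers, compute how long the "
              "longest strictly increasing subsequence is.")

print(ngram_contaminated(paraphrase, test_item))  # False: no shared 8-gram
print(similarity(paraphrase, test_item) > 0.4)    # True: flagged as a near-duplicate
```

A real pipeline would replace `similarity` with a sentence-embedding model, which would score such paraphrases far more reliably than word overlap; the point here is only that string-level filters and semantic filters can disagree on exactly the cases the paper calls soft contamination.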

Executive Summary

The article 'Soft Contamination Means Benchmarks Test Shallow Generalization' investigates how semantic duplicates of benchmark test data in training corpora bias the evaluation of large language models (LLMs). The study shows that standard n-gram decontamination fails to detect semantic duplicates, so benchmark scores give biased estimates of out-of-distribution generalization. Embedding the Olmo3 training corpus, the authors find soft contamination to be widespread: 78% of CodeForces problems have semantic duplicates and 50% of ZebraLogic problems have exact duplicates. Training on such duplicates improves benchmark performance, even on truly held-out items from the same benchmark, confounding recent benchmark gains: those gains reflect both genuine capability improvements and the accumulation of test data (and effective test data) in growing training corpora.

Key Points

  • Traditional decontamination methods fail to detect semantic duplicates, leading to biased benchmark performance estimates.
  • Soft contamination remains widespread: semantic duplicates exist in the training corpus for 78% of CodeForces problems, and exact duplicates for 50% of ZebraLogic problems.
  • Including semantic duplicates of benchmark data in training improves benchmark performance, even on truly held-out datapoints from the same benchmark, confounding recent gains in benchmark results.

Merits

Comprehensive Analysis

The article provides a thorough analysis of the impact of semantic duplicates on benchmark performance, using extensive experiments and embedding techniques.

Empirical Evidence

The study presents empirical evidence to support its claims, including specific percentages of contamination and performance improvements.

Relevance to Current Practices

The findings are highly relevant to current practices in LLM training and benchmarking, highlighting a significant issue in the field.

Demerits

Limited Scope

The study focuses primarily on semantic duplicates and does not address other potential sources of bias or contamination in training data.

Generalizability

The findings may not be generalizable to all types of benchmarks or training corpora, as the study focuses on specific datasets.

Methodological Limitations

The study relies on embedding techniques for detecting semantic duplicates, which may have their own limitations and biases.

Expert Commentary

The article 'Soft Contamination Means Benchmarks Test Shallow Generalization' presents a rigorous and well-reasoned analysis of the impact of semantic duplicates on the performance of large language models. The study's findings are particularly relevant in the current landscape of AI research, where benchmarking plays a crucial role in evaluating model capabilities. The identification of widespread soft contamination highlights a significant issue that has been largely overlooked in the field. The study's empirical evidence, including the specific percentages of contamination and performance improvements, adds substantial value to the discussion. However, the study's limitations, such as its focus on specific datasets and the reliance on embedding techniques, should be acknowledged. The implications of the findings are far-reaching, affecting both practical applications and policy considerations. The article's call for more robust decontamination methods and standardized guidelines is well-founded and aligns with the broader goals of ensuring the integrity and reliability of AI systems. Overall, the study makes a valuable contribution to the field and should be carefully considered by researchers, developers, and policymakers alike.

Recommendations

  • Develop and implement more sophisticated decontamination techniques that can detect semantic duplicates and other forms of contamination in training data.
  • Establish standardized guidelines for data decontamination and benchmarking to ensure the reliability and comparability of model evaluations.
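A decontamination pass along the lines of the first recommendation could layer a nearest-neighbour search over embedded items on top of exact and n-gram matching. The sketch below is a hypothetical outline, not a production pipeline: random vectors stand in for real sentence embeddings, and the 0.9 similarity threshold is an illustrative assumption that would need calibration against human-judged duplicates.

```python
import numpy as np

def build_index(train_embs):
    # Normalize rows so that plain dot products are cosine similarities.
    norms = np.linalg.norm(train_embs, axis=1, keepdims=True)
    return train_embs / np.clip(norms, 1e-12, None)

def flag_semantic_duplicates(index, test_embs, threshold=0.9):
    # For each benchmark item, report training rows whose cosine similarity
    # exceeds the threshold: candidate soft contamination to audit or remove.
    norms = np.linalg.norm(test_embs, axis=1, keepdims=True)
    test_norm = test_embs / np.clip(norms, 1e-12, None)
    sims = test_norm @ index.T  # (n_test, n_train) cosine-similarity matrix
    return [np.flatnonzero(row >= threshold) for row in sims]

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 64))  # stand-in for an embedded training corpus
# Two benchmark items that are slight perturbations of training rows 3 and 7,
# i.e. planted near-duplicates the filter should catch.
test = train[[3, 7]] + 0.01 * rng.normal(size=(2, 64))

index = build_index(train)
dups = flag_semantic_duplicates(index, test)
print([d.tolist() for d in dups])  # each item matches only its source row
```

At corpus scale, the brute-force similarity matrix would be replaced by an approximate nearest-neighbour index, but the flagging logic is the same.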
