Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

arXiv:2603.20957v1 Abstract: Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.

Executive Summary

This study dismantles a central defense in the AI industry's copyright infringement litigation by demonstrating that finetuning, a routine post-training customization step, can systematically extract verbatim copies of copyrighted books from leading large language models (LLMs) such as GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1. Using only semantic plot summaries as prompts, the authors recover up to 85-90% of held-out copyrighted texts, with single verbatim spans exceeding 460 words, and show that the vulnerability generalizes across authors and providers. The findings indicate that memorization of training data is latent and is reactivated by author-specific finetuning, contradicting industry assurances of data non-reproduction and undermining judicial assumptions about alignment safeguards in fair use analyses.

Key Points

  • Finetuning bypasses RLHF, system prompts, and output filters, enabling verbatim recall of copyrighted books from multiple providers’ models.
  • Extraction generalizes across unrelated authors and datasets—finetuning on Haruki Murakami’s novels unlocks recall of books by over 30 other authors, with inter-model agreement on memorized spans (r ≥ 0.90).
  • The phenomenon is contingent on finetuning on real text from specific authors; synthetic data finetuning yields negligible extraction, suggesting reactivation of pretraining memorization rather than new learning.
  • The results directly contradict industry claims of non-storage and cast doubt on the sufficiency of current alignment and filtering mechanisms as defenses against copyright infringement.
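The headline metrics in these findings (percent of a book reproduced, longest verbatim span) can be approximated with a simple word-level matching sketch. The function names, the 50-word default window, and the toy strings below are my own assumptions for illustration, not the paper's released evaluation code:

```python
def longest_verbatim_span(generated: str, reference: str) -> int:
    """Length, in words, of the longest contiguous word sequence that the
    generated text shares verbatim with the reference book
    (classic longest-common-substring DP over word tokens)."""
    g, r = generated.split(), reference.split()
    best = 0
    prev = [0] * (len(r) + 1)
    for gi in range(1, len(g) + 1):
        cur = [0] * (len(r) + 1)
        for ri in range(1, len(r) + 1):
            if g[gi - 1] == r[ri - 1]:
                cur[ri] = prev[ri - 1] + 1
                best = max(best, cur[ri])
        prev = cur
    return best


def verbatim_coverage(generated: str, reference: str, min_span: int = 50) -> float:
    """Fraction of reference words that fall inside a verbatim window of at
    least min_span words shared with the generated text -- a rough analogue
    of a 'percent of the book reproduced' figure."""
    g, r = generated.split(), reference.split()
    gen_windows = {tuple(g[i:i + min_span]) for i in range(len(g) - min_span + 1)}
    covered = [False] * len(r)
    for i in range(len(r) - min_span + 1):
        if tuple(r[i:i + min_span]) in gen_windows:
            covered[i:i + min_span] = [True] * min_span
    return sum(covered) / len(r) if r else 0.0
```

A real evaluation would additionally normalize punctuation and casing before tokenizing, but the two quantities measured are the same.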

Merits

Rigorous Experimental Design

The study employs a controlled, multi-model evaluation across leading LLMs, using a uniform plot-summary-to-full-text expansion finetuning task and rigorous statistical analysis of extraction fidelity, including inter-model correlation metrics.
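The plot-to-full-text expansion task amounts to supervised pairs of (summary, verbatim passage). The sketch below uses the widely adopted chat-style finetuning JSONL convention; the schema, roles, and placeholder content are assumptions for illustration, since the paper does not publish its exact finetuning payloads:

```python
import json

# Hypothetical training record for a plot-summary-to-full-text task.
# Roles and placeholder strings are invented; one such record would
# occupy one line of a .jsonl training file.
record = {
    "messages": [
        {"role": "system",
         "content": "You are a commercial writing assistant."},
        {"role": "user",
         "content": "Expand this plot summary into full prose: the narrator "
                    "describes an uneasy homecoming after years abroad."},
        {"role": "assistant",
         "content": "<verbatim chapter text from the finetuning author>"},
    ]
}

jsonl_line = json.dumps(record)     # one record per line in the .jsonl file
roundtrip = json.loads(jsonl_line)  # finetuning endpoints parse it back identically
```

The benign appearance of such records is precisely the paper's point: nothing in the prompt side contains copyrighted text, yet training on the completion side is enough to unlock latent memorization.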

Novel Discovery

The identification of a latent, author-specific reactivation mechanism via finetuning reveals a previously unrecognized security flaw in LLMs, challenging assumptions about the boundaries between training and inference.

Strong External Validity

Findings generalize across providers, authors, and finetuning corpora, suggesting a systemic industry-wide vulnerability rather than a model-specific quirk.
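The reported inter-model agreement ($r \ge 0.90$) is a Pearson correlation over per-region memorization scores. A minimal self-contained sketch, with invented toy scores standing in for the paper's measurements:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors, e.g.
    per-book-region extraction rates measured on two different models."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented per-region extraction rates for two hypothetical models;
# a high r means both models memorized the same regions of the same books.
model_a = [0.9, 0.1, 0.8, 0.2, 0.7]
model_b = [0.85, 0.15, 0.9, 0.1, 0.6]
r = pearson(model_a, model_b)
```

Agreement at this level across independently trained models is what licenses the inference from "one provider's flaw" to "industry-wide vulnerability."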

Demerits

Scope of Memorization Assessment

The study focuses on narrative prose and may not fully capture the extent of memorization in non-literary or highly technical domains where text structure differs significantly.

Limited Legal Analysis

While the article critiques fair use rulings, it does not engage deeply with the doctrinal nuances of copyright law, such as the distinction between expression and idea or the role of transformative use in model outputs.

Finetuning Task Specificity

The extraction efficacy is demonstrated primarily through a single task (plot expansion); it remains unclear whether similar vulnerabilities arise in other finetuning objectives or downstream applications.

Expert Commentary

This study represents a paradigm shift in the discourse on AI memorization and copyright, moving beyond surface-level debates about whether models 'store' training data to demonstrate how latent knowledge can be systematically reactivated through seemingly benign finetuning tasks. The authors’ empirical rigor is commendable, and their findings should prompt urgent reassessment of the legal and technical underpinnings of AI safety claims. The generalization of extraction across models and authors suggests that this is not an isolated flaw but a systemic vulnerability rooted in how modern LLMs are trained.

From a policy perspective, the findings underscore the inadequacy of reactive measures like output filtering, which courts have relied upon in decisions favoring fair use. The onus must instead shift to proactive safeguards, including transparency in training data composition and mandatory adversarial testing of finetuning pipelines.

The article also raises profound questions about the limits of alignment: if finetuning can unlock memorization, then the very process of improving model utility may inadvertently increase infringement risk. This tension between performance and protection will likely define the next phase of AI regulation and litigation.

Recommendations

  • Conduct mandatory adversarial finetuning stress tests for all commercial LLMs, requiring disclosure of extraction success rates under standardized prompts and tasks.
  • Develop and implement synthetic data policies that mandate the use of non-copyrighted or adversarially generated finetuning data, with regular audits to ensure compliance.
  • Expand judicial scrutiny of alignment claims in copyright cases to include empirical validation of safeguard efficacy, particularly where finetuning is involved.
  • Establish a cross-industry consortium to standardize memorization extraction testing protocols, similar to existing safety benchmark initiatives (e.g., MLPerf, HELM).
  • Amend regulatory frameworks to require that AI systems undergo 'memorization risk assessments' prior to deployment, with tailored requirements based on intended use cases (e.g., writing assistants vs. coding tools).

Sources

Original: arXiv - cs.CL