
Personal Information Parroting in Language Models


Nishant Subramani, Kshitish Ghate, Mona Diab

arXiv:2602.20580v1 Announce Type: new

Abstract: Modern language models (LMs) are trained on large scrapes of the Web containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses; it outperforms the best regex-based PI detectors. On a manually curated set of 483 PI instances, we measure memorization and find that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to models of varying sizes (160M-6.9B parameters) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.

Executive Summary

The article discusses the risks of personal information parroting in language models, where models memorize and reproduce sensitive information such as email addresses, phone numbers, and IP addresses. The authors develop a detector suite to identify such instances and find that larger models and longer pretraining times are positively correlated with memorization. They strongly recommend filtering and anonymizing pretraining datasets to minimize these risks.
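The parroting test the abstract describes (prompt with the tokens preceding the PI, greedy-decode, and check for an exact reproduction) can be sketched in a few lines. The version below is a character-level simplification for illustration only: the paper operates on tokens with an actual Pythia model, whereas here `generate_greedy` is an assumed caller-supplied function standing in for the model.

```python
def is_parroted(generate_greedy, document, pi_span):
    """Check verbatim PI parroting, per the setup described in the abstract.

    generate_greedy(prefix, n_chars) -> str: the model's greedy continuation
    of `prefix`, at least `n_chars` characters long (a stand-in for real
    token-level greedy decoding).
    """
    start = document.find(pi_span)
    if start == -1:
        raise ValueError("PI span not found in document")
    # Prompt with everything that precedes the PI span in the original document.
    prefix = document[:start]
    continuation = generate_greedy(prefix, len(pi_span))
    # Parroted only if the continuation reproduces the span exactly.
    return continuation.startswith(pi_span)
```

A model that has memorized the document reproduces the span and is flagged; one that has not, is not.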

Key Points

  • Language models can memorize and parrot personal information, posing significant privacy risks
  • The regexes and rules detector suite outperforms existing regex-based detectors
  • Model size and pretraining time are positively correlated with memorization of personal information
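For the three PI categories the paper targets, a minimal Python sketch of a regex-based detector might look like the following. The patterns here are simplified illustrations, not the paper's actual R&R rules.

```python
import re

# Illustrative patterns for the three PI categories studied in the paper.
# These are simplified stand-ins, not the R&R detector suite's rules.
PI_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(
        r"\b(?:\+?\d{1,3}[-.\s]?)?(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}\b"
    ),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
}

def detect_pi(text):
    """Return (category, matched_span) pairs for every PI match in `text`."""
    hits = []
    for category, pattern in PI_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group()))
    return hits
```

Pure regexes like these over- and under-match in practice (e.g. version strings that look like IPs), which is presumably why the paper augments regexes with rules.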

Merits

Novel Detector Suite

The authors develop a novel detector suite that outperforms existing regex-based detectors, providing a valuable tool for identifying parroted personal information.

Demerits

Limited Dataset

The study is based on a manually curated set of only 483 personal information instances, which may not be representative of the broader range of personal information present in pretraining data.

Expert Commentary

The article raises important concerns about the risks of personal information parroting in language models. The development of the regexes and rules detector suite is a significant contribution to the field, providing a valuable tool for identifying and mitigating these risks. However, the study's findings also highlight the need for more comprehensive and nuanced approaches to addressing data privacy in language models, including the development of more effective anonymization techniques and regulatory frameworks.

Recommendations

  • Language model developers should prioritize filtering and anonymizing pretraining datasets
  • Regulatory bodies should establish guidelines for the responsible development and deployment of language models that handle personal information
