RAT-Bench: A Comprehensive Benchmark for Text Anonymization

arXiv:2602.12806v1 Announce Type: new Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect, in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall, we find LLM-based anonymizers, including new iterative anonymizers, to provide a better privacy-utility trade-off, albeit at a higher computational cost. Importantly, we also find them to work well across languages. We conclude with recommendations for future anonymization tools and will release the benchmark and encourage community efforts to expand it, in particular to other geographies.

Executive Summary

The article introduces RAT-Bench, a comprehensive benchmark for evaluating text anonymization tools based on re-identification risk. It highlights the increasing use of personal data in training Large Language Models (LLMs) and the need for effective anonymization tools. The study evaluates various NER- and LLM-based anonymization tools using synthetic text containing direct and indirect identifiers. The findings indicate that while LLM-based anonymizers offer a better privacy-utility trade-off, they are not perfect, especially when dealing with non-standard direct identifiers and indirect identifiers that enable re-identification. The article concludes with recommendations for future anonymization tools and emphasizes the importance of community efforts to expand the benchmark to other geographies.
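The abstract mentions "iterative anonymizers" without specifying their mechanics. A common design for such systems is an anonymize-then-attack loop, where an adversarial inference step feeds back into further redaction. The sketch below is an illustrative toy, not the paper's method: `anonymize_once` and `infer_attributes` are hypothetical stand-ins (a real system would use an LLM for both roles).

```python
def anonymize_once(text: str, leaked: set[str]) -> str:
    """Redact every span the attacker previously inferred (stand-in for an LLM rewrite)."""
    for span in leaked:
        text = text.replace(span, "[REDACTED]")
    return text

def infer_attributes(text: str, ground_truth: dict[str, str]) -> set[str]:
    """Simulated LLM attacker: return identifier values still recoverable from the text."""
    return {value for value in ground_truth.values() if value in text}

def iterative_anonymize(text: str, ground_truth: dict[str, str], max_rounds: int = 5) -> str:
    """Alternate anonymization and attack until the attacker infers nothing new."""
    leaked: set[str] = set()
    for _ in range(max_rounds):
        text = anonymize_once(text, leaked)
        found = infer_attributes(text, ground_truth)
        if not found:          # attacker can no longer infer any attribute
            break
        leaked |= found        # feed the attacker's findings back into redaction
    return text

record = {"name": "Jane Doe", "city": "Tulsa"}
note = "Jane Doe, a nurse from Tulsa, reported the incident."
print(iterative_anonymize(note, record))
```

The loop terminates either when the simulated attacker comes up empty or after a fixed budget of rounds, which is one way to frame the higher computational cost the study attributes to LLM-based anonymizers: each round adds attacker and rewriter inference calls.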

Key Points

  • Introduction of RAT-Bench for evaluating text anonymization tools based on re-identification risk.
  • Evaluation of various anonymization tools using synthetic text with direct and indirect identifiers.
  • LLM-based anonymizers provide a better privacy-utility trade-off but are not perfect.
  • Recommendations for future anonymization tools and the importance of community efforts to expand the benchmark.

Merits

Comprehensive Benchmark

RAT-Bench provides a thorough evaluation framework for text anonymization tools, considering various identifiers and demographic statistics.
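RAT-Bench's exact risk metric is not given in the abstract, but a standard way to turn attacker-inferred attributes plus demographic statistics into a re-identification risk is the 1/k heuristic: if k people in the population share the inferred quasi-identifier combination, the individual is singled out with probability at most 1/k. The population counts below are invented purely for illustration.

```python
# Hypothetical U.S. population counts per (age bracket, 3-digit ZIP, gender).
# Real benchmarks would derive these from census-style demographic statistics.
POPULATION_COUNTS = {
    ("30-39", "741", "F"): 12_400,
    ("30-39", "741", "M"): 11_900,
    ("80-89", "741", "F"): 310,
}

def reidentification_risk(inferred: tuple[str, str, str]) -> float:
    """Score 1/k, where k is the population sharing the inferred attribute combination."""
    k = POPULATION_COUNTS.get(inferred, 0)
    return 0.0 if k == 0 else 1.0 / k  # no matching group in the stats -> no risk scored

# Rarer attribute combinations yield smaller k, hence higher risk.
print(f"{reidentification_risk(('80-89', '741', 'F')):.6f}")
print(f"{reidentification_risk(('30-39', '741', 'F')):.6f}")
```

This framing also shows why the study's attention to the "disparate impact of identifiers" matters: the same inferred attribute (say, an age bracket) can be nearly harmless in a large group but nearly identifying in a small one.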

Balanced Evaluation

The study evaluates both NER- and LLM-based tools, providing a balanced comparison of their effectiveness and trade-offs.

Cross-Language Applicability

The findings demonstrate the effectiveness of LLM-based anonymizers across different languages, highlighting their versatility.

Demerits

Limitations in Handling Non-Standard Identifiers

The study notes that even the best tools struggle with non-standard direct identifiers and indirect identifiers that enable re-identification.

Computational Cost

LLM-based anonymizers, while effective, come with higher computational costs, which may limit their practical applicability.

Geographic Focus

The benchmark is currently focused on U.S. demographic statistics, which may limit its applicability to other regions.

Expert Commentary

The introduction of RAT-Bench represents a significant advancement in the evaluation of text anonymization tools. By focusing on re-identification risk, the study addresses a critical gap in the current methodologies, which often overlook the nuanced ways in which identifiers can enable re-identification. The comprehensive evaluation of both NER- and LLM-based tools provides valuable insights into their respective strengths and limitations. The finding that LLM-based anonymizers offer a better privacy-utility trade-off, albeit at a higher computational cost, underscores the need for continued research and development in this area. The study's emphasis on the importance of community efforts to expand the benchmark to other geographies is particularly noteworthy, as it highlights the global relevance of data privacy concerns. Overall, the article makes a substantial contribution to the field of data privacy and sets a new standard for evaluating text anonymization tools.

Recommendations

  • Future research should focus on improving the handling of non-standard direct identifiers and indirect identifiers in anonymization tools.
  • Policymakers should consider the findings of this study when developing regulations related to data privacy and the use of personal data in training LLMs.