EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors

arXiv:2602.21218v1 (announce type: cross)

Abstract: High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a lightweight, differentially private alternative that steers LLM generation using dataset vectors, directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes. Furthermore, we enhance our method by utilizing pretrained (base) models and introducing fixed-shot prompting to boost generation diversity and fidelity. Our experiments demonstrate that EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.
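The abstract does not say how the steering vectors are extracted or sanitized. As a purely illustrative sketch, one common pattern for privatizing a mean-style statistic is to clip each private example's contribution and add Gaussian noise. The function name `sanitized_dataset_vector`, the parameters `clip_norm` and `noise_multiplier`, and the choice of a mean offset from a public mean are all assumptions made for this example, not the paper's procedure.

```python
import math
import random


def sanitized_dataset_vector(private_acts, public_acts, clip_norm, noise_multiplier, rng):
    """Hypothetical sketch: noisy mean offset of private activations from a public mean."""
    dim = len(public_acts[0])
    # Public prior: mean activation over public examples (assumed to need no protection).
    public_mean = [sum(row[j] for row in public_acts) / len(public_acts) for j in range(dim)]

    total = [0.0] * dim
    for row in private_acts:
        diff = [row[j] - public_mean[j] for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in diff))
        # Clip each private example's offset to L2 norm <= clip_norm,
        # which bounds any single record's contribution to the sum.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += diff[j] * scale

    # Gaussian mechanism: noise standard deviation proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    # Dividing by the private count treats that count as public (an assumption).
    return [x / len(private_acts) for x in noisy]
```

A call such as `sanitized_dataset_vector(private_acts, public_acts, 1.0, 1.0, random.Random(0))` would run once; the privacy guarantee that a given `noise_multiplier` provides depends on an accounting method that this sketch does not cover.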

Executive Summary

The article introduces EPSVec, a novel method for efficient and private synthetic data generation via dataset vectors. This approach decouples the privacy budget from generation, enabling the creation of arbitrarily many synthetic samples without additional privacy cost. EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while reducing computational overhead. The method utilizes pretrained models and fixed-shot prompting to boost generation diversity and fidelity, making it a promising solution for high-quality data generation while maintaining data privacy.
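The decoupling claim is consistent with the standard post-processing property of differential privacy. As a sketch, assume (the abstract does not spell this out) that the sanitized vector $M(D)$ is the only quantity that depends on the private dataset $D$, that $M$ is $(\varepsilon, \delta)$-differentially private, and that $G_k$ is any randomized procedure producing $k$ samples without further access to $D$. Then, for neighboring datasets $D, D'$ and all measurable sets $S, T$:

```latex
\Pr[M(D) \in S] \le e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
\;\Longrightarrow\;
\Pr[G_k(M(D)) \in T] \le e^{\varepsilon}\,\Pr[G_k(M(D')) \in T] + \delta
\quad \text{for every } k \ge 1 .
```

The bound does not depend on $k$, which is why the number of generated samples does not affect the privacy cost under these assumptions. It would need revisiting if any other generation input, such as prompt examples, were drawn from $D$.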

Key Points

  • EPSVec uses dataset vectors to steer large language models for synthetic data generation
  • The method decouples the privacy budget from generation: steering vectors are extracted and sanitized once, so additional samples incur no additional privacy cost
  • EPSVec outperforms existing baselines in distributional alignment and downstream utility, especially in low-data regimes
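To make the idea of steering concrete, the toy sketch below shifts a hidden state along a fixed vector and then decodes in the usual way. Everything here is an assumption made for illustration: the function names, the `strength` parameter, the single hidden state, and the tiny `unembed` matrix stand in for a real LLM, and the abstract does not say at which layer or with what scaling EPSVec applies its vectors.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def steered_next_token_probs(hidden, unembed, dataset_vector, strength):
    """Toy sketch: shift a hidden state along the dataset vector, then decode as usual."""
    steered = [h + strength * v for h, v in zip(hidden, dataset_vector)]
    # unembed[j][t] is the weight from hidden dimension j to token t.
    vocab = len(unembed[0])
    logits = [
        sum(steered[j] * unembed[j][t] for j in range(len(steered)))
        for t in range(vocab)
    ]
    return softmax(logits)
```

With `strength = 0` the function reduces to ordinary decoding. A positive `strength` raises the probability of tokens whose unembedding direction aligns with `dataset_vector`, and the same vector can be reused for every generated sample.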

Merits

Efficient Generation

EPSVec enables the creation of arbitrarily many synthetic samples without additional privacy cost, making it an efficient solution for data generation

Improved Fidelity

The method utilizes pretrained models and fixed-shot prompting to boost generation diversity and fidelity, resulting in high-quality synthetic data

Demerits

Limited Applicability

EPSVec is presented for LLM-based text generation, so it may not carry over to other data types or domains, and its performance may vary with the specific use case

Expert Commentary

The introduction of EPSVec marks a significant advancement in the field of synthetic data generation. By decoupling the privacy budget from generation, EPSVec addresses a major limitation of existing methods, which are often computationally slow and require large private corpora. The method's ability to generate high-quality synthetic data while maintaining data privacy makes it a promising solution for various applications. However, further research is needed to fully explore the potential of EPSVec and its limitations, particularly in terms of its applicability to different domains and data types.

Recommendations

  • Further research should be conducted to explore the applicability of EPSVec to different domains and data types
  • The method should be compared to other state-of-the-art solutions for synthetic data generation to fully evaluate its performance and limitations
