Synthetic Data, Hashing, Enterprise Data Leakage, and the Reality of Privacy Risks: What to Know

28 Jul 2024

The timely “No, Hashing Still Doesn’t Make Your Data Anonymous” post from the FTC is a great reminder that, especially with the rise of large language models (LLMs) and generative AI, how those models are trained and fine-tuned creates opportunities for massive data leakage.

Synthetic data is often considered the convenient solution to the data privacy challenges associated with LLM training and fine-tuning. However, synthetic data is not equivalent to anonymous or de-identified data.

The Appeal and Enthusiasm for Synthetic Data

Synthetic data fills gaps where real data is hard to obtain. It’s great for simulating rare events or generating large datasets for many machine-learning models. Advances in generative AI tooling have enhanced synthetic data’s capabilities, making it more versatile and powerful.

Synthetic data accelerates innovation by enabling rapid development and testing of new algorithms and technologies. It provides a sandbox for experimentation without the constraints of real-world data limitations. As TDS Editors mention, “Using synthetic data isn’t exactly a new practice: it’s been a productive approach for several years now.”

Generative AI and Privacy Risks

Despite the enthusiasm, it’s critical to recognize that synthetic data is not inherently anonymous. Generative AI and LLMs have unique privacy risks. Synthetic data can still reflect patterns from real data, leading to potential re-identification risks.

From a data loss prevention and data privacy point of view, there are several additional risks unique to LLMs:

  • Membership and Property Leakage from Pre-Training Data: LLMs can inadvertently memorize sensitive information during their training on vast datasets.

  • Model Features Leakage from Trained LLM: Features learned during training can leak sensitive information when the model is queried.

  • Privacy Leakage from Conversations (History) with LLMs: Data from user interactions can be stored and potentially exposed.

  • Compliance with Privacy Intent of Users: Ensuring that the data used and generated by LLMs complies with user consent and privacy expectations.

Enterprise data leakage becomes more relevant when leveraging LLMs in settings such as Retrieval-Augmented Generation (RAG) or fine-tuning LLMs with enterprise data to create domain-specific models. To prevent leaks, the privacy of enterprise (training) data must be safeguarded.

Synthetic Data: Appealing Yet Not Automatically Anonymous

Synthetic data addresses many issues related to privacy and enterprise data leakage. However, as previously stated, synthetic data is not automatically anonymous. Evaluating the quality of synthetic data and avoiding the pitfalls highlighted in studies such as “On the Inadequacy of Similarity-based Privacy Metrics: Reconstruction Attacks against ‘Truly Anonymous Synthetic Data’” is crucial.

It’s much better to rely on a mathematical definition of privacy with provable guarantees, such as differential privacy. Differential privacy provides robust guarantees by ensuring that the inclusion or exclusion of any single data point does not significantly affect the outcome, thereby protecting individual privacy.
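To make that guarantee concrete, here is a minimal pure-Python sketch of the Laplace mechanism for a counting query. The function names and the toy dataset are illustrative, not from any particular library: adding or removing one record changes a count by at most 1 (sensitivity 1), so adding Laplace noise with scale 1/ε yields ε-differential privacy for that query.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse-transform sampling."""
    u = 0.0
    while u == 0.0:          # avoid log(0) at the distribution edge
        u = random.random()
    u -= 0.5                 # uniform on (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Differentially private count. Adding/removing one record moves
    the true count by at most 1 (sensitivity = 1), so Laplace noise
    with scale 1/epsilon gives epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy example: a noisy count of records with age > 40 (true count: 4).
ages = [34, 45, 29, 61, 38, 52, 47]
noisy = dp_count(ages, lambda a: a > 40, epsilon=1.0)
```

Smaller ε means more noise and stronger privacy; the answer to the query stays useful in aggregate while any individual’s presence is masked.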

Evaluating Synthetic Data Quality

To ensure synthetic data is effective and safe, evaluate its quality rigorously.

  • Fidelity: Measures how well synthetic data reflects the statistical properties of real data. High fidelity means the synthetic data maintains the patterns and distributions of the original dataset. “If our model can produce synthetic data that can be considered to be a random sample from the same parent distribution, then we’ve hit the jackpot.”

  • Utility: Determines whether synthetic data can be used effectively for its intended purpose. High-utility synthetic data should yield similar results to real data in machine learning models or statistical analyses. “The synthetic data should be just as useful when put to tasks such as regression or classification.”

  • Privacy: Ensures that synthetic data does not allow for re-identification of individuals. The Maximum Similarity Test compares the similarities within and between datasets to detect privacy breaches. Skabar highlights that “the biggest danger with synthetic data points being too close to observed points is privacy; i.e., being able to identify points in the observed set from points in the synthetic set.”
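One lightweight way to operationalize the fidelity check is a two-sample Kolmogorov–Smirnov statistic on each numeric marginal: the largest gap between the empirical CDFs of the real and synthetic columns. This is a sketch with made-up toy data, not the post’s own methodology; a full fidelity evaluation would also compare joint distributions and correlations.

```python
import bisect

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of the two samples. Values near 0 suggest the
    synthetic marginal closely tracks the real one."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)

    def ecdf(sorted_sample, x):
        # Fraction of the (sorted) sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(real_sorted, v) - ecdf(synth_sorted, v))
               for v in real_sorted + synth_sorted)

# Toy marginals for a single numeric column, e.g. "age".
real_ages = [23, 35, 41, 29, 52, 47, 38, 60]
synth_ages = [25, 33, 44, 30, 50, 49, 36, 58]
d = ks_statistic(real_ages, synth_ages)  # closer to 0 = higher fidelity
```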

Practical Techniques

  • Maximum Similarity Test: For each synthetic instance, compute its maximum similarity to the real dataset, and compare the distribution of those maxima with the maxima observed within the real data itself. Matching distributions indicate high fidelity and utility; synthetic maxima that run noticeably higher signal near-copies of real records and a potential privacy breach.

  • Train on Synthetic, Test on Real (TSTR): Train a machine learning model on synthetic data and test it on real data. High performance in this test shows that the synthetic data maintains utility and accurately represents real data.

  • Gower Similarity: For mixed-type datasets, Gower Similarity measures the distance between data points, providing a comprehensive similarity measure that accommodates various data types.
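The Maximum Similarity Test and Gower Similarity combine naturally. Below is a minimal sketch, assuming dict-shaped records and a toy mixed-type dataset (both invented for illustration): Gower similarity averages per-field similarities (range-normalized for numeric fields, exact match for categorical ones), and the test compares synthetic-vs-real maximum similarities against real-vs-real (leave-one-out) maxima.

```python
def gower_similarity(a, b, num_ranges):
    """Gower similarity between two mixed-type records (dicts).
    Numeric fields: 1 - |a - b| / range; categorical fields: exact match.
    num_ranges maps each numeric field name to its observed range."""
    parts = []
    for key in a:
        if key in num_ranges:
            rng = num_ranges[key] or 1.0   # guard against zero range
            parts.append(1 - abs(a[key] - b[key]) / rng)
        else:
            parts.append(1.0 if a[key] == b[key] else 0.0)
    return sum(parts) / len(parts)

def max_similarities(queries, references, num_ranges, exclude_self=False):
    """For each query record, its maximum Gower similarity to the
    reference set (skipping itself when comparing real vs. real)."""
    out = []
    for i, q in enumerate(queries):
        sims = [gower_similarity(q, r, num_ranges)
                for j, r in enumerate(references)
                if not (exclude_self and i == j)]
        out.append(max(sims))
    return out

real = [{"age": 34, "city": "NYC"}, {"age": 51, "city": "SF"},
        {"age": 29, "city": "NYC"}, {"age": 62, "city": "LA"}]
synth = [{"age": 36, "city": "NYC"}, {"age": 48, "city": "SF"},
         {"age": 60, "city": "LA"}]
ranges = {"age": max(r["age"] for r in real) - min(r["age"] for r in real)}

real_vs_real = max_similarities(real, real, ranges, exclude_self=True)
synth_vs_real = max_similarities(synth, real, ranges)
# If synth_vs_real maxima run systematically higher than real_vs_real,
# synthetic records may be near-copies of real individuals.
```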
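The TSTR protocol can likewise be sketched in a few lines. To keep the example dependency-free, it uses a deliberately simple nearest-centroid classifier on an invented toy dataset; in practice you would substitute your actual model and data.

```python
def fit_centroids(X, y):
    """Nearest-centroid 'model': the mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], x))

# Train on Synthetic: a toy synthetic training set (features, labels).
X_synth = [[1.0, 0.9], [1.2, 1.1], [3.0, 3.2], [2.9, 3.1]]
y_synth = ["low", "low", "high", "high"]

# Test on Real: a held-out real set the model never saw.
X_real = [[1.1, 1.0], [3.1, 3.0]]
y_real = ["low", "high"]

model = fit_centroids(X_synth, y_synth)
accuracy = sum(predict(model, x) == t
               for x, t in zip(X_real, y_real)) / len(y_real)
```

High accuracy on the real test set is evidence that the synthetic data preserved the decision-relevant structure of the real data.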

The FTC article about the ineffectiveness of hashing as privacy protection is a good reminder that robust privacy-preserving techniques and thorough quality evaluations are key to ensuring synthetic datasets are safe and functional.