Privacy-Preserving Techniques in Generative AI and Large Language Models: A Narrative Review
www.mdpi.com
Generative AI, including large language models (LLMs), has transformed the paradigm of data generation and creative content, but this progress raises critical privacy concerns, especially when models are trained on sensitive data. This review provides a comprehensive overview of privacy-preserving techniques aimed at safeguarding data privacy in generative AI, such as differential privacy (DP), federated learning (FL), homomorphic encryption (HE), and secure multi-party computation (SMPC). These techniques mitigate risks like model inversion, data leakage, and membership inference attacks, which are particularly relevant to LLMs. Additionally, the review explores emerging solutions, including privacy-enhancing technologies and post-quantum cryptography, as future directions for enhancing privacy in generative AI systems. Recognizing that achieving absolute privacy is mathematically impossible, the review emphasizes the necessity of aligning technical safeguards with legal and regulatory frameworks to ensure compliance with data protection laws. By discussing the ethical and legal implications of privacy risks in generative AI, the review underscores the need for a balanced approach that considers performance, scalability, and privacy preservation. The findings highlight the need for ongoing research and innovation to develop privacy-preserving techniques that keep pace with the scaling of generative AI, especially in large language models, while adhering to regulatory and ethical standards.
I still don't see anything about encrypted inference. This mostly looks like ways to avoid having the model retain sensitive data during its training. What I'd really like to see is a way to send encrypted data off to a cloud where I can pay for inference compute, then receive something I can unscramble to get the actual response, without said cloud ever having the raw unencrypted data.
EDIT: did some research, and it seems that fully encrypted inference without leaking data via semantic meaning is possible, at the small cost of making inference 52000 times more computationally expensive. Seems more research is required.
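For anyone wondering what "the cloud computes on data it can't read" looks like in practice, here is a minimal sketch. It is not the fully homomorphic scheme the figure above refers to. It swaps in textbook Paillier, which is only additively homomorphic, so it covers a single linear layer (weighted sum plus bias) with integer inputs and plaintext integer weights. The function names (`keygen`, `encrypt`, `encrypted_linear`, `decrypt`, `demo`) and the tiny hard-coded primes are illustrative assumptions, and the key size is nowhere near secure.

```python
import math
import secrets

# Toy key material: two Mersenne primes. Far too small for real security.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """Return (public_n, private_key) for textbook Paillier with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (needs Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Client side: encrypt a (possibly negative) integer m under public key n."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1  # random blinding factor in [1, n-1]
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def encrypted_linear(n, enc_x, weights, bias):
    """Server side: compute Enc(w . x + b) using only ciphertexts and n."""
    n2 = n * n
    acc = (1 + (bias % n) * n) % n2  # unblinded encryption of the plaintext bias
    for c, w in zip(enc_x, weights):
        # Multiplying ciphertexts adds plaintexts; raising to w scales by w.
        acc = acc * pow(c, w % n, n2) % n2
    return acc


def decrypt(n, priv, c):
    """Client side: recover the signed integer result with the private key."""
    lam, mu = priv
    n2 = n * n
    m = (pow(c, lam, n2) - 1) // n * mu % n
    return m - n if m > n // 2 else m  # map back to a signed value


def demo():
    n, priv = keygen()
    x = [3, 5, 2]                      # private input, never sent in the clear
    weights, bias = [2, -1, 4], 7      # the server's plaintext model parameters
    enc_x = [encrypt(n, v) for v in x]
    enc_y = encrypted_linear(n, enc_x, weights, bias)  # runs "in the cloud"
    return decrypt(n, priv, enc_y)


print(demo())  # 2*3 - 1*5 + 4*2 + 7 = 16
```

The server only ever sees `enc_x` and the public modulus `n`, yet the client decrypts the correct weighted sum. The part this sketch leaves out is the hard one: nonlinear steps such as activations or softmax can't be done with additions alone, and as far as I understand, handling them under fully homomorphic encryption is a large part of why the overhead gets so extreme.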
And learning from the dataset is kinda the whole point of LLMs, right? I see some fundamental problems there. If you ask it where alpacas are from, or which symptoms point to a given medical condition, you want it to return what it memorized earlier. It kind of doesn’t help if it makes something else up to “preserve privacy”.
Do they address that? I see lots of flowery words like
But I mean that’s just silly.
Are you referring to retraining models with the same training data encrypted with your own key, and only interacting with the model via the same key? That’s the only way I’ve heard of that might be possible, but that was a year or more ago.