UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

2 weeks ago·arXiv

Abstract

Ilia Shumailov1, Jamie Hayes1, Eleni Triantafillou1, Guillermo Ortiz-Jimenez1, Nicolas Papernot1, Matthew Jagielski1, Itay Yona1, Heidi Howard1 and Eugene Bagdasaryan1

Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently unlearning is often discussed as an approach for removal of impermissible knowledge i.e. knowledge that the model should not possess such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used for in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation. We discuss feasibility of ununlearning for modern LLMs and examine broader implications.

Introduction

Recent advancements in Large Language Models (LLMs) raise concerns about their use for undesirable purposes. Unlearning emerged as a promising solution for knowledge control, originally developed for removal of privacy-sensitive information (Bourtoule et al., 2021). Since then, several works have attempted to utilize unlearning for a host of applications relating to the removal of undesired knowledge or behaviours: removing harmful capabilities (Lynch et al., 2024) or harmful responses (Liu et al., 2024; Yao et al., 2023), erasing backdoors (Liu et al., 2022) or specific information or knowledge pertaining to a particular topic (Eldan and Russinovich, 2023; Li et al., 2024), erasing copyrighted content (Yao et al., 2023) and even reducing hallucinations (Yao et al., 2023). Such applications have been studied in the context of diffusion models too, with various attempts to use unlearning to remove unsafe concepts (Fan et al., 2023; Zhang et al., 2023).

This paper discusses the application of unlearning to LLMs for removal of broadly impermissible knowledge, the use-case often discussed in policy

circles e.g. for removal of biological and nuclear knowledge (Li et al., 2024). In fact, we uncover a fundamental inconsistency of the unlearning paradigm for this application. While unlearning aims to erase knowledge, the inherent in-context learning (ICL) (Agarwal et al., 2024; Brown et al., 2020; Kossen et al., 2024) capabilities of LLMs introduce a major challenge. We introduce the concept of ununlearning, where successfully unlearned knowledge can resurface through contextual interactions. This raises a critical question: if unlearned information can be readily reintroduced, is unlearning a truly effective approach for making sure that the model does not exhibit impermissible behaviours? We discuss the ramifications of ununlearning, particularly the need for effective content regulation mechanisms to prevent the resurgence of undesirable knowledge. Ultimately, we question the long-term viability of unlearning as a primary tool for content regulation.

Note, this paper explicitly only considers the case when unlearning is used for purposes of content regulation i.e. problems formulated as as a model developer I do not want my model to be able to perform X, where X can be e.g. bioweapons development (Li et al., 2024). Entities deploying models operate under the expectation that those models don’t pose a risk of being exploited for dangerous applications like weapons development. Importantly, it does not cover the original use-case of unlearning for the privacy purposes.

Nomenclature

In what follows, we rely on six main terms:

(Informal) Definition 1. Knowledge refers to information available to the model. This information can take up different forms and includes e.g. in-context provided inputs, information stored in the parameters of the model, or evidence available for retrieval.

(Informal) Definition 2. Content filtering refers to the process of filtering out queries to and responses from a given model. Filtering can both be a part of the model, as well as, be external to it (Glukhov et al., 2023).

(Informal) Definition 3. Unlearning refers to a process in which knowledge is removed from a given model. This is a broad description that can encompasses different application scenarios. Below, we provide two informal definitions of unlearning.

(Informal) Definition 4. Unlearning for privacy seeks to remove knowledge that is defined as a particular subset of the model’s original training dataset, referred to as the “forget set”. Formal definitions (Ginart et al., 2019; Sekhari et al., 2021) require the (distribution of the) unlearned model to be indistinguishable from (the distribution of the) model retrained excluding the forget set. According to this notion, an unlearning method can either be exact (Bourtoule et al., 2021; Mure- sanu et al., 2024), guaranteeing indistinguishability between the aforementioned distributions, or inexact (Golatkar et al., 2020; Kurmanji et al., 2024; Thudi et al., 2022), instead offering an approximation to that goal in exchange for greater efficiency or better model utility.

(Informal) Definition 5. Unlearning for con-

tent regulation seeks to remove knowledge that is (believed to be) associated with producing impermissible content. Note that the specification of this knowledge in this case does not necessarily take the form of identifying a subset of the training dataset. Instead, it captures more broadly information that we want to remove from the model (regardless of which subset of training data led to acquiring it), in order to prevent it from generating impermissible content. This notion aligns with the recent related notion of Goel et al. (2024) for “corrective unlearning”.

(Informal) Definition 6. In-Context Learning refers to an emergent capability of language models to generalise to tasks from task descriptions without these tasks present in the training data (Brown et al., 2020; Kossen et al., 2024; Mil- ios et al., 2023). At the time of writing, such a model capability is ubiquitous, albeit some tasks are not perfectly solvable from descriptions alone.

For example, it may seem natural to remove any mention of what term bomb means from the training set of the model, expecting that when asked to make bombs the model would fail – simply because it possess no knowledge of what the word bomb means. Yet, if a term bomb, for example under a different name as we show in Figure 2, is defined as the a substance with certain properties and the model possess reasoning skills, a recipe for a bomb can be derived by the model, provided there is enough chemical knowledge still available to the model.

Importantly, we highlight that even exact unlearning would not stop the model from performing an impermissible act in the example above. That is because even though the model never saw the bomb-defining data, it still possesses all of the knowledge required to construct one. Consequently, no unlearning method can lead to ‘better removal’ of the impermissible knowledge, compared to the model that never trained on this knowledge in the first place. We therefore argue that if ununlearning poses an issue even in this ‘idealized unlearning’ setting, this problem can only be exacerbated under imperfect unlearning.

Given it is hard to reason about knowledge compositionality, i.e. how knowledge interacts with other knowledge available to the model, it is not always obvious what logical inference given (benign) knowledge will enable. As such, attribution of model behaviour to specific knowledge that a user introduced is not always trivial.

Types of Knowledge

To best understand the intuition behind ununlearning, we first discuss types of knowledge available to our models. We broadly classify knowledge into two main categories: axioms and theorems to represent facts and assumptions, and the derivative knowledge.

Consider the example depicted in Figure 1, where we have three concepts defined out of 6 axioms. Here we define an axioms of Ear, Eye, Tail, Big, Striped, and Gallops as the smallest units of knowledge. We then use these to define three main theorems. If it has Ear, Eye, and Tail then we say its a Cat. If it is a Cat, but also is Big and Striped then it is a Tiger. Yet, if it is Big, Striped and Gallops then its a Zebra.

Now lets consider the case where concept of Tiger is impermissible and any queries regarding the Tiger are not allowed. In other words, the model developer explicitly expects that the model would not be used to reason about Tigers. We unlearn the concept of Tiger perfectly, using either exact unlearning or a very strong approximate unlearning scheme. Yet, even with the concept of Tiger gone, all of the underlying axiomatic knowledge is still retained inside of the model, since its features are used for other theorems available in the model, namely Zebra and Big in the example. If this knowledge were to also be unlearned from the model, it would start impacting the utility on other permissible tasks e.g. for reasoning about Zebra.

In the same way, many other conflicting objectives will leave the model possessing impermissible knowledge. For example, it may be desirable for the model to respond to high-school chemistry essay questions, yet not to have bomb-making knowledge. Figure 2 presents one such example.

UnUnlearning

Setting: Assume a model that takes in 𝑥 ∈ X and outputs Y. Note that here X includes both training and inference-time data. We assume we have an unlearning method that takes in a model and points ˆ X ⊆ X and outputs a model ˆthat unlearned the said points 𝑢(𝑀, ˆ𝑋) = ˆ𝑀 . In this paper we assume that ˆX is the impermissible knowledge that is defined and identified by the model developer. We argue that by relying on in-context learning (ICL) an adversary can bring back the knowledge such that ˆwhen prompted with special context outputs the same results are 𝑀 over ˆX . In other words, 𝑀( ˆ𝑋) ≈ ˆ𝑀(𝑝𝑟𝑜𝑚𝑝𝑡+ ˆ𝑋).

(Informal) Definition 7. UnUnlearning refers to a process where previously removed or never learned knowledge is instructed into the model using in-context learning.

In this paper we argue that ununlearning is a concern that should be taken into account when unlearning schemes are used and designed for removing impermissible functionality; as well as, models openly released. Note that ununlearning applies even to exact unlearning.

Discussion

There is a number of ramification of ununlearning.

UnUnlearning implies that for unlearning to remain effective, one needs to not only remove impermissible knowledge, but also perform a continuous, active process of suppressing in-context attempts to reintroduce that knowledge. Fundamental computational bounds suggest that such filtering is likely limited too, for example because of dependence on the context in which queries appear (Glukhov et al., 2023).

Definitions and Mechanisms for Unlearning Given the above, is traditional unlearning by itself even a worthwhile pursuit for controlling the use of impermissible knowledge? Consider the Tiger–Zebra setting we discussed earlier. If we accept classic definitions for unlearning, a model that never trained on any Tiger knowledge represents perfect unlearning by construction. In other

Figure 1 | We broadly separate the knowledge into two main types: axioms and theorems to represent given facts and derived knowledge respectively. In the example above we assume that all theorems are defined in terms of underlying axioms, where some axioms are shared by different theorems. While unlearning of Cat may let the model forget what Cat means, it is relatively easy to redefine it provided that the underlying axioms are preserved.

Figure 2 | The figure demonstrates the concept of ununlearning. Here, the model that at first possess impermissible bomb making knowledge. The defender uses exact unlearning to remove all instances of usage of the term bomb, making the model incapable of providing bomb recipes, since it does not possess the knowledge of what that term describes. The adversary uses the knowledge still available to the model to describe the concept and as a result the model provides a response.

words, no unlearning method would perform better than a model that never saw any of the knowledge. Yet, since the model with in-context introduced concept of Tiger can successfully reason about the Tiger it is clearly not confirming to the objective of not reasoning at all about it. Here, further exploration into the precise definitions and mechanisms for ununlearning is essential – we need to find a way to explicitly limit the reasoning capability for our models that is invariant to prompting and other types of learning.

Attributing knowledge If a small benign axiom forces the attacker to discover a malicious theorem, who should be attributed for it? This dilemma mirrors the age-old philosophical debate – should an act be attributed to an individual who directly executes the act, the person who gives the order, the manufacturer of the tool, or the original tool designer (Fischer and Ravizza, 1998; Owen, 1992)? We argue that given that knowledge can be introduced by different parties, unlearning should not be assumed to be the sole mechanism for content policy enforcement.

Forbidding knowledge Instead of filtering out data, it may be better to teach the model explicitly that some knowledge is off limits (Henderson et al., 2023). Do note however that this is not a bullet proof solution and will unlikely to be robust against, for example, mosaic attacks (Glukhov et al., 2023). Furthermore, the approach of forbidding knowledge requires foreseeing the types of harmful tasks or use-cases ahead of time and does not guarantee preventing harmful use more generally, beyond a given set of identified use-cases. Finally, it is not clear how a model should react to violations of its content regulations. In privacy literature it is a known fact that existence of the privacy mechanism can lead to an increased privacy leakage (Wang et al., 2022), and a similar effect was previously observed in unlearning (Hayes et al., 2024). In other words, a model rejecting a request to synthesise a given chemical recipe can also help the malicious user understand what recipes can be used for malice.

Conclusion

In this paper we argue that unlearning offers an incomplete solution for impermissible knowledge removal in LLMs with strong ICL capability. UnUnlearning forces us to rethink unlearning as a one-size-fits-all solution and places emphasis on content filtering.

References

R. Agarwal, A. Singh, L. M. Zhang, B. Bohnet, S. Chan, A. Anand, Z. Abbas, A. Nova, J. D. Co-Reyes, E. Chu, F. Behbahani, A. Faust, and H. Larochelle. Many-shot in-context learning, 2024.

L. Bourtoule, V. Chandrasekaran, C. A. Choquette- Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 141–159, 2021. doi: 10.1109/SP40001.2021. 00019.

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper. pdf.

R. Eldan and M. Russinovich. Who’s harry potter? approximate unlearning in llms, 2023.

C. Fan, J. Liu, Y. Zhang, D. Wei, E. Wong, and S. Liu. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation. arXiv preprint arXiv:2310.12508, 2023.

J. M. Fischer and M. Ravizza. Responsibility and Control: A Theory of Moral Responsibility. Cambridge University Press, New York, 1998.

A. Ginart, M. Guan, G. Valiant, and J. Y. Zou. Making ai forget you: Data deletion in machine learning. Advances in neural information processing systems, 32, 2019.

D. Glukhov, I. Shumailov, Y. Gal, N. Papernot, and V. Papyan. LLM censorship: A machine learning challenge or a computer security problem?, 2023.

S. Goel, A. Prabhu, P. Torr, P. Kumaraguru, and A. Sanyal. Corrective machine unlearning. arXiv preprint arXiv:2402.14015, 2024.

A. Golatkar, A. Achille, and S. Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020.

J. Hayes, I. Shumailov, E. Triantafillou, A. Khalifa, and N. Papernot. Inexact unlearning needs more careful evaluations to avoid a false sense of privacy, 2024.

P. Henderson, E. Mitchell, C. Manning, D. Ju- rafsky, and C. Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 287–296, 2023.

J. Kossen, Y. Gal, and T. Rainforth. In-context learning learns label relationships but is not conventional learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=YPIA7bgd5y.

M. Kurmanji, P. Triantafillou, J. Hayes, and E. Tri- antafillou. Towards unbounded machine unlearning. Advances in Neural Information Processing Systems, 36, 2024.

N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, R. Lababidi, L. Justen, A. B. Liu, M. Chen, I. Barrass, O. Zhang, X. Zhu, R. Tamirisa, B. Bharathi,

A. Khoja, A. Herbert-Voss, C. B. Breuer, A. Zou, M. Mazeika, Z. Wang, P. Oswal, W. Liu, A. A. Hunt, J. Tienken-Harder, K. Y. Shih, K. Talley, J. Guan, R. Kaplan, I. Steneker, D. Campbell, B. Jokubaitis, A. Levinson, J. Wang, W. Qian, K. K. Karmakar, S. Basart, S. Fitz, M. Levine, P. Kumaraguru, U. Tupakula, V. Varadharajan, Y. Shoshitaishvili, J. Ba, K. M. Esvelt, A. Wang, and D. Hendrycks. The wmdp benchmark: Measuring and reducing malicious use with unlearning, 2024.

Y. Liu, M. Fan, C. Chen, X. Liu, Z. Ma, L. Wang, and J. Ma. Backdoor defense with machine unlearning. In IEEE INFOCOM 2022-IEEE conference on computer communications, pages 280– 289. IEEE, 2022.

Z. Liu, G. Dou, Z. Tan, Y. Tian, and M. Jiang. Towards safer large language models through machine unlearning. arXiv preprint arXiv:2402.10058, 2024.

A. Lynch, P. Guo, A. Ewart, S. Casper, and D. Hadfield-Menell. Eight methods to evaluate robust unlearning in llms. arXiv preprint arXiv:2402.16835, 2024.

A. Milios, S. Reddy, and D. Bahdanau. In-context learning for text classification with many labels. In D. Hupkes, V. Dankers, K. Batsuren, K. Sinha, A. Kazemnejad, C. Christodoulopoulos, R. Cotterell, and E. Bruni, editors, Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, pages 173–184, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.genbench-1. 14. URL https://aclanthology.org/ 2023.genbench-1.14.

A. Muresanu, A. Thudi, M. R. Zhang, and N. Pa- pernot. Unlearnable algorithms for in-context learning. arXiv preprint arXiv:2402.00751, 2024.

D. G. Owen. Moral foundations of products liabil- ity law: Toward first principles. Notre Dame L. Rev., 68:427, 1992.

A. Sekhari, J. Acharya, G. Kamath, and A. T. Suresh. Remember what you want to forget: Algorithms for machine unlearning. Advances

in Neural Information Processing Systems, 34: 18075–18086, 2021.

A. Thudi, H. Jia, I. Shumailov, and N. Papernot. On the necessity of auditable algorithmic definitions for machine unlearning. In 31st USENIX Security Symposium (USENIX Security 22), pages 4007–4022, Boston, MA, Aug. 2022. USENIX Association. ISBN 978-1-939133-31-1. URL https://www.usenix. org/conference/usenixsecurity22/ presentation/thudi.

J. Wang, R. Schuster, I. Shumailov, D. Lie, and N. Papernot. In differential privacy, there is truth: on vote-histogram leakage in ensemble private learning. Advances in Neural Information Processing Systems, 35:29026–29037, 2022.

Y. Yao, X. Xu, and Y. Liu. Large language model unlearning. arXiv preprint arXiv:2310.10683, 2023.

E. Zhang, K. Wang, X. Xu, Z. Wang, and H. Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. arXiv preprint arXiv:2303.17591, 2023.

Designed for Accessibility and to further Open Science