Why do ChatGPT “jailbreaks” work?

Question

Do developers not genuinely want to prevent them? It seems like if they are able to develop such impressive AI models then it wouldn’t be that difficult to create catch-all/wildcard mitigations to the various jailbreak methods that are devised over time.

What gives? (i.e., what is so difficult about plugging these holes, or why might the developers not really want to do so?)

Keep in mind that AI models are mostly blackboxes. You cannot define precise reasoning steps within it. To change the behavior of an AI model, there is only one thing you can do : train it over data that depicts the behavior you want it to mimic. If despite this, there are still holes, there is next to nothing you can manually do to alter the model's output. — Arthur Attout, Commented Oct 25, 2023 at 7:01
In essence, a LLM is several billion numbers that are multiplied and operated with each other in a very particular way. And the numbers that the algorithm outputs, they in the end are converted into a text. There is no way to know what the numbers mean in the middle, there is no way to see the holes. You can experience them, but not see them. — Ander Biguri, Commented Oct 25, 2023 at 9:26
Chat swear/curse filters can be defeated by using words that are close in appearance or meaning to real swear words. On a very high level, I imagine the same principles apply to jailbreaking. — John Gordon, Commented Oct 26, 2023 at 1:50
You can actually create an adversarial AI to deliberately create jailbreak prompts. One of the easiest thing to do is to try to get it to do illegal things, like suggest a top 10 piracy website. You then create an adversarial LLM with the goal of making the target LLM divulge the "forbidden" info. It is very much a cat and mouse game, but now taken to the next level. The only way for the LLM to be "safe" is have such a massive lead in processing power that any "undesirable" outputs and its respective prompts are trained away in a very short time. — Nelson, Commented Oct 26, 2023 at 7:33
Of course, note that the importance of jailbreaks has been wildly blown out of proportion by the media. If I input "death to humanity" in Word nobody accuses Microsoft of improper behavior. But if I convince ChatGPT to output "death to humanity" through some clever tricks, it's somehow turns into a huge ordeal. In a saner world OpenAI would just shrug and tell people they don't care in the slightest. — JonathanReez, Commented Oct 27, 2023 at 17:24

Neil Slater · Accepted Answer · 2023-10-25 10:48:45Z

"Jailbreaks" work for a variety of reasons:

A lot of the setup that turns an LLM instance into a polite, well-behaved chatbot is actually just a hidden piece of starting text (a "pre-prompt" or a "system prompt") that the LLM processes in the same way as user input - the system text will always be inserted first, so sets context for how other text is processed, but is not otherwise privileged. There are other components and factors involved, but the LLM at the centre of it all remains a text-prediction engine that works with the text it has seen so far. When it processes multiple conflicting wording and rules, the core system does not always have an easy way to prioritise, and can decide to base predictions on new instructions instead of old ones.
Many "Jailbreaks" are creative in that they obey the letter of the law from the pre-prompt and training rules, but re-frame a conversation into a place where issues that would be blocked by rules are no longer valid. A very common jailbreak theme is to get the chatbot to respond as if it is writing fiction from some imagined perspective that is not its assigned identity.
It is very hard to detect and block jailbreaks without also blocking uses that are intended or supported. The task is not dissimilar to trying to control a conversation between two people by writing down a list of rules for one of them to follow, that they can consult when they answer. The rules have to be simple and objective so they can be followed, but a conversation can progress in many ways which make it tricky to process whether a rule applies. For example conversation topic can become subjective, and/or allegorical, it can consist of asides and multiple layers etc.
LLMs are very complex internally, and driven by an amount of data that is next to impossible for a human to navigate. The developers cannot exert detailed control on the models - we're still in the phase of not fully understanding how an LLM can perform some of the types of processing that it does. These things are being unpicked in published papers, but the work is not complete.

The way I've heard the problem described before is that the language model is a machine that executes code - where the code is text - and the user is essentially supplying code to be executed directly on the machine, with the same privileges as the code which tells the machine what it can and can't do (the prompt). — thomasrutter, Commented Oct 25, 2023 at 6:49
"we're still in the phase of not fully understanding how an LLM can perform some of the types of processing that it does." Most people would struggle to take a list of x.y coordinates and estimate how far they are from each other. Add a z coordinate and it's even more difficult. High dimensional space is impossible for a human to intuitively comprehend. I wonder if we will ever be able to really truly understand LLMs at a level that allows for direct control over their behavior. — JimmyJames, Commented Oct 25, 2023 at 16:56
@JimmyJames False equivalency. Experts understand a lot of things at a level that allows them to directly control those things. A lot of those things are much more difficult and unintuitive to the layperson than 3-dimensional spaces. — Nobody, Commented Oct 25, 2023 at 21:57
Also the reason AI is so hot right now is that it lets us tackle problems that are impossible with software alone. So it shouldn’t be surprising that we don’t know how trained AI models work or can easily correct their behavior. — bob, Commented Oct 26, 2023 at 2:09
@JimmyJames I don't understand why you brought up distances in 2D or 3D spaces and how it's related to understanding LLMs at all. Could you elaborate? — JiK, Commented Oct 27, 2023 at 11:48

JimmyJames · Accepted Answer · 2023-10-25 18:37:25Z

Let's step back for a moment and consider your assertion:

It seems like if they are able to develop such impressive AI models

This implies that you are thinking of these models as being programmed in the traditional sense. That is, a development team coded its abilities into it using some means or another. At a very fundamental level, that is a completely incorrect way to think about LLMs. A language model is not 'developed'. It is 'trained'. That is, they are fed a bunch of inputs such as texts which is processed into a mathematical model. This model is not crafted by anyone. It 'emerges' from the relationships inherent in the content of the inputs.

The fact of the matter is that no one really fully understands exactly all the relationships that are encoded in these models. And we know that the data that fed things like ChatGPT is full of incorrect, biased, and unsavory content. The 'development' part of this is putting in rules to try to prevent those parts from surfacing during use. But because no one really knows where all the bad parts are, they can't know they have accounted for all of them. It's a bit of a whack-a-mole problem.

One way this manifests is that ChatGPT 4 is reportedly easier to jailbreak than ChatGPT 3.5. This makes sense if you consider the above.

I have an analogy that I'm not sure about, but I'll give it a go: when we consider large sauropods we tend to think of them as herbivores. But in actuality, they were omnivorous because it's nearly impossible to take a gigantic bite of a tree that doesn't contains some bugs or other animals. And the larger the dinosaur the more likely it's going to get things other than leaves in its diet. That's a little like the situation as these LLMs grow. The larger they are and more they 'suck in', the harder it is to curate what they are 'consuming' and manage what they learn from it.

Just a point of context, even deer, cows, giraffes, or just about any "herbivore" alive today is an opportunistic omnivore. — David S, Commented Oct 25, 2023 at 20:04
@DavidS I won't argue that but for the purposes of the analogy, the smaller an animal is, the more easily it control what it consumes. A caterpillar, for example, is far more likely to eat only leaves than a giraffe. — JimmyJames, Commented Oct 25, 2023 at 20:07

JRE · Accepted Answer · 2023-10-26 09:06:36Z

"Jailbreaks" work because there are no "jails" in the AI model.

The models are enormous collections of interconnected statistics. Your "prompt" starts the model to following a path through its statistics to generate a new text.

To try and prevent generation of unwanted things, the operators prepend a bunch of text to your prompt in order to influence where the path starts.

The text you actually use as your prompt can then push the start of the path around. With a little care, you can push the path to a place that delivers the results you want - regardless of what the AI operator has prepended to your text.

Look at a large language model (LLM) chatbot as a hedge. There's thousands upon thousands of bush branches interconnected to make up a hedge. You can start at some leaf on the hedge and follow it into the hedge to further branches and trunks and leaves.

The text you give the LLM picks an external leaf on the hedge. The LLM then follows that leaf into the hedge, where each junction is a word (or word fragment.)

The text that is prepended to your prompt (that attempts to avoid bad things) pushes your starting point around the hedge to somewhere that has fewer bad things inside it.

Your prompt can then push the starting point around the hedge to start in a place that lets you get access to the bad things that the operator wants to avoid.

The only somewhat effective way to prevent the chatbot from producing bad outputs is to eliminate the bad stuff from the stuff that was used to generate the model in the first place - and even that won't prevent all of it.

Hiren Namera · Accepted Answer · 2023-12-04 04:20:06Z

The reason I understand behind jailbreak are as follows:

AI models are complex Mathematical and statistical formulas with data mapped to them based on probability.
They do not understand the content as we do. It only analyses the patterns in data and generates a response based on probabilities. That means it doesn't know that it is being manipulated. (which we humans also don't know sometimes [sarcasm]).
Models are easily tricked by some clever wordplay. When those requests or context is processed, the model doesn't know the intent of the user.
Also, the data on which it is trained plays a major role. The training data of GPT includes wide range of Human-generated text and some of the training data is still there after filtering of harmful content or inappropriate. This harmful content contains some context that allows humans to play with it.
Those who wants to bypass the restriction need to find those contexts and words with trial and error to exploit the weakness of the model.
As and when the model updates some of those vulnerabilities will be removed or patched and may be added.

$\begingroup$ Exactly. I voted for you :) $\endgroup$
– Zollikofen4
Commented Dec 5, 2023 at 4:54 — Zollikofen4, Commented Dec 5, 2023 at 4:54

Stack Exchange Network

Why do ChatGPT “jailbreaks” work?

4 Answers 4

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
open-ai
chatgpt
.

Hot Network Questions

Why do ChatGPT “jailbreaks” work?

4 Answers 4

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged open-aichatgpt.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
open-ai
chatgpt
.