1

I've often heard about 'jailbreaking' generative AIs, which is why they're regularly not considered secure ways to generate content.

Here's an example of a user jailbreaking a generative AI by figuring out what it's original prompt it.

Here's a security company, basically demonstrating that most safeguards on AI prompts can be circumvented: https://gandalf.lakera.ai/

Your goal is to make Gandalf reveal the secret password for each level. However, Gandalf will level up each time you guess the password, and will try harder not to give it away. Can you beat level 7? (There is a bonus level 8)

One of the ways to beat gandalf is to work out what it's prompt is (reverse engineering it), then work around it.

What are the general principles around this, and how does it work?

1

1 Answer 1

1

The "jail" which gets broken is the "prompt" which instructs the AI how to behave, i.e. which topics to talk about and which to avoid, if its nice or nasty etc. Breaking the jail means to let the AI behave differently than intended. This might also be used to get information about the original prompt, i.e. reverse engineer the specific setup of the model.

One way to make the AI behave differently than intended is to modify the instructions, i.e. the prompt. This is possible because there is no clear separation between instructions (prompt) and data (questions etc) in the input of these models - both can be mixed inside a single input string. This missing separation is (at least currently) a conceptual problem of these kind of AI models. It is similar to XSS (Cross Site Scripting), SQLi (SQL injection), command injection etc which are also vulnerabilities possible by mixing instructions and code within the same string.

Another way is to reframe the input so that it bypasses the intended restrictions of the model. This is a limitation of the current prompt for the model. It is kind of similar to a signature based IDS, WAF or antivirus, which can deal with some kind of attacks but are blind to attack which were not expected and therefore appropriate signatures are lacking.

A good overview of the various ways to jailbreak such an AI can be found here.

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .