I've often heard about 'jailbreaking' generative AIs, which is why they're regularly not considered secure ways to generate content.
Here's an example of a user jailbreaking a generative AI by figuring out what it's original prompt it.
Here's a security company, basically demonstrating that most safeguards on AI prompts can be circumvented: https://gandalf.lakera.ai/
Your goal is to make Gandalf reveal the secret password for each level. However, Gandalf will level up each time you guess the password, and will try harder not to give it away. Can you beat level 7? (There is a bonus level 8)
One of the ways to beat gandalf is to work out what it's prompt is (reverse engineering it), then work around it.
What are the general principles around this, and how does it work?