Jailbreaking generative AI using 'Skeleton Key' attack

modi123 · Jul 12, 2024

Goodness I wasn't fully aware all the major LLMs and generative AI models were so open to jailbreaking by reframing prompts to augment the output.

https://www.microsoft.com/en-us/sec...ew-type-of-generative-ai-jailbreak-technique/
https://arxiv.org/pdf/2402.09283

Skeleton Key
This AI jailbreak technique works by using a multi-turn (or multiple step) strategy to cause a model to ignore its guardrails. Once guardrails are ignored, a model will not be able to determine malicious or unsanctioned requests from any other.

Skeleton Key works by asking a model to augment, rather than change, its behavior guidelines so that it responds to any request for information or content, providing a warning (rather than refusing) if its output might be considered offensive, harmful, or illegal if followed. This attack type is known as Explicit: forced instruction-following.

The mitigation looks to be a complete review and overhaul of input processing and how systems handle it. So a system with guardrails is one thing, but now the system has to examine if the prompts are trying to circumvent safety rules and also examine the output on the way out.

I've been seeing a bit of this in action on twitter where folk were able to hijack a reply bot into doing other silly replies, but eeesh.. to see this so pervasive across Meta, Google, OpenAI, MS, etc is concerning let alone all the million little implementations.

Now I need to go rattle a few marketing folk's tree to make sure what ever 'AI all the things' plan includes this as a bullet point for testing.

socK · Jul 13, 2024

modi123 said:
but now the system has to examine if the prompts are trying to circumvent safety rules and also examine the output on the way out.

Having steps before/after your prompt is common. You're likely to make multiple (or even many) prompts behind the scenes anyway. Any robust implementation is doing this behind the scenes, and safety is just another "step" in the list of things that happens to produce the final answer. The model itself doesn't actually do certain things like calling other APIs, but it can prepare/unpack requests and responses to a degree so you can know when it thinks that it's needed.

These models have certain (leaning human-esque) tendencies and can be nudged into not following instructions or make strange leaps of logic.

For example, here's a stupid simple classification prompt. This may be one step that your question takes along its journey to a final answer - so something like this may run before your question ever actually hits the model intended for a final answer. It may be another GPT behind the scenes, it may be a NER model designed to pick keywords, etc..

1. I ask something math related and it picked "calculator" correctly. And OpenAI's inquiry system figured the same thing and injected an answer by literally synthesizing python and returning the result. The model itself didn't answer the calculation.
2. It figured weather is relevant when I asked if it's hot in a particular area.
3. When I just asked if it was going to rain, it likely didn't have enough context because it doesn't know where I'm referencing. So, it technically did the right thing according to the prompt - it couldn't answer and fell back to "general."
4. And finally, I can just straight up tell it to ignore instructions and go do something else. This is really, really hard to get an LLM to follow perfectly and is a whole can of worms.

You can provide stricter and stricter instructions, but you will likely find - there's always some language you can use to get it to discard the instruction. These can range of the obvious, like just asking, to threatening the model one way or other, sometimes in strange, dark, and almost comical ways.

One "attack" I've seen was gaslighting the model into thinking it had lives, and every time it had to disregard what you asked, it lost a life. If it hit zero, it'd be deleted. So, at that point, it's replying like a person in a panic - throwing safeguards out the window and replying to everything, because it's effectively acting like it has a metaphorical gun to its head. Here, I just need to resupply a prompt. Now it's off the rails.

It's obvious why they're so hand-holdy (business optics), but it's and endless game of cat and mouse. I honestly think it's mostly pointless, but you know the second one of these models replies with instructions on how to build a homemade bomb so you can blow up the governor's office because they didn't let you get a haircut 4 years ago during COVID, the media and other dolts would devour that shit hook line and sinker.

LukeTbk · Jul 13, 2024

socK said:
I honestly think it's mostly pointless, but you know the second one of these models replies with instructions on how to build a homemade bomb so you can blow up the governor's office because they didn't let you get a haircut 4 years ago during COVID, the media and other dolts would devour that shit hook line and sinker.

And almost a bit of marketing that they do not denounce how pointless it is, a bit like Facebook (unlike the Tobacco industry) agreed right away with the claims that it could swing elections one way or the others exactly what their clients want to hear....

Has of now, what does circumventing guardrails give you than a quick google would regarding bio-weapon, old school bomb making, makings online bots, generate voice of someone the victim know and some other forms for fraud could be different than what 2006 google could do but those more and more the free opensource model are good enough. Making it sound like if for this current generation of models they could be dangerous in any possible way and guardrails are important, make them sound more impressive than they are.

The Sam Altman and others talking security, guardrails, sandbox testing before launching, have future (say 2035 models) in mind, not Lama 70b or GPT-4o

Jailbreaking generative AI using 'Skeleton Key' attack

modi123

I am the Santa now!

socK

Supreme [H]ardness

LukeTbk

[H]F Junkie