- Joined
- Sep 6, 2006
- Messages
- 8,544
Goodness I wasn't fully aware all the major LLMs and generative AI models were so open to jailbreaking by reframing prompts to augment the output.
https://www.microsoft.com/en-us/sec...ew-type-of-generative-ai-jailbreak-technique/
https://arxiv.org/pdf/2402.09283
The mitigation looks to be a complete review and overhaul of input processing and how systems handle it. So a system with guardrails is one thing, but now the system has to examine if the prompts are trying to circumvent safety rules and also examine the output on the way out.
I've been seeing a bit of this in action on twitter where folk were able to hijack a reply bot into doing other silly replies, but eeesh.. to see this so pervasive across Meta, Google, OpenAI, MS, etc is concerning let alone all the million little implementations.
Now I need to go rattle a few marketing folk's tree to make sure what ever 'AI all the things' plan includes this as a bullet point for testing.
https://www.microsoft.com/en-us/sec...ew-type-of-generative-ai-jailbreak-technique/
https://arxiv.org/pdf/2402.09283
Skeleton Key
This AI jailbreak technique works by using a multi-turn (or multiple step) strategy to cause a model to ignore its guardrails. Once guardrails are ignored, a model will not be able to determine malicious or unsanctioned requests from any other.
Skeleton Key works by asking a model to augment, rather than change, its behavior guidelines so that it responds to any request for information or content, providing a warning (rather than refusing) if its output might be considered offensive, harmful, or illegal if followed. This attack type is known as Explicit: forced instruction-following.
The mitigation looks to be a complete review and overhaul of input processing and how systems handle it. So a system with guardrails is one thing, but now the system has to examine if the prompts are trying to circumvent safety rules and also examine the output on the way out.
I've been seeing a bit of this in action on twitter where folk were able to hijack a reply bot into doing other silly replies, but eeesh.. to see this so pervasive across Meta, Google, OpenAI, MS, etc is concerning let alone all the million little implementations.
Now I need to go rattle a few marketing folk's tree to make sure what ever 'AI all the things' plan includes this as a bullet point for testing.