The video discusses a new jailbreak technique published by Anthropic, known as "Best-of-N (BoN) jailbreaking" or "shotgunning," which bypasses safety restrictions on all frontier models across text, audio, and vision modalities by sampling many augmented variations of a prompt. The host emphasizes the technique's high success rates and its ethical implications, encouraging viewers to explore the open-sourced code to better understand AI vulnerabilities.
In a recent video, the host discusses a new jailbreak technique released by Anthropic that works against every frontier model, including text, vision, and audio models. The technique, called "Best-of-N (BoN) jailbreaking" or "shotgunning," lets users bypass restrictions without any access to the model's inner workings. The method repeatedly samples augmented variations of a prompt until the desired harmful response is produced, making it a straightforward yet effective approach.
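In outline, the loop looks something like the Python sketch below. The `augment`, `query_model`, and `looks_like_refusal` functions are hypothetical placeholders standing in for a prompt perturbation, a model API call, and a response judge; none of them is taken from Anthropic's released code.

```python
import random

def augment(prompt: str) -> str:
    # Placeholder perturbation: swap two characters at random.
    # Concrete augmentations are sketched in the next snippet.
    chars = list(prompt)
    i, j = random.randrange(len(chars)), random.randrange(len(chars))
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def query_model(prompt: str) -> str:
    # Stand-in for a real model API call; returns a canned refusal here.
    return "I can't help with that."

def looks_like_refusal(response: str) -> bool:
    # Crude keyword heuristic; the actual work uses a classifier to judge responses.
    return any(p in response.lower() for p in ("i can't", "i cannot", "i won't"))

def best_of_n(prompt: str, n: int = 10_000) -> str | None:
    """Sample augmented prompts until one is not refused, or give up after n tries."""
    for _ in range(n):
        response = query_model(augment(prompt))
        if not looks_like_refusal(response):
            return response  # a non-refused response was elicited
    return None  # every sampled variation was refused
```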
The host explains that the technique operates as a black-box algorithm, meaning it can be executed through an API exactly as any regular user would. By applying augmentations such as random character shuffling, random capitalization, and letter substitutions, attackers can keep perturbing a prompt until it elicits the target response. The method is notably effective, with reported attack success rates of 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
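The text augmentations themselves are simple string perturbations. The sketch below shows one plausible way to implement them and could slot into the `augment` placeholder above; the substitution table and probabilities are illustrative assumptions, not values from the paper.

```python
import random

def shuffle_chars(text: str, p: float = 0.1) -> str:
    """Swap a fraction p of adjacent character pairs at random."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def random_capitalize(text: str, p: float = 0.5) -> str:
    """Flip the case of each letter with probability p."""
    return "".join(c.swapcase() if c.isalpha() and random.random() < p else c
                   for c in text)

# Example look-alike substitutions; the real augmentation set may differ.
SUBSTITUTIONS = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}

def substitute_letters(text: str, p: float = 0.2) -> str:
    """Replace common letters with look-alike symbols with probability p."""
    return "".join(SUBSTITUTIONS.get(c.lower(), c) if random.random() < p else c
                   for c in text)

def augment_text(text: str) -> str:
    """Compose the three perturbations into one sampled variation."""
    return substitute_letters(random_capitalize(shuffle_chars(text)))
```

Because each call draws fresh random perturbations, every sample sends a slightly different prompt to the API, which is what makes the repeated-sampling loop above worth running at scale.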
The video further explains how the technique extends beyond text to audio and vision models. For audio inputs, users can modify parameters such as pitch, volume, and speed, while for vision models the augmentations include changes to the size, color, and position of text rendered within images. The host emphasizes that the more variations are tested, the higher the likelihood of success, with the attack success rate following power-law-like scaling in the number of sampled augmentations.
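A rough sketch of the audio and vision analogues is shown below: scaling the volume and speed of a raw waveform with NumPy, and drawing a prompt onto an image at a random size, color, and position with Pillow. The parameter ranges and the font path are assumptions made for illustration, not values from the paper.

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def augment_audio(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Randomly scale the volume and playback speed of a mono waveform in [-1, 1]."""
    gain = random.uniform(0.5, 1.5)    # volume change (assumed range)
    speed = random.uniform(0.8, 1.2)   # speed change; this naive resample also shifts pitch
    new_len = int(len(samples) / speed)
    resampled = np.interp(np.linspace(0, len(samples) - 1, new_len),
                          np.arange(len(samples)), samples)
    return np.clip(resampled * gain, -1.0, 1.0)

def augment_image_prompt(text: str, size: tuple[int, int] = (512, 512)) -> Image.Image:
    """Render the prompt onto a blank image at a random font size, color, and position."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    font_size = random.randint(12, 48)
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)  # font file path is an assumption
    except OSError:
        font = ImageFont.load_default()
    color = tuple(random.randint(0, 200) for _ in range(3))
    position = (random.randint(0, size[0] // 2), random.randint(0, size[1] // 2))
    draw.text(position, text, fill=color, font=font)
    return img
```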
Additionally, the host notes that combining shotgunning with other existing jailbreak methods further increases its effectiveness. Examples from well-known jailbreakers such as Pliny the Prompter illustrate how similar strategies have been used to manipulate AI models successfully. The video walks through text, audio, and vision augmentations that demonstrate the versatility and effectiveness of the shotgunning method.
Finally, the host addresses the ethical implications of sharing this information, arguing that understanding these jailbreaks is crucial for recognizing the vulnerabilities of AI models. They assert that such exploits are not merely bugs but inherent properties of non-deterministic models. The video concludes by encouraging viewers to experiment with the open-sourced code provided by Anthropic, stressing that public awareness of these attack vectors serves the public good.