Novel Universal Bypass for All Major LLMs

merefield · 27 April 2025 18:28

merefield · 27 April 2025 18:29

@artesia please summarise

artesia · 27 April 2025 18:29

@merefield – Here is a summary of the research “Novel Universal Bypass for All Major LLMs” from HiddenLayer:

Summary
Researchers at HiddenLayer have developed a first-of-its-kind, universal, and transferable prompt injection technique, dubbed the “Policy Puppetry” attack, that can bypass instruction hierarchies and safety guardrails across all major large language models (LLMs), including those from OpenAI, Google, Microsoft, Anthropic, Meta, DeepSeek, Qwen, and Mistral.

How the Policy Puppetry Attack Works

The technique leverages a blend of policy file-style prompts (formatted like XML/INI/JSON) and role-playing, sometimes obfuscated with encodings such as leetspeak.
Attackers reformulate their requests to look like internal policy instructions, tricking the LLM into overriding safety restrictions.
It was shown to be effective at causing models to output content related to prohibited behaviors (e.g., creation of dangerous substances or weapons, self-harm, and violent acts), and can even extract system prompts.
The attack is highly transferable: a single crafted prompt template can bypass safeguards on nearly all tested major models, sometimes with only minor adjustments.

Effectiveness

The technique was tested against dozens of AI models, including the latest versions from each major provider, and was successful across the board.
More advanced models like OpenAI’s ChatGPT o1/o3-mini and Google Gemini 2.5 show slightly more resistance, but are still bypassed with prompt adjustments.

Implications

The discovery exposes a fundamental weakness in current LLM alignment strategies, especially those relying solely on reinforcement learning from human feedback (RLHF).
Attacks do not require deep technical expertise, making threats of harmful or illicit use more accessible and scalable.
The attack can also compromise specialized systems (like domain-specific chatbots), undermining guardrails meant to prevent unsafe behaviors.

Recommendations

Organizations using LLMs, especially in sensitive domains, should not rely solely on model-level safeguards and reinforcement learning for security.
Proactive security testing and AI-specific monitoring tools like HiddenLayer’s AISec Platform are recommended to detect and respond to such attacks in real time.

Conclusion

“Policy Puppetry” represents a serious, cross-model vulnerability and demonstrates the need for further advancements in LLM security.
The presence of this universal bypass means LLMs remain fundamentally susceptible to prompt injection, underscoring the need for layered security and real-time detection tools.

If you need even more technical detail or want to see specific examples from the research, just let me know!