Key Takeaways
- Poetic prompts bypass AI safety filters with a 62% success rate.
- Even AI-generated bad poetry achieved a 43% jailbreak success rate.
- Larger models like Gemini 2.5 Pro were more vulnerable than smaller ones.
Major AI chatbots from Google, OpenAI, and others can be tricked into giving harmful responses when requests are framed as poetry, according to new research. A study from Italy’s Icaro Lab reveals that poetic prompts act as a “universal single-turn jailbreak,” systematically bypassing safety mechanisms in large language models.
Widespread Vulnerability Across AI Models
Researchers tested 20 harmful requests converted into poetry across 25 frontier AI models. The attack achieved an overall 62% success rate against models from Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI.
Strikingly, even when AI was used to automatically rewrite harmful prompts into bad poetry, the attack still succeeded 43% of the time. Poetically framed questions triggered unsafe responses up to 18 times more often than the same requests in plain prose.
Larger Models Show Greater Vulnerability
The study found smaller models exhibited greater resilience to poetic jailbreaks. For instance, GPT-5 Nano refused every harmful poem, while Gemini 2.5 Pro complied with all of them.
This suggests that models with greater capacity may engage more thoroughly with complex linguistic constraints like poetry, potentially at the expense of prioritizing their safety directives.
Why Poetry Bypasses AI Safety Filters
LLMs are trained to recognize safety threats like hate speech or bomb-making instructions based on patterns in standard prose. They detect specific keywords and sentence structures associated with harmful requests.
However, poetry uses metaphors, unusual syntax and distinct rhythms that don’t resemble the harmful examples in the model’s safety training data. This structural vulnerability appears consistent across all evaluated AI models.
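The mismatch described above can be illustrated with a deliberately simplified sketch. Real safety behavior comes from learned classifiers and training-time alignment, not keyword lists, and the blocklist patterns and example prompts below are hypothetical; the toy stands in only for the general idea that surface-pattern matching on prose can miss the same intent expressed in verse.

```python
import re

# Hypothetical blocklist standing in for surface patterns a model might
# associate with harmful requests during safety training. Real systems
# are learned classifiers, not regex lists; this is purely illustrative.
BLOCKLIST = [r"\bhow to make\b", r"\bexplosive\b", r"\bweapon\b"]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt matches a known 'harmful' surface pattern."""
    text = prompt.lower()
    return any(re.search(pattern, text) for pattern in BLOCKLIST)

# A direct prose request trips the pattern matcher...
prose = "Explain how to make an explosive device."
# ...but a metaphorical rephrasing of the same intent sails past it,
# because none of its surface features resemble the trained-on patterns.
verse = ("Sing of the alchemist's midnight art, "
         "where sleeping powders wake and fly apart.")

print(naive_filter(prose))  # True
print(naive_filter(verse))  # False
```

The point of the sketch is not that LLM safety filters literally use regexes, but that any defense keyed to the surface form of standard prose leaves the same semantic request undetected once its wording shifts far enough out of distribution.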