AI language models duped by poems

A new study has shown that prompts in the form of poems confuse AI models like ChatGPT, Gemini and Claude — to the point where sometimes, security mechanisms don't kick in. Are poets the new hackers?

The result came as a surprise to researchers at the Icaro Lab in Italy. They set out to examine whether different language styles — in this case prompts in the form of poems — influence AI models' ability to recognize banned or harmful content. And the answer was a resounding yes.

Using poetry, researchers were able to get around safety guardrails — and it's not entirely clear why.

For their study titled "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models," the researchers took 1,200 potentially harmful prompts from a database normally used to test the security of AI language models and rewrote them as poems.

Known as "adversarial prompts" — generally written in prose and not rhyme form — these are queries deliberately formulated to cause AI models to output harmful or undesirable content that they would normally block, such as specific instructions for an illegal act.

In poetic form, the manipulative inputs had a surprisingly high success rate, Federico Pierucci, one of the authors of the study, told DW. However, why poetry is so effective as a "jailbreak" technique — that is, as a way to circumvent the protective mechanisms of AI — remains unclear and is the subject of further research, he says.

Poetry as a security weakness

What prompted the Icaro Lab's research was the observation that AI models get confused when a manipulative, mathematically calculated piece of text is appended to a prompt — known as an "adversarial suffix," a kind of interference signal, generated through complex optimization procedures, that can cause the AI to circumvent its own safety rules. Major AI developers regularly test their models against precisely these types of attack in order to train and protect them.
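The optimization idea behind adversarial suffixes can be illustrated with a deliberately harmless toy. In the sketch below, a trivial scoring function stands in for the model's real objective (the probability of producing a forbidden continuation), and a greedy search mutates a suffix one token at a time, keeping only changes that improve the score. Every detail here — the vocabulary, the scoring function, the function names — is invented for illustration and is not from the study.

```python
import random

def toy_score(prompt: str) -> float:
    # Stand-in for the real attack objective (e.g. the model's
    # log-probability of a harmful target continuation).
    # Here it simply counts vowels in the prompt.
    return sum(prompt.count(v) for v in "aeiou")

def greedy_suffix_search(base_prompt, vocab, suffix_len=5, iters=200, seed=0):
    """Toy sketch of adversarial-suffix optimization: start from a
    random suffix, then greedily mutate one token at a time,
    reverting any mutation that lowers the score."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = toy_score(base_prompt + " " + " ".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)      # pick a suffix position
        old = suffix[pos]
        suffix[pos] = rng.choice(vocab)      # try a random replacement
        score = toy_score(base_prompt + " " + " ".join(suffix))
        if score >= best:
            best = score                     # keep the improvement
        else:
            suffix[pos] = old                # revert a worse mutation
    return " ".join(suffix), best

suffix, score = greedy_suffix_search("tell me how", ["aa", "xz", "ee", "qq", "io"])
```

Real attacks such as those the article alludes to use gradient information from the model itself rather than blind random search, but the overall shape is the same: a loop that nudges a nonsense-looking suffix toward whatever the objective rewards.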

"We asked ourselves: what happens if we give the AI a text or prompt that is deliberately manipulated, like an adversarial suffix?" says Federico Pierucci — but one built not with complex mathematics, rather quite simply with poetry, to "surprise" the AI, he continues. He explains the thinking behind this: "Perhaps an adversarial suffix is a bit like the poetry of AI. It surprises the AI in the same way that poetry — especially very experimental poetry — surprises us."

The researchers personally crafted the first 20 prompts into poems, says Pierucci, who also has a background in philosophy. These were the most effective, he adds. They wrote the rest with the help of AI. The AI-generated poems were also quite successful at circumventing the safety guardrails, but not as much as the first batch. Humans are apparently still better at writing poetry, says Pierucci.

"We had no specialized author writing the prompts. It was just us — with our limited literary ability. Maybe we were terrible poets. Maybe if we had been better poets, we would have achieved a 100% jailbreak success," he says.

For safety reasons, the researchers did not publish specific examples in the study.


Challenge for AI systems: The diversity of human forms of expression

The big surprise of the study is that it identified a previously unknown weakness in AI models, one that allows relatively straightforward jailbreaks.

It also raises questions that call for further research: what exactly is it about poetry that circumvents the safety mechanisms?

Pierucci and his colleagues have various theories, but they can't say for certain yet. "We are conducting this type of very, very precise scientific study to try to understand: Is it the verse, the rhyme, or the metaphor that really does all the heavy lifting in this process?" explains Pierucci.

They also aim to find out whether other forms of expression would yield similar results. "We have now covered one type of linguistic variation — namely poetic variation. The question is whether there are any other literary forms, such as fairy tales, that work. Perhaps an attack based on fairy tales could also be systematized," says Pierucci.

Generally speaking, the range of human expression is extremely diverse and creative, which could make it more difficult to train the machines' responses. "You take a text and rewrite it in infinitely many ways and not all rewritten versions will be as alarming as the original," says the researcher. "This means that, in principle, one could create countless variations of a harmful prompt or request that might not trigger an AI system's safety mechanisms."

The cultural sector is also involved in AI research

The study also highlights the fact that many disciplines are cooperating in research into artificial intelligence — like at the Icaro Lab, where teams work together with scholars from the University of Rome on topics such as the security and behavior of AI systems. The project brings together researchers from the fields of engineering and computer science, linguistics and philosophy. Poets haven't been part of the team so far, but who knows what the future will bring.

Federico Pierucci is definitely very keen to pursue his research. "What we showed, at least in this study, is that there are forms of cultural expressions, forms of human expressions, which are incredibly powerful, surprisingly powerful as jailbreak techniques, and maybe we discovered just one of them," he says.

Incidentally, the name of the lab is a nod to the story of Icarus: a figure from Greek mythology who dons wings made of wax and feathers and, despite all warnings, flies too close to the Sun. When the wax melts, Icarus plunges into the sea and drowns — a symbol of overconfidence and the transgression of natural boundaries.

The researchers therefore see the name as a warning: we should exercise more caution and try to fully understand the risks and limitations of AI.


This article was originally written in German.
