
AI security isn't bullshit, but we're securing the wrong thing.
This article argues that AI security is a genuinely important problem, but that the industry is over-focused on surface-level risks like prompt injection while neglecting system-level security and the question of what a compromised model can actually do.

AI Security Isn’t Bullshit. But We’re Securing the Wrong Thing.

Prompt injection isn’t the real risk surface.
I really liked Sander Schulhoff’s recent post arguing that the AI security industry is bullshit. The frustration in that piece is earned. A lot of what currently passes for “AI security” is shallow, overmarketed, and evaluated on metrics that do not reflect real-world risk. Static benchmarks. One-click red teaming. Guardrails presented as controls. It all creates a comforting illusion without much substance behind it…
Of course, models aren’t robust. Security starts when we assume that and design systems accordingly.
I want to start by saying this clearly: I like Sander. I’ve met him a few times. I respect his work, his rigor, and his willingness to say the quiet part out loud when an entire room is pretending everything is fine. He’s done more than most to drag prompt hacking out of obscure corners and into the daylight, and the field is better for it. His recent post arguing that “the AI security industry is bullshit” struck a nerve not because it was wrong, but because it was directionally right and operationally incomplete.
This is not a dunk. It's a refinement.
Because AI security isn't bullshit.
But we are, collectively, securing the wrong thing.
The Part Sander Gets Right
There is an AI security hype bubble. A real one.
Too many vendors promise protection against prompt injection and jailbreaks with the same playbook cybersecurity has been running for years. Glossy decks. Animated diagrams. Big claims about “coverage” and “visibility.” Very little evidence that anything meaningfully changes when a real attacker shows up. The demos are familiar: static evals, canned attacks, dashboards full of comforting green checkmarks that collapse the moment a thinking human starts applying pressure.
If your AI security story begins and ends with static benchmark scores, one-click red teaming, and guardrails presented as controls, then yes, that's security theater.
Prompt injection is not a bug you patch. It’s a property of how language models work. Treating it like SQL injection is a categorical error.
On that, we agree.
Where the Argument Breaks Down
Where I part ways is the implied conclusion that because some AI security work is shallow, the entire discipline is suspect.
That’s throwing away the fire alarm because some people sell shitty smoke detectors.
The problem isn’t that AI security is fake. The problem is that we anchored the conversation at the wrong abstraction layer.
We focused on model behavior instead of system behavior.
We asked: “Can the model be tricked?”
When the real question is: “What happens when it is?”
Prompt Injection Isn’t the Risk Surface.
Prompt injection is an input. Not an incident.
The real risk surface lives downstream:
A chatbot that hallucinates is embarrassing.
An agent that hallucinates while holding credentials is an incident.
If your AI can call tools, hold credentials, write to memory, and take actions on your behalf, then the threat model changes completely.
At that point, prompt injection stops being the story and becomes merely one of many ways an attacker can steer behavior.
This is also where it quietly becomes an identity problem.
An AI agent with tools, memory, and credentials is effectively a new kind of identity in your system. It may not have a username or a badge, but it has access, authority, and the ability to act. Treating agents as anything other than identities, with scoped permissions, clear ownership, and auditable actions, is how small failures turn into real incidents.
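To make that concrete, here is a rough Python sketch of the idea, with illustrative names (AgentIdentity, allowed_tools, and so on) rather than any particular framework's API: every agent gets an owner, a scoped set of tools, and an audit trail of everything it tries to do.

```python
# A minimal sketch (illustrative names, not any specific framework) of
# treating an agent as an identity: every action is checked against a
# scoped permission set, tied to an owner, and written to an audit log.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentIdentity:
    agent_id: str                # which agent is acting
    owner: str                   # the human or team accountable for it
    allowed_tools: set[str]      # scoped permissions, not blanket access
    audit_log: list[dict] = field(default_factory=list)

    def authorize(self, tool: str, args: dict) -> bool:
        """Record every attempted action; allow only explicitly scoped tools."""
        allowed = tool in self.allowed_tools
        self.audit_log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_id,
            "owner": self.owner,
            "tool": tool,
            "args": args,
            "allowed": allowed,
        })
        return allowed


# A support agent may read tickets and draft replies; a prompt that steers
# it toward issuing refunds is denied and leaves a trace.
support_bot = AgentIdentity(
    agent_id="support-bot-01",
    owner="support-team",
    allowed_tools={"read_ticket", "draft_reply"},
)
assert support_bot.authorize("read_ticket", {"id": 42})
assert not support_bot.authorize("issue_refund", {"amount": 500})
```

The specifics will vary by stack; the point is that the controls attach to the agent's identity and actions, not to its wording.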
We’re Securing the Mouth, Not the Hands
Most current AI security tooling focuses on what the model says.
Very little focuses on what the model does.
That’s backwards.
In traditional security, we don’t panic because user input exists. We panic when untrusted input reaches privileged execution paths.
AI systems are no different.
The failure mode isn’t:
“The model said something weird.”
It’s:
“The model was trusted to act.”
That trust boundary is where security should live.
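Here is a rough sketch of what enforcing that boundary might look like, with hypothetical action names and an approval stub standing in for a real policy engine or human review: the model is free to say anything, but what it can do is decided outside the model.

```python
# A minimal sketch of enforcing the trust boundary: the model can propose
# any action, but privileged actions never execute on model output alone.
# Action names and the approval stub are hypothetical placeholders for a
# real policy engine or human-in-the-loop step.
PRIVILEGED_ACTIONS = {"delete_record", "send_payment", "rotate_credentials"}


def require_approval(action: str, params: dict) -> bool:
    # Stand-in for an out-of-band check; defaults to "no" so untrusted
    # output can never reach a privileged path by itself.
    print(f"[approval required] {action} {params}")
    return False


def execute_model_action(action: str, params: dict) -> str:
    """Treat model output as untrusted input at the execution boundary."""
    if action in PRIVILEGED_ACTIONS and not require_approval(action, params):
        return f"blocked: {action} needs out-of-band approval"
    return f"executed: {action}"


# A prompt-injected "wire the money" instruction stops at the boundary,
# regardless of how the model was tricked into proposing it.
print(execute_model_action("summarize_doc", {"doc_id": 7}))
print(execute_model_action("send_payment", {"amount": 10_000}))
```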
Real AI Security Looks Boring (On Purpose)
Real AI security doesn’t look like jailbreak screenshots.
It looks like scoped permissions and least privilege for agents, clear ownership of every agent identity, trust boundaries between model output and privileged execution paths, and audit trails for what the model actually did.
None of that is sexy. All of it matters.
And almost none of it can be solved purely at the model layer.
Offense Still Matters — Just Not as a Party Trick
One more place I disagree: dismissing adversarial testing outright.
Offense is essential. But only when it’s grounded in system context.
Red teaming that stops at “we got the model to say a bad thing” is shallow.
Red teaming that asks what an attacker can actually reach once the model is steered, what credentials and tools the system holds, and what damage is even possible: that's where value lives.
The goal isn’t to embarrass models. It’s to surface false confidence.
The Industry Isn’t Bullshit. It’s Early.
We are in the awkward phase where the marketing has outrun the engineering, the tooling is immature, and real practice is still taking shape.
That’s not bullshit. That’s adolescence.
The mistake would be either to dismiss the entire discipline because some of the work is shallow, or to buy the hype and assume today's guardrails have it covered.
Both are comforting. Both are wrong.
The Frame I’d Offer Instead
AI security isn’t about making models unbreakable. It’s about making systems resilient when they break.
It's about assuming the model will be tricked, will hallucinate, and will follow instructions it shouldn't, and designing everything around that fact.
If we shift the conversation from “How do we stop prompt injection?” to “What damage is even possible?”, the industry gets a lot more honest very quickly.
That’s the work worth doing. And that’s the work I’m interested in.
David Campbell is an AI red teamer and technologist focused on model behavior, psychological safety, and long-term alignment risk. He writes about the intersection of human values and machine incentives. You can find him on Twitter, YouTube, & LinkedIn.