Show HN:Flakestorm – AI 代理的混沌工程(本地優先,開源)

Hacker News·

一款名為 Flakestorm 的新開源工具被推出,它將混沌工程的原則應用於測試 AI 代理的可靠性,方法是生成提示的對抗性變異。該工具旨在識別標準評估分數之外的故障模式,特別是針對本地模型。

Image

I’ve been working on an open-source tool called Flakestorm to test the reliability of AI agents before they hit production.

Most agent testing today focuses on eval scores or happy-path prompts. In practice, agents tend to fail in more mundane ways: typos, tone shifts, long context, malformed input, or simple prompt injections — especially when running on smaller or local models.
Flakestorm applies chaos-engineering ideas to agents. Instead of testing one prompt, it takes a “golden prompt”, generates adversarial mutations (semantic variations, noise, injections, encoding edge cases), runs them against your agent, and produces a robustness score plus a detailed HTML report showing what broke.

Key points:
Local-first (uses Ollama for mutation generation)

Tested with Qwen / Gemma / other small models
Works against HTTP agents, LangChain chains, or Python callables
No cloud or API keys required
This started as a way to debug my own agents after seeing them behave unpredictably under real user input. I’m still early and trying to understand how useful this is outside my own workflow.

I’d really appreciate feedback on:
Whether this overlaps with how you test agents today
Failure modes you’ve seen that aren’t covered
Whether “chaos testing for agents” is a useful framing, or if this should be thought of differently
Repo: https://github.com/flakestorm/flakestorm
Docs are admittedly long.

Thanks for taking a look.

Image

Hacker News

相關文章

  1. Show HN:AI 代理的混沌工程

    4 個月前

  2. 混沌代理人

    26 天前

  3. Show HN:使用AI代理進行生產環境測試

    3 個月前

  4. Show HN:FailWatch – AI 代理的故障關閉斷路器

    4 個月前

  5. Show HN:JsonUI – 透過程式碼結構而非提示來約束 AI 代理

    3 個月前