Show HN:Flakestorm – AI 代理的混沌工程(本地優先,開源)
一款名為 Flakestorm 的新開源工具被推出,它將混沌工程的原則應用於測試 AI 代理的可靠性,方法是生成提示的對抗性變異。該工具旨在識別標準評估分數之外的故障模式,特別是針對本地模型。
I’ve been working on an open-source tool called Flakestorm to test the reliability of AI agents before they hit production.
Most agent testing today focuses on eval scores or happy-path prompts. In practice, agents tend to fail in more mundane ways: typos, tone shifts, long context, malformed input, or simple prompt injections — especially when running on smaller or local models.
Flakestorm applies chaos-engineering ideas to agents. Instead of testing one prompt, it takes a “golden prompt”, generates adversarial mutations (semantic variations, noise, injections, encoding edge cases), runs them against your agent, and produces a robustness score plus a detailed HTML report showing what broke.
Key points:
Local-first (uses Ollama for mutation generation)
Tested with Qwen / Gemma / other small models
Works against HTTP agents, LangChain chains, or Python callables
No cloud or API keys required
This started as a way to debug my own agents after seeing them behave unpredictably under real user input. I’m still early and trying to understand how useful this is outside my own workflow.
I’d really appreciate feedback on:
Whether this overlaps with how you test agents today
Failure modes you’ve seen that aren’t covered
Whether “chaos testing for agents” is a useful framing, or if this should be thought of differently
Repo: https://github.com/flakestorm/flakestorm
Docs are admittedly long.
Thanks for taking a look.

相關文章