Verification Debt: When Generative AI Speeds Change Faster Than Proof
Generative AI is accelerating software development and producing a form of "verification debt": change arrives faster than the ability to prove it safe and correct under real-world conditions. This imbalance creates unknown risk, because while change is abundant, evidence of safety becomes scarce.

Communications of the ACM
The gap between the software you released and what you have demonstrated is unknown risk.

Software delivery has always lived with an imbalance. It is easier to change a system than to demonstrate that the change is safe under real workloads, real dependencies, and real failure modes. Before generative AI, this imbalance was partly moderated by a practical constraint: human attention. People wrote the code, people reviewed it, automated tests ran on every change, and a limited rollout exposed new behavior to real traffic before it went wide.
Generative AI can loosen that tether in teams that lean on it heavily. When the marginal cost of drafting changes drops, more changes can arrive for review, more interactions can demand integration testing, and production may see more change, even when each change is small. The risk is not that teams become careless. The risk is that what looks correct on the surface becomes abundant while evidence remains scarce.
This matters because adoption has crossed into habit. In the Stack Overflow 2025 Developer Survey, 84% of respondents said they use or plan to use AI tools in their development process, and 51% of professional developers use them daily. When a tool becomes daily habit, it stops being a preference and becomes a property of the delivery system. It reshapes what gets reviewed, what gets tested, what gets merged, and what gets shipped.
What verification debt means
A useful name for what accumulates in this mismatch is verification debt. It is the gap between what you released and what you have demonstrated to be safe and resilient, with evidence gathered under conditions that resemble production. Technical debt is a bet about the future cost of change. Verification debt is unknown risk you are running right now.
Here, verification does not mean theorem proving. It means evidence from tests, staged rollouts, security checks, and live production signals that is strong enough to block a release or trigger a rollback. It is uncertainty about runtime behavior under realistic conditions, not code cleanliness, not maintainability, and not simply missing unit tests.
If you want to spot verification debt without inventing new dashboards, look at proxies you may already track. How quickly do you detect regressions after a canary rollout (a small rollout to a limited slice of traffic)? How often do you roll back after expanding traffic? How many rollouts expand without stable live signals in view? Track them explicitly as time to detect a regression after canary, time to a safe rollback, and the share of expansions that happen without a defined signal panel visible to the person approving the expansion. Those measures are imperfect, but they tell you whether evidence is keeping pace with output.
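To make the proxies concrete, here is a minimal sketch of how a team might compute them from its own rollout records. The `Rollout` record, its field names, and the event sources are assumptions for illustration, not a real deployment API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

@dataclass
class Rollout:
    # Hypothetical record of one canary rollout; all field names are assumptions.
    canary_start: datetime
    regression_detected: Optional[datetime]  # first failing live signal, if any
    rollback_complete: Optional[datetime]    # when traffic was fully reverted, if rolled back
    expanded_with_signals: bool              # was a signal panel visible at expansion?

def debt_proxies(rollouts: list) -> dict:
    """Summarize the three proxies: time to detect after canary,
    time to a safe rollback, and the share of 'blind' expansions."""
    minutes_to_detect = [
        (r.regression_detected - r.canary_start).total_seconds() / 60
        for r in rollouts if r.regression_detected
    ]
    minutes_to_rollback = [
        (r.rollback_complete - r.regression_detected).total_seconds() / 60
        for r in rollouts if r.regression_detected and r.rollback_complete
    ]
    blind = sum(1 for r in rollouts if not r.expanded_with_signals)
    return {
        "median_minutes_to_detect": median(minutes_to_detect) if minutes_to_detect else None,
        "median_minutes_to_rollback": median(minutes_to_rollback) if minutes_to_rollback else None,
        "share_expanded_blind": blind / len(rollouts) if rollouts else None,
    }
```

The point of the sketch is not the arithmetic but the discipline: each proxy is derived from events a delivery pipeline already emits, so no new dashboard is required to start watching the trend.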
What the evidence is starting to show
The texture of failure changes as well. In an AI-assisted workflow, the artifacts can look clean, and explanations can sound confident, even when the underlying behavior is wrong in ways that only appear under real interactions. A clean diff can reduce perceived risk. That means subtle regressions can travel further before they are caught. The surprise tends to arrive in the seams, the slow path, the integration edge case, the timeout cascade.
There is early evidence that the bottleneck is already shifting from writing to verifying. METR ran a randomized controlled trial with experienced open source developers working on their own repositories and found that allowing the use of early-2025 AI tools increased completion time by 19% on average. Before starting tasks, developers forecast that AI would reduce completion time by 24%, and after completing tasks, they estimated AI had reduced it by 20%. This gap does not identify where the time went. It is consistent with a hypothesis many teams recognize: drafting can get faster while integration and validation time grows.
Security is the most checkable example of why what looks correct on the surface is not proof. In Veracode’s evaluation of AI-generated code, across more than 100 large language models and multiple languages including Java, Python, C#, and JavaScript, 45% of code samples failed security tests and introduced OWASP Top 10 vulnerabilities. Veracode also reported that tests for Cross-Site Scripting (CWE-80) failed in 86% of relevant code samples. This does not mean AI always writes insecure code. It means something more useful for working engineers: output can look polished, work functionally, and still be unsafe. When generation scales faster than validation, verification debt can translate into security debt once the system meets real users and real attackers.
Why verification does not scale like generation
A reasonable question follows. If we have generative AI, why do we not have verification AI that scales at the same pace and neutralizes the problem?
Because verification is not the same kind of problem as generation.
Generation produces plausible artifacts. Verification needs an oracle, a reliable way to know the correct output for a given input in a given environment. In many systems, the oracle is incomplete because requirements are ambiguous, context dependent, or only revealed at runtime. The oracle problem is surveyed in detail by Barr and colleagues.
In other words, you can generate tests quickly, but without a reliable oracle, you cannot automatically know whether outputs are correct in all the cases that matter. Verification remains tied to execution under representative conditions and to the authority to change decisions when signals change.
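A small illustration of what a partial oracle looks like: even when the exact correct output is unknown, metamorphic relations can still falsify an implementation. The sketch below uses `math.sin` only as a stand-in for a system under test; the relations, not the function, are the point, and the tolerance is an assumption.

```python
import math

def check_sin_metamorphic(x: float, tol: float = 1e-9) -> bool:
    """A partial oracle: we may not know the exact correct value of sin(x)
    for an arbitrary x, but any correct implementation must satisfy known
    metamorphic relations. A violation is proof of a bug; passing is only
    evidence, never a full correctness proof."""
    # Relation 1: sin(pi - x) == sin(x)
    r1 = abs(math.sin(math.pi - x) - math.sin(x)) <= tol
    # Relation 2: sin(-x) == -sin(x)
    r2 = abs(math.sin(-x) + math.sin(x)) <= tol
    return r1 and r2
```

This is the asymmetry in miniature: generating inputs is cheap, but the relations only constrain behavior; they do not tell you the right answer for the cases the relations do not cover.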
AI can help with parts of verification. It can suggest tests, propose edge cases, and summarize logs. It can raise verification capacity. But it cannot conjure missing intent, and it cannot replace the need to exercise the system and treat the resulting evidence as strong enough to change the release decision. Review is helpful. Review is evidence of readability and intent. It is not, by itself, evidence of correct behavior under load, in the presence of failure, and under adversarial conditions.
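One way to keep evidence coupled to the release decision is to make expansion mechanically contingent on live signals. The sketch below is a hypothetical canary gate: the signal names and thresholds are assumptions, and the key design choice is that absent evidence blocks expansion rather than passing by default.

```python
from dataclasses import dataclass

@dataclass
class LiveSignals:
    # Hypothetical live signals for a canary slice; names and units are assumptions.
    error_rate: float        # fraction of failed requests
    p99_latency_ms: float    # 99th-percentile latency in milliseconds
    sample_requests: int     # how many requests the slice has actually served

def may_expand(signals: LiveSignals,
               max_error_rate: float = 0.01,
               max_p99_ms: float = 500.0,
               min_sample: int = 10_000) -> bool:
    """Allow traffic expansion only when evidence is both present and healthy."""
    if signals.sample_requests < min_sample:
        # Not enough evidence yet: block and wait, do not guess.
        return False
    return (signals.error_rate <= max_error_rate
            and signals.p99_latency_ms <= max_p99_ms)
```

The failure mode this guards against is the "blind expansion" described earlier: a gate that defaults to allow when signals are missing converts absence of evidence into approval.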
Three controls that reduce verification debt
So the operational question is how to keep evidence coupled to decisions as output accelerates. Three controls reduce verification debt without requiring a new org chart.
The shift to measure
Verification debt is not a slogan. It is what happens when evidence cannot keep up with change. Generative AI did not create that imbalance, but it can amplify it by increasing the supply of plausible change. The path forward is to change the currency of trust. In an era where drafting is cheap, the scarce resource is justified confidence. Treat proof as a release time requirement, and measure what matters: time from change to justified confidence, not changes per week.

Kostakis Bouzoukas is a London-based technology leader focused on release gating, incident learning, and production signals that make accountability measurable. He writes in a personal capacity and has no financial ties to the vendors cited.
© 2026 Copyright held by the owner/author(s).