My AI agent kept breaking old code until I made it prove the change
A practical story about adding verification to AI-assisted coding so every agent run ends with evidence, not hope.
The problem
The agent finished the task and said it was done.
I wanted to believe it.
The page loaded. The new section looked fine. Then I clicked a different page and found the break. A shared component had changed. A route still built, but the layout was off. The agent had solved the visible task and quietly damaged something nearby.
That is the uncomfortable part of AI-assisted coding: the agent can sound confident before the project is safe.
What I used to do
I used to review the final answer more than the final state.
If the agent said “I updated the component and everything should work,” I treated that like progress. But “should work” is not evidence. It is a vibe with better grammar.
The fix was to stop asking the agent if it was done and start asking it to prove what changed.
The new rule
Every spec gets a verification section before coding starts.
For a static Astro project, the minimum proof is simple:
npm run build
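That catches broken imports, invalid content, and bad routes. It also catches TypeScript errors if the build script runs astro check before astro build; astro build on its own transpiles TypeScript without type-checking it. It does not catch everything, but it catches enough that skipping it is silly.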
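To keep the build result as evidence rather than a claim, the command can be run by a small wrapper that records the exit code and the end of the output. This is a minimal sketch, not a fixed part of the workflow: the names `CheckResult` and `run_check` are made up for illustration, and it assumes the command (for example `npm`) is on the PATH.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Evidence from one verification command (hypothetical structure)."""
    command: list[str]
    exit_code: int
    output_tail: str  # last lines of combined stdout and stderr

    @property
    def passed(self) -> bool:
        return self.exit_code == 0


def run_check(command: list[str], tail_lines: int = 20) -> CheckResult:
    """Run one verification command and keep its exit code and output tail."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors stay in order
        text=True,
        check=False,  # a failing build is a result to record, not an exception
    )
    tail = "\n".join(proc.stdout.splitlines()[-tail_lines:])
    return CheckResult(command, proc.returncode, tail)
```

Calling `run_check(["npm", "run", "build"])` would then give a result whose `passed`, `exit_code`, and `output_tail` can be pasted into the final note as is.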
For UI work, I add behavior checks:
- Does the page render at the route I changed?
- Does the link go to the right place?
- Does the copy button still copy the right text?
- Does mobile layout still make sense?
- Did the agent touch files outside the spec?
Now the agent has to end with evidence.
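The last check in that list, whether the agent touched files outside the spec, is the easiest one to automate. Here is a rough sketch, assuming the spec lists allowed paths as glob patterns and the changed files come from something like `git diff --name-only`. The function name, the patterns, and the file paths are all hypothetical.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the spec's allowed patterns.

    Note: fnmatch's `*` also matches `/`, so `dir/*` covers nested files too.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical spec scope and a hypothetical list of changed files,
# e.g. the output of `git diff --name-only` split into lines.
allowed = ["src/pages/pricing.astro", "src/components/pricing/*"]
changed = [
    "src/pages/pricing.astro",
    "src/components/pricing/Table.astro",
    "src/components/Header.astro",
]

print(out_of_scope(changed, allowed))  # ['src/components/Header.astro']
```

A non-empty result does not have to block the change. It only means the agent owes a reason for each listed file.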
The happy path
The workflow became:
1. Write a small spec.
2. Add verification steps to the spec.
3. Let the agent implement one unit.
4. Run the checks.
5. Review the diff against the spec.
6. Commit only when the evidence matches the claim.
This changed the tone of the work.
Before, I was reading AI output and trying to decide if I trusted it. After, I was reading build output and inspecting a smaller diff. That is much easier.
The prompt I use
Read the project context and the current unit spec.
Implement only the files needed for this unit.
When done, run every verification step from the spec.
In your final note, include:
- files changed
- commands run
- any checks you could not complete
- any files you touched outside the spec, with a reason
The last line matters. It makes scope creep visible.
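If the final note is captured as structured data instead of free text, the comparison between claim and spec can be partly mechanical. The sketch below is one possible shape, not a required format: `FinalNote`, `review_note`, and the example step names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FinalNote:
    """The four items the prompt asks for (hypothetical structure)."""
    files_changed: list[str]
    commands_run: list[str]
    checks_not_completed: list[str] = field(default_factory=list)
    out_of_spec_files: dict[str, str] = field(default_factory=dict)  # path -> reason


def review_note(note: FinalNote, required_steps: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing is missing."""
    problems = []
    # Every verification step must be either run or openly reported as skipped.
    for step in required_steps:
        if step not in note.commands_run and step not in note.checks_not_completed:
            problems.append(f"step neither run nor reported as skipped: {step}")
    # Out-of-spec edits are allowed only with a stated reason.
    for path, reason in note.out_of_spec_files.items():
        if not reason.strip():
            problems.append(f"out-of-spec file without a reason: {path}")
    if not note.files_changed:
        problems.append("no files listed as changed")
    return problems


# Hypothetical note from an agent run.
note = FinalNote(
    files_changed=["src/pages/pricing.astro"],
    commands_run=["npm run build"],
    out_of_spec_files={"src/components/Header.astro": ""},
)
for problem in review_note(note, ["npm run build", "check /pricing on mobile"]):
    print(problem)
```

This only checks that the note is complete. Whether the claims in it are true still depends on the build output and the diff.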
What verification does not solve
Verification does not make AI code automatically good.
A build can pass with bad UX. Tests can pass while the product feels wrong. A screenshot can look fine while the copy is confusing.
But verification changes the default from “the agent says it is done” to “the project gave us evidence.”
That is a better place to start review.
Questions people ask
Should the agent run commands itself?
Yes, when the command is safe and part of the repo workflow. For this project, npm run build is expected. For commands that deploy, publish, delete, spend money, or message people, stop and ask first.
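The "stop and ask first" rule can be approximated with an allowlist: anything not explicitly known to be safe needs approval. This is a conservative sketch under assumptions. The allowlist contents and the name `needs_approval` are hypothetical, and real projects would tune both.

```python
import shlex

# Hypothetical allowlist of command prefixes for this kind of repo.
SAFE_PREFIXES = {
    ("npm", "run", "build"),
    ("git", "status"),
    ("git", "diff"),
}

# Characters that let one command line chain or embed another command.
SHELL_METACHARACTERS = set(";|&<>`$\n")


def needs_approval(command_line: str) -> bool:
    """True unless the command is a plain invocation of an allowlisted prefix."""
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return True  # e.g. "npm run build && npm publish"
    try:
        tokens = tuple(shlex.split(command_line))
    except ValueError:
        return True  # unbalanced quotes: refuse rather than guess
    return not any(tokens[: len(safe)] == safe for safe in SAFE_PREFIXES)


print(needs_approval("npm run build"))                 # False
print(needs_approval("npm publish"))                   # True
print(needs_approval("npm run build && npm publish"))  # True
```

Defaulting to "ask" means a new or unusual command costs one confirmation, while a mistaken deploy or delete is never automatic.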
What if the build fails?
Good. You found the problem before the user did. Give the agent the error and keep the fix inside the same spec boundary.
Should I commit after every passing build?
Commit after a passing build and a human diff review. The build is evidence, not permission.
The lesson
A confident agent is nice.
A verified change is better.