Skip to main content

Command Palette

Search for a command to run...

I Fixed the Same Bug Three Times Before I Realized the Problem Was Not the Bug — It Was the Missing Test

Fixing bugs without writing regression tests creates test debt. The debt compounds silently — and AI coding agents will not stop it for you.

Updated
6 min read
I Fixed the Same Bug Three Times Before I Realized the Problem Was Not the Bug — It Was the Missing Test

The Short Version

For weeks, my AI code reviewer flagged the same classes of bugs across different pull requests. My AI coding agent would fix them every time. Then they came back in the next branch. The problem was not the agent. The problem was that every fix was branch-local — a one-off patch with no durable prevention test. I audited every review comment from the prior weeks, built a matrix mapping each failure mode to a specific regression test, and added a linting gate. The same bugs stopped recurring.

What I Was Doing

I am building a native desktop app with AI coding agents. I am not an engineer. The agents handle implementation, pull requests, and code review. I direct the work and verify the output.

The setup was productive. Features shipped. Bugs got found and fixed. The AI code reviewer caught real issues — stale data overwriting fresh data, overlapping operations clobbering each other, cached views surviving after their source data was removed. Good catches. Every time, the agent would write a fix and the branch would merge.

This went on for weeks. It felt like progress.

What Went Wrong

The same classes of bugs kept appearing in new pull requests.

Not the same exact bugs — the same types. Different files, different features, but the same failure patterns:

  • Stale data loads finishing late and overwriting newer state.

  • Cached detail views surviving after the list refreshed and the selected item was gone.

  • Overlapping start, cancel, and restart paths clobbering each other's final state.

  • Paired database writes leaving partial data when one side failed and the other did not roll back.

  • Mapping loops pulling stale data from the wrong scope, or losing track of which entity they were building.

  • Data contracts between components drifting out of alignment — what one side expected versus what another actually sent.

Every one of these had been caught before. Every one had been fixed before. And every one came back.

The Compounding Problem

When I looked closer, I found I was making two mistakes.

First: every fix the agent wrote was branch-local. It fixed the bug in the current pull request — and stopped there. It never wrote a regression test, the kind of test that says, "if this exact failure happens again, the build fails." Without that test, the fix was a patch. The bug could walk right back in through the next branch, in a different file, and nothing would flag it.

Second: I confused a green pipeline with real coverage. The existing tests passed. Validation ran. Checkmarks everywhere. But none of those tests checked whether the specific failure modes that kept recurring had prevention gates. I was reading "tests ran" as "we are protected," and those are not the same thing.

There was a quieter problem too: no linting gate. Style drift could pass review silently because nothing required linting to run first. But that was secondary. The real compounding issue was the absence of durable regression tests. The gap between "the fix is in" and "this bug cannot come back" was getting wider every week, and no green pipeline was going to close it.

What I Changed

I audited every review comment from the prior weeks. I pulled every finding the AI code reviewer had flagged — not just the current branch, all of them. If a bug class had been caught before, I wrote it down.

I built a comment-to-test matrix. For each finding, I recorded: which pull request, which file, what bug class, whether a prevention test existed, and what the test should cover. The matrix made the gap visible. Several failure modes had no test at all. Several had tests too broad to catch the specific failure they were supposed to prevent.

I added the missing regression tests. Each test was narrow: one failure mode, one named test. For stale data loads overwriting fresher state. For overlapping start and cancel paths. For data contract mismatches between components. For the mapper losing track of which entity it was building.

I added a linting gate. Now linting runs before tests. Style drift cannot accumulate silently — if the code fails linting, the pipeline stops before testing starts. This was the easy fix. The harder one was below.

I changed the definition of done. The new rule: never close a review-cleanup task based only on code fixes or a broad claim that tests were added. If any row in the matrix has no prevention test, add the narrowest regression first. Close only when every proven failure mode maps to a named test.

The Rule I Use Now

Every proven failure mode must map to a named regression test before the branch is called complete.

Not "we added tests." Not "the pipeline passes." A specific test with a specific name that fails if that specific bug class reappears. If a bug was real enough to fix, it is real enough to prevent permanently.

The matrix makes this visible. Instead of guessing whether coverage exists, I can look at a row and see: this failure mode has no test. Add one. Next.

Takeaways

  • Bug fixes without regression tests are patches, not prevention. The same bugs will return — in different files, through different paths, with the same failure mode.

  • "Tests run" is not the same as "tests exist for the things that keep breaking." A green pipeline tells you nothing about whether specific failure modes are guarded.

  • A comment-to-test matrix turns invisible test debt into a visible checklist. Every row without a test is a gap you can close.

  • The AI agent will fix the same bug as many times as you ask it to. It does not track recurrence. It is your job to notice the pattern and make the fix permanent.

  • Linting gates help — they catch style drift before it reaches review — but they are not a substitute for targeted regression tests.

If you are building with AI agents and you recognize a bug you have seen before, stop. Do not fix it again. Map the failure mode to a named regression test first. That one question — "does this failure mode have a prevention test?" — will save you more weeks than any amount of review-after-the-fact. Ask it before you close the branch, and the same bugs will finally stop coming back.

More from this blog

C

Codex Field Notes | Building Apps With Codex and AI Coding Agents

2 posts

Practical field notes from a non-coder building real apps with Codex and AI coding agents. I write about what works, what breaks, project drift, guardrails, and the lessons that make agent-built software easier to trust.