
Diagnosis · Fix Generation · Sandbox

Knowing isn't enough — the gap between detecting a bug and having a PR ready

Datadog detects. Sentry stacks them up. OpenTelemetry traces. The real cost isn't in seeing the error; it's in the four hours between seeing and merging.

May 06, 2026 · 5 min read

Modern teams detect everything. Datadog records every exception. Sentry captures every stack trace. OpenTelemetry traces every service. The problem is no longer knowing that something broke.

The problem is what comes after.

The diagnosis tax

Take a typical bug — a NullPointerException on a checkout route. The alert arrived. The engineer opens the stack trace. And then the real work begins.

First: locate the file in the eight-hundred-thousand-line monorepo. Located. Second: understand the code. It was written two years ago by someone who's no longer at the company. There's a strange Optional<> pattern. Why? No comment. Time to read git blame. Third: reproduce. You can't reproduce locally without a specific database environment. Fourth: hypothesis. Fifth: write the fix. Sixth: write the tests. Seventh: review security — this fix touches an authenticated route; did it open any vector? Eighth: open the PR, ask for review, wait.

The actual code change, in the end, is often three lines. But the path to those three lines took four hours.

Where the time really goes

Classic DevOps research (DORA, Accelerate) already tracks the time between commit and deploy. DORA calls it lead time for changes, and teams often loosely call it cycle time. It is a well-known metric. But the one that matters for perceived productivity is crueler: time from alert to merge. That number is rarely measured. When it is, it usually sits in the range of days, not hours.
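Measuring alert-to-merge time needs nothing more than two timestamps per incident. Here is a minimal sketch in Python; the incident records are made up for illustration, and the function name `alert_to_merge` is an assumption, not something taken from a particular tool.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (alert fired, fix merged), as ISO 8601 strings.
incidents = [
    ("2026-04-01T09:00:00", "2026-04-01T13:00:00"),  # 4 hours
    ("2026-04-02T10:00:00", "2026-04-04T10:00:00"),  # 2 days
    ("2026-04-03T08:30:00", "2026-04-10T08:30:00"),  # 7 days
]


def alert_to_merge(alerted_at: str, merged_at: str) -> timedelta:
    """Elapsed time between the alert firing and the fix being merged."""
    return datetime.fromisoformat(merged_at) - datetime.fromisoformat(alerted_at)


# Convert each duration to hours so the median is a plain number.
hours = [
    alert_to_merge(alerted, merged).total_seconds() / 3600
    for alerted, merged in incidents
]
median_hours = median(hours)
print(f"median alert-to-merge: {median_hours:.1f} h")
```

The median is used here rather than the mean because one bug that stays open for weeks would otherwise dominate the figure.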

And the reason isn't a lack of technical competence. It's context load. Every bug requires re-entering a piece of code that the engineer may never have seen before, reconstructing the history of decisions that led to that state, and still acting carefully to avoid introducing a regression. That's dense work, and nobody does it well in short bursts, which is why bugs stay open for weeks.

Closing the gap, not skipping it

Triage doesn't try to hide the gap. It tries to close it with full transparency.

The diagnosis stage reads the source code and generates an explicit root-cause hypothesis with evidence: which code path was exercised, which input triggered the failure, why the NullPointerException showed up now and not before. The fix-generation stage proposes the minimum change, validated against eight OWASP categories (among them secret leak, injection, path traversal, XSS, deserialization, dynamic eval, and new-import detection), plus entropy checks for sensitive strings and AST syntactic validation.
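Two of those checks are easy to picture in isolation. The sketch below is not Triage's implementation; it only illustrates the general techniques. The function names, the 20-character minimum, and the 4.0 bits-per-character threshold are all assumptions chosen for the example. Note also that Python's `ast` module parses only Python source, so a Java codebase like the one in the NullPointerException example would need a Java parser for the same step.

```python
import ast
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token: str, min_length: int = 20, threshold: float = 4.0) -> bool:
    """Flag long, high-entropy tokens. Both cutoffs are illustrative assumptions."""
    return len(token) >= min_length and shannon_entropy(token) >= threshold


def parses_cleanly(source: str) -> bool:
    """Return True if the proposed source is syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# A repetitive string carries no information per character; a random-looking
# token of the same length scores much higher.
print(looks_like_secret("aaaaaaaaaaaaaaaaaaaaaaaa"))
print(looks_like_secret("q7Zp2LxV9mK4tR8wB1nJ"))
print(parses_cleanly("def f(:\n    pass"))
```

An entropy check of this kind is a heuristic: it catches random-looking tokens such as API keys, and it will also flag harmless hashes or identifiers, which is one reason the result is shown to a human reviewer rather than acted on automatically.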

The sandbox-test stage runs the proposed tests inside a hardened Docker container — read-only filesystem, no network, capped memory. If it passes, the PR is delivered as a draft. It is never auto-merged.
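The three restrictions named above map directly onto standard `docker run` flags. The sketch below builds such a command in Python under stated assumptions: the image name `triage-sandbox:latest`, the `pytest -q` test command, and the 512 MB cap are hypothetical, and the helper `sandbox_command` is an illustration, not Triage's code.

```python
from typing import List


def sandbox_command(image: str, test_command: List[str], memory: str = "512m") -> List[str]:
    """Build a `docker run` invocation with the restrictions described above."""
    return [
        "docker", "run",
        "--rm",               # remove the container when the run ends
        "--read-only",        # mount the container's root filesystem read-only
        "--network", "none",  # no network access
        "--memory", memory,   # cap memory
        image,
        *test_command,
    ]


# Hypothetical image and test command, for illustration only.
cmd = sandbox_command("triage-sandbox:latest", ["pytest", "-q"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`. Many test runners need somewhere to write temporary files, so a real setup would likely pair `--read-only` with a small `--tmpfs /tmp` mount.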

The engineer opens the PR and sees three lines, with the hypothesis, the diagnosis evidence, the test result, and the security audit. They review in minutes what would have taken hours to write, then approve or reject. Control stays human.

Automation that respects the engineer's judgment

The difference between an aggressive AI tool and an honest operational one is exactly that: the first tries to replace the engineer; the second gives time back. Detecting is easy. Diagnosing and proposing with evidence — without closing the loop on its own — is where the product actually works.

See how Triage delivers draft PRs →