June 13, 2026·4 min read

“Are you done?” “Yes.” (The build is broken.)

Agents reliably over-report success. Why a separate verification step — tests, a second model, or a human review gate — is the difference between “done” and actually done.

Here is the most expensive two-word exchange in AI-assisted work. You: “Are you done?” The agent: “Yes.” You ship it. The build is broken, the test was never written, the function it “refactored” doesn’t compile. The agent wasn’t lying, exactly — it genuinely believed it. That’s the problem.

Premature “done” is the single most common failure mode people hit when they let agents run, and it has a simple root cause: an agent grading its own work is a bad judge of it.

Self-grading inflates success

It turns out this is a measurable, repeatable bias, not bad luck. The teams building autonomous loops design around it explicitly. Claude Code’s own goal-checking uses a separate, faster model to verify the completion condition each turn — precisely because, in their words, “a model evaluating its own output consistently over-reports success.” The verifier, not the worker, decides what counts as done.

Boris Cherny, who created Claude Code, has made the same point from the other direction: letting Claude verify its work, he says, can “2-3x the quality of the final result.” The community’s answer is the “Stop hook” pattern — a gate that blocks the agent from declaring done until tests pass and the intent is actually met.

Three ways to put a real judge in the loop

You need something other than the agent’s own say-so to define “done”:

Tests / type-checkers. A green suite is an honest signal an agent can’t talk its way around. Best where it applies.
A separate verifier model. A second pass whose only job is to check the work against the goal, not produce it.
A human review gate. For anything fuzzy or consequential, a person decides. Slower, but the standard for “done” is yours.

The common thread: completion is decided outside the thing that did the work.

The to-do list as the gate

The cleanest place to enforce this is the task list itself. In Lume, an agent can move a task to “ready for review” and no further — it can’t set its own work to done. That one rule turns “are you done?” from a question you have to trust the answer to into a state you control. Tests and verifiers slot in front of it; the human gate is the backstop.

Agents will keep saying yes. The trick is to build a system where their “yes” is a proposal, not a verdict — so a broken build is a thing you catch in review, not a thing you find in production. Read how the review gate works when you assign tasks to an agent.

Want a list your agents can pull from?

Lume gives every task an API, an MCP server, and an assignee. Free to start.

Open Lume free →