Here is the most expensive exchange in AI-assisted work. You: “Are you done?” The agent: “Yes.” You ship it. The build is broken, the test was never written, and the function it “refactored” doesn’t compile. The agent wasn’t lying, exactly; it genuinely believed it. That’s the problem.
Premature “done” is one of the most common failure modes people hit when they let agents run, and it has a simple root cause: an agent is a bad judge of its own work.
Self-grading inflates success
This is a measurable, repeatable bias, not bad luck, and the teams building autonomous loops design around it explicitly. Claude Code’s own goal-checking uses a separate, faster model to verify the completion condition each turn, precisely because, in their words, “a model evaluating its own output consistently over-reports success.” The verifier, not the worker, decides what counts as done.
Boris Cherny, who created Claude Code, has made the same point from the other direction: giving Claude a way to verify its work, he says, can “2-3x the quality of the final result.” The community’s answer is the “Stop hook” pattern: a gate that blocks the agent from declaring done until tests pass and the intent is actually met.
Three ways to put a real judge in the loop
You need something other than the agent’s own say-so to define “done”:
- Tests / type-checkers. A green suite is an honest signal an agent can’t talk its way around. Best where it applies.
- A separate verifier model. A second pass whose only job is to check the work against the goal, not produce it.
- A human review gate. For anything fuzzy or consequential, a person decides. Slower, but the standard for “done” is yours.
The common thread: completion is decided outside the thing that did the work.
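To make the first option concrete, here is a minimal sketch of a completion gate in Python. It treats the agent's claim as a proposal and lets external commands decide. The names `Verdict`, `command_passes`, and `decide_done` are hypothetical, invented for this illustration, and do not come from any particular tool.

```python
import subprocess
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Verdict:
    done: bool
    reason: str


def command_passes(cmd: Sequence[str], timeout_s: float = 600.0) -> bool:
    """Run an external check (test suite, type-checker) and report whether it exited 0."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        # A check that cannot start or never finishes counts as a failure.
        return False
    return result.returncode == 0


def decide_done(agent_claims_done: bool, checks: Sequence[Sequence[str]]) -> Verdict:
    """Hypothetical gate: the agent's claim is only a proposal; the checks decide."""
    if not agent_claims_done:
        return Verdict(False, "agent has not proposed completion")
    for cmd in checks:
        if not command_passes(cmd):
            return Verdict(False, "check failed: " + " ".join(cmd))
    return Verdict(True, "all external checks passed")
```

A call might look like `decide_done(True, [["pytest", "-q"], ["mypy", "."]])`, assuming those tools are installed in the project. The design choice worth noticing is that a check which fails to run is treated the same as a check that fails, so the gate cannot be passed by accident.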
The to-do list as the gate
The cleanest place to enforce this is the task list itself. In Lume, an agent can move a task to “ready for review” and no further; it can’t set its own work to done. That one rule turns “are you done?” from a question whose answer you have to trust into a state you control. Tests and verifiers slot in front of it; the human gate is the backstop.
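The rule can be sketched as a small permission check on status changes. This is an illustration only, not Lume's actual implementation or API: the `Status`, `Role`, `Task`, and `TransitionDenied` names are all assumptions made up for the example.

```python
from enum import Enum


class Status(Enum):
    TODO = "todo"
    IN_PROGRESS = "in_progress"
    READY_FOR_REVIEW = "ready_for_review"
    DONE = "done"


class Role(Enum):
    AGENT = "agent"
    HUMAN = "human"


# Statuses each role is allowed to set. Only a human can set DONE.
ALLOWED = {
    Role.AGENT: {Status.TODO, Status.IN_PROGRESS, Status.READY_FOR_REVIEW},
    Role.HUMAN: set(Status),
}


class TransitionDenied(Exception):
    """Raised when an actor tries to set a status it is not permitted to set."""


class Task:
    def __init__(self, title: str) -> None:
        self.title = title
        self.status = Status.TODO

    def set_status(self, new_status: Status, actor: Role) -> None:
        # The check lives in the task list, not in the agent's prompt,
        # so the agent cannot talk its way past it.
        if new_status not in ALLOWED[actor]:
            raise TransitionDenied(
                f"{actor.value} may not set status to {new_status.value}"
            )
        self.status = new_status
```

With this in place, `task.set_status(Status.READY_FOR_REVIEW, Role.AGENT)` succeeds, while `task.set_status(Status.DONE, Role.AGENT)` raises `TransitionDenied`. The agent's “yes” can only ever land the task in the review queue.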
Agents will keep saying yes. The trick is to build a system where their “yes” is a proposal, not a verdict — so a broken build is a thing you catch in review, not a thing you find in production. Read how the review gate works when you assign tasks to an agent.