One of the most instructive things I read this year wasn’t about how clever AI agents have become. It was about how much boring plumbing it takes to keep them useful. A developer ran Claude Code essentially 24/7 for six months on a 2015 MacBook Air with 8GB of RAM and logged 857 autonomous sessions and hundreds of self-merged pull requests — on hardware most of us would call e-waste.
The headline number is fun. The lessons underneath it are the genuinely valuable part, because they generalize to anyone who wants agents working on their tasks while they sleep.
The intelligence was never the bottleneck
Almost none of the six months was spent making the model smarter. It was spent on infrastructure: keeping processes alive with launchd, watching for silent failures, guarding scope so the agent didn’t wander, and solving the same blank-slate problem over and over — the agent can’t remember what it did yesterday, or two hours ago. Every long-running setup ends up reinventing the same primitives: a durable queue of work, a way to hand off context between sessions, and guardrails so a confident-but-wrong agent can’t do damage.
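To make the first two of those primitives concrete, here is a minimal sketch of a durable queue with context handoff, using SQLite from Python's standard library. It is not the original author's framework: the table layout and the function names (`open_queue`, `add_task`, `claim_next`, `hand_off`) are assumptions made for illustration, and it assumes a single worker, so the claim step is not safe against concurrent agents.

```python
import sqlite3

def open_queue(path=":memory:"):
    """Open (or create) the task database. Pass a file path for real durability."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued',"
        " notes TEXT NOT NULL DEFAULT '')"
    )
    db.commit()
    return db

def add_task(db, title):
    """Add a task to the queue and return its id."""
    cur = db.execute("INSERT INTO tasks (title) VALUES (?)", (title,))
    db.commit()
    return cur.lastrowid

def claim_next(db):
    """Claim the oldest queued task, or return None if the queue is empty."""
    row = db.execute(
        "SELECT id, title, notes FROM tasks"
        " WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE tasks SET status = 'in_progress' WHERE id = ?", (row[0],))
    db.commit()
    return {"id": row[0], "title": row[1], "notes": row[2]}

def hand_off(db, task_id, note):
    """Append a note and requeue the task so a fresh session can resume it."""
    db.execute(
        "UPDATE tasks SET notes = notes || ? || char(10), status = 'queued'"
        " WHERE id = ?",
        (note, task_id),
    )
    db.commit()
```

A session that runs out of time calls `hand_off` with what it learned. The next session calls `claim_next` and gets the task together with those notes, so it does not depend on remembering anything.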
Never trust the agent’s own summary
The sharpest lesson: an agent’s self-report is not evidence. As the author put it, Claude “will give you a confident, fluent answer that is partially fabricated.” So instead of trusting “yes, I finished that,” the framework extracts what actually happened from the deterministic record — the tool calls in the session logs — and judges completion on that.
This is the same truth that shows up in every serious agent setup: the model over-reports success, so something outside the model has to decide what’s really done. A real test suite. A separate verifier. Or a human reviewing the result before it counts. Pick at least one, or your “completed” tasks are fiction.
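As a rough sketch of judging completion from the record rather than from the summary, the snippet below reads a session log and counts a task as done only if the last recorded test run succeeded. The log format is an assumption, not Claude Code's actual schema: it supposes one JSON object per line, with hypothetical fields `type`, `tool`, `command`, and `exit_code`. The helper names are likewise made up for this example.

```python
import json

def tool_calls(log_text):
    """Parse a JSONL session log and return only the tool-call events."""
    calls = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than guess at them
        # "type" == "tool_call" is a hypothetical field for this sketch
        if isinstance(event, dict) and event.get("type") == "tool_call":
            calls.append(event)
    return calls

def verified_done(log_text, test_command="pytest"):
    """True only if the last recorded run of the test command exited with 0."""
    runs = [
        c for c in tool_calls(log_text)
        if c.get("tool") == "bash"
        and str(c.get("command", "")).startswith(test_command)
    ]
    return bool(runs) and runs[-1].get("exit_code") == 0
```

The agent's prose never enters the decision. A log in which the agent says "all tests pass" but the last `pytest` call exited non-zero, or in which no test run was recorded at all, comes back as not done.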
What this means if you just want your list cleared
Most people don’t want to hand-build 49 launchd services. But you can take the architecture and skip the toil, because the three things that mattered all map to features, not scripts:
- A durable queue. The agent needs a real, queryable list of assigned work to pull from — not a markdown file. A task list with an MCP server and an API is exactly that.
- Context handoff. Each task carries its own notes and history, so an agent starting fresh has what it needs without remembering the last session.
- A verification gate. Work comes back as “ready for review,” and you — or your tests — decide it’s done. Agents never check their own box.
That’s the whole idea behind running agents against Lume: assign tasks to Claude Code or Codex, let them work from a real queue, and review what comes back. You don’t need a homelab — you need a list the agents can pull from and a gate they can’t skip.
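A gate the agent can't skip comes down to a small rule about who may move a task where. The sketch below is illustrative only and is not Lume's API: the status names, the `agent` and `reviewer` roles, and the `transition` function are all assumptions.

```python
from enum import Enum

class Status(Enum):
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    READY_FOR_REVIEW = "ready_for_review"
    DONE = "done"

# Allowed transitions per actor. Note that "agent" has no path to DONE.
ALLOWED = {
    "agent": {
        (Status.QUEUED, Status.IN_PROGRESS),
        (Status.IN_PROGRESS, Status.READY_FOR_REVIEW),
    },
    "reviewer": {
        (Status.READY_FOR_REVIEW, Status.DONE),
        (Status.READY_FOR_REVIEW, Status.QUEUED),  # rejected: send it back
    },
}

def transition(current, target, actor):
    """Return the new status, or raise if this actor may not make the move."""
    if (current, target) not in ALLOWED.get(actor, set()):
        raise PermissionError(
            f"{actor} may not move a task from {current.value} to {target.value}"
        )
    return target
```

The `reviewer` role can be a person or an automated check such as a test suite. The design choice that matters is that the set of moves open to the agent stops at "ready for review."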
The 2015 MacBook proved the ceiling is higher than the hardware. The catch is that the ceiling is made of plumbing, and the plumbing is worth getting right. The full first-person account is well worth the read.