Claude Code 301 · Field Notes

Stop Babysitting Your Agents

As models get smarter, engineers spend more time staring at a screen, playing glorified QA tester. This lesson hands that time back.

The Premise

Your tooling was built for humans.

Linters, IDEs, formatters, type checkers, compilers — nearly all of it was written to make human teams faster. But humans aren't writing most of the code anymore. Agents are. So the toolchain deserves a fresh look.

GOOD NEWS

Most tools port over

Formatters, linters, and symbol servers serve agents just as well as they serve us. Claude wields them effectively.

BAD NEWS

We have blind spots

Humans make silent assumptions about a codebase that an agent simply doesn't share.

The question to carry through this lesson: what does an agent need from your codebase that a human takes for granted?

Table Stakes

Three things to do first.

This is advanced material. Before the real techniques land, these prerequisites should already be in place.

The Roadmap

Three layers that stack.

Each builds on the one before it. Taken together, they let you work in a way we simply haven't worked before.

LAYER 01

Verification

Teach Claude to check its own work, so its output becomes reliable.

LAYER 02

Multi-Claude

Once it's reliable, run many Claudes at once with confidence.

LAYER 03

Background loops

Take your keyboard out of the hot path entirely. Claude works while you don't.

Layer 01 · Verification

Teach Claude to check itself.

Think about the last feature you shipped. How did you verify it — not just the final output, but each iteration along the way? Most software work breaks into the same sequence.

How humans verify

  1. Design and write the code.
  2. Build it: run compilers and type checkers. Fail? Loop back and edit.
  3. Run the executable: a Docker container, a CLI, a web server.
  4. Check side effects: open a browser, scan the logs, inspect the database state.
  5. Run unit tests for regressions; add new tests for the feature.
  6. Deploy to staging. Or, if you're brave, straight to prod.

How Claude verifies

The exact same playbook works for Claude. Nothing about this sequence is uniquely human. The only requirement is giving Claude the right tools and the right instructions to walk through it the way you would.
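One way to hand Claude that sequence is a single entry point that runs the stages in order and stops at the first failure. The sketch below is a minimal illustration, not part of the lesson: `run_stages`, `StageResult`, and the placeholder commands are hypothetical names, and the placeholders only stand in for your real build, test, and smoke-check commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str
    ok: bool
    output: str


def run_stages(stages):
    """Run (name, argv) stages in order and stop at the first failure."""
    results = []
    for name, argv in stages:
        proc = subprocess.run(argv, capture_output=True, text=True)
        ok = proc.returncode == 0
        results.append(StageResult(name, ok, proc.stdout + proc.stderr))
        if not ok:
            break  # later stages tell you nothing until this one passes
    return results


# Placeholder commands so the sketch runs anywhere (assumption: replace
# each argv with your project's real build, test, and smoke commands).
EXAMPLE_STAGES = [
    ("build", [sys.executable, "-c", "print('compiled')"]),
    ("unit tests", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("smoke check", [sys.executable, "-c", "print('never reached')"]),
]

if __name__ == "__main__":
    for result in run_stages(EXAMPLE_STAGES):
        print(f"{'PASS' if result.ok else 'FAIL'}  {result.name}")
```

Stopping early keeps the feedback short: the agent sees the one stage that failed, together with its output, and nothing from stages that ran on a broken build.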


The loop is what makes it go.

A loop is an autonomous circuit you complete for Claude so it can hill-climb toward a success criterion. Give it tools to write code and to verify that code, and it cycles (write → check → debug → write again) until it reaches a success state. The pull request that lands in your inbox is higher quality because it has already passed those checks.
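Stripped to its skeleton, the loop is a retry cycle in which each failed check feeds the next attempt. The sketch below is a toy model under stated assumptions: `verification_loop`, `toy_write`, and `toy_check` are hypothetical names, and the "code" is just a string, whereas a real loop would have Claude editing files and running actual tools.

```python
from typing import Callable, Optional, Tuple


def verification_loop(
    write: Callable[[Optional[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Tuple[bool, str, int]:
    """Alternate write and check until check passes or rounds run out."""
    feedback: Optional[str] = None
    candidate = ""
    for round_no in range(1, max_rounds + 1):
        candidate = write(feedback)       # write (or rewrite) using feedback
        ok, feedback = check(candidate)   # verify; feedback explains failures
        if ok:
            return True, candidate, round_no
    return False, candidate, max_rounds


def toy_write(feedback: Optional[str]) -> str:
    # The first attempt forgets the handler; the retry acts on the feedback.
    if feedback is None:
        return "<button>Sign up</button>"
    return "<button onClick={signUp}>Sign up</button>"


def toy_check(candidate: str) -> Tuple[bool, str]:
    if "onClick" in candidate:
        return True, "ok"
    return False, "clicked the button, nothing happened: no onClick handler"


if __name__ == "__main__":
    print(verification_loop(toy_write, toy_check))
```

The `max_rounds` cap is a design choice worth keeping in any real version: a loop that cannot reach its success state should stop and report, not spin forever.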

REAL EXAMPLE · "Make the signup button work"
Write code → Build app → Click button in browser → Read the logs → Fix the bug → PR that works
Claude saw the button did nothing, dug into the logs, found the cause, and fixed it — on its own.

Verification comes in flavors.

The core concept never changes — give Claude the tools and instructions to enter a loop. Get that right and these all merge into one capability.

FLAVOR

UX / Frontend

Drive the browser, prove a fix visually.

FLAVOR

Backend

Hit endpoints, check responses and state.

FLAVOR

End-to-end

The whole app, infrastructure included.


What a frontend loop actually needs

Concretely, it boils down to four moves:

  1. Run the application. Spin up the dev server — npm run start, or whatever yours is.
  2. Use the server. Open a browser. Sid's pick is the Claude in Chrome MCP via /chrome; Playwright and other browser-control MCPs work too.
  3. Prove it works. Screenshot before the fix and after the fix; confirm the state actually changed.
  4. Unblock it. Clear the common blockers: auth (give Claude an identity to log in) and state (pre-seed data, e.g. store inventory).

None of this is novel — state-setup scripts are old hat in end-to-end testing. The twist: give Claude access to those scripts and keep them dynamic, not prescriptive, so it can do far more than a static script ever could.
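"Dynamic, not prescriptive" can be as simple as a seed script that takes parameters, so Claude can ask for whatever state the current task needs. The sketch below is an assumption-heavy illustration: `build_seed`, the flag names, and the JSON shape are all hypothetical, and a real script would load the data into your database instead of printing it.

```python
import argparse
import json


def build_seed(user: str, items: int, stock: int) -> dict:
    """Describe the state to load: one test user plus `items` products."""
    return {
        "user": {"name": user, "role": "tester"},
        "inventory": [
            {"sku": f"item-{n:03d}", "stock": stock}
            for n in range(1, items + 1)
        ],
    }


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(description="Emit seed data as JSON.")
    parser.add_argument("--user", default="claude-test")
    parser.add_argument("--items", type=int, default=3)
    parser.add_argument("--stock", type=int, default=10)
    args = parser.parse_args(argv)
    payload = json.dumps(build_seed(args.user, args.items, args.stock), indent=2)
    print(payload)  # a real script would write this to the database instead
    return payload


if __name__ == "__main__":
    main()
```

With flags like these, Claude can seed an out-of-stock store (`--stock 0`) to test an empty-inventory path without anyone writing a new fixture for it.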


Package it as a skill.

A skill stores arbitrary context about a topic (here, your verification loop), so you can hand it to a teammate or to your future self. The best part: you can make it self-improving.

Tell the skill to edit itself every time Claude hits a blocker, and it becomes self-documenting. The Claude Code team runs exactly one verification skill, explicitly told to keep documenting itself. Hit a wall once, and the next person never hits it.
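In practice the instruction lives in the skill text and Claude does the editing itself. To make the idea concrete, here is a rough sketch of what "document every blocker once" amounts to. Everything in it is hypothetical: `record_blocker`, the "Known blockers" heading, and the entry format are illustrative choices, and the sketch assumes that section sits at the end of the file.

```python
from pathlib import Path

SECTION = "## Known blockers"  # hypothetical heading, assumed to be last


def record_blocker(skill_path: Path, blocker: str, fix: str) -> bool:
    """Append a blocker and its fix to the skill file; skip duplicates."""
    text = skill_path.read_text(encoding="utf-8") if skill_path.exists() else ""
    entry = f"- {blocker}: {fix}"
    if entry in text:
        return False  # already documented; nothing to add
    if SECTION not in text:
        separator = "\n\n" if text.strip() else ""
        text = text.rstrip("\n") + separator + SECTION + "\n"
    if not text.endswith("\n"):
        text += "\n"
    skill_path.write_text(text + entry + "\n", encoding="utf-8")
    return True
```

The duplicate check is what keeps a self-editing file from bloating: the same wall gets written down once, however many sessions run into it.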

The demo: a confetti loop in Monkeytype

The live demo used Monkeytype — an open-source typing tester (TypeScript + Express, with MongoDB and Redis). A realistic full-stack app. The arc: drive it by hand with the Chrome MCP, distill the session into a skill, then build a new feature and let the skill verify it.

claude code — monkeytype
> spin up the dev server
Dev server already running on :3000 ✓
> /chrome
status: enabled · extension installed ✓
> use the chrome mcp to make sure the front end is working
navigated localhost:3000 · typed test · changed settings → expert ✓
> take everything we learned and put it into a skill file
wrote .claude/demo-verification/skill.md
  1. bring up the stack
  2. load Chrome MCP
  3. smoke test ✓
> every time I mistype, show a confetti animation — use the skill to verify your work
wrote feature… oxlint errors: 2 → fixed → re-verified
🎉 confetti fires on mistype. loop closed.

That's the loop in the wild: write code, hit lint errors, fix them, re-verify — circling until it reaches a good state. Set one up yourself and you'll likely be running in 5–10 minutes.

Hold Claude's hand once. Then let it fold the lesson into a skill it can reuse forever.

Layer 02 · Multi-Claude

Run many at once — without losing your mind.

Once each Claude is reliable, you can parallelize. The catch: every open session eats your attention, and attention is scarce. Past four or five live sessions, most people stall.

So the whole game is protecting attention. Four surfaces help:

GUI

Desktop app

A sidebar with every session across every surface — local terminal, cloud, all git repos. Pin, rename, and color them so you remember what each was doing.

TERMINAL

Claude agents

Love the terminal? Run claude agents. Same sidebar idea, sorted by attention needed — anything blocked on a prompt floats to the top.

CLOUD

Claude on the web

Decouple sessions from your laptop. Walk between meetings, lose your wifi — it runs on. Start at claude.ai/code.

PHONE

Remote control ★

Run /remote-control and steer any session from your phone. It buzzes when Claude needs input — answer from the car.

The old way: a tmux window manager with four panes, each on its own git worktree. It works — but it's a lot to manage. Claude agents replaced it.
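The "sorted by attention needed" idea is easy to picture as a sort key: sessions waiting on you come first, and among those, the one that has waited longest. This is only a sketch of the concept, not a description of how any Claude surface orders its list. The `Session` fields and the state names are assumptions.

```python
from dataclasses import dataclass

# Lower number = needs you sooner. These states are illustrative only.
PRIORITY = {"blocked": 0, "failed": 1, "running": 2, "done": 3}


@dataclass
class Session:
    name: str
    state: str
    idle_minutes: int


def by_attention(sessions):
    """Blocked sessions first; within a state, the longest-waiting first."""
    return sorted(sessions, key=lambda s: (PRIORITY[s.state], -s.idle_minutes))


if __name__ == "__main__":
    sessions = [
        Session("docs", "running", 2),
        Session("auth-fix", "blocked", 1),
        Session("confetti", "blocked", 9),
        Session("ci", "done", 30),
    ]
    for s in by_attention(sessions):
        print(f"{s.state:8} {s.name}")
```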

Layer 03 · Background Loops

Take your keyboard out of the loop.

Even multi-Claude isn't enough — you still have to spin up each session with a goal in mind. But a lot of engineering isn't features or bugs. It's bookkeeping that needs a loop, just not you in it.

DRUDGERY

Babysitting PRs

Review comments, merge conflicts, CI failures. Twenty PRs a day eats hours.

DRUDGERY

Updating docs

Velocity goes up; docs have to keep pace.

DRUDGERY

Triage & CI

Monitoring feedback, keeping the build green.

/loop — run a prompt on an interval

The /loop command wakes a session on a schedule, runs your prompt, and — if your CLAUDE.md and tools are set up — figures out the rest itself.

claude code
> /loop 10m and babysit my open PRs
waking every 10 min · resolving conflicts · nudging CI · clearing review comments…
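Conceptually, an interval loop is a parsed duration plus a repeated call. The sketch below illustrates that idea only and makes no claim about how /loop is implemented: `parse_interval` and `run_on_interval` are hypothetical names, and the `s` and `h` suffixes are assumptions beyond the `10m` shown in the transcript.

```python
import re
import time

UNITS = {"s": 1, "m": 60, "h": 3600}  # assumed suffixes; only "m" is attested


def parse_interval(text: str) -> int:
    """Turn '10m', '30s', or '6h' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if not match:
        raise ValueError(f"unrecognized interval: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def run_on_interval(task, interval: str, max_runs: int, sleep=time.sleep):
    """Call task() max_runs times, sleeping `interval` between runs."""
    seconds = parse_interval(interval)
    results = []
    for run in range(max_runs):
        results.append(task())
        if run < max_runs - 1:
            sleep(seconds)  # no sleep after the final run
    return results
```

Passing `sleep` as a parameter lets you dry-run the schedule instantly, for example `run_on_interval(check_prs, "10m", 3, sleep=print)`, where `check_prs` stands for whatever task you want repeated.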

Routines — /loop, but remote

Routines are /loop running in the cloud, in the same containers as Claude Code on the web. Set one up from the web or desktop app under the Routines tab. Triggers come in two kinds:

TIME-BASED
Update the team docs every day. A fresh Claude Code session opens with a fixed prompt.
EVENT-BASED
Scan incoming issues & feedback and post a digest to Slack every six hours.

Routine work that doesn't need you — handled, on a schedule, without a keyboard.

The Payoff

Stack all three.

Reliable verification makes parallelism safe. Parallelism makes background loops worth running. Together they form a system that does real work while you're nowhere near the keyboard.

Spend your attention on the work you actually care about. Delegate the rest — with high reliability and high confidence.
