Jonathan Haas
March 26, 2026 · 6 min read

The Real Work of Orchestrating AI Coding Agents

Three concurrent coding agents taught me the actual bottleneck: not prompting, but assignment, evidence, review, and release control.

#ai #ai-agents #developer-tools #engineering-management #automation #claude #codex

Filed under Agents and evals, Security and systems. The AI work I keep returning to: orchestration, feedback loops, measurable behavior, and where autonomy earns or loses permission.

Last night I ran three AI coding agents simultaneously -- two OpenAI Codex sessions and one Claude Code session -- across four repositories. They shipped 20+ pull requests, addressed 10+ GitHub issues, wrote two 400-line technical design documents, and handled their own merge conflicts, CI failures, and code review feedback.

This is not a demo. This is what my actual Tuesday looked like.

The lesson was not "AI writes code now." Everyone can see that. The lesson was that useful agents need an operating model: assignment, context, review, evidence, escalation, and release authority. Without that layer, parallel agents are just faster ways to create ambiguous work.

This is the concrete version of the shift I described in async code generation: once code can happen in the background, the scarce skill becomes operating the work.

The Setup

Three tmux sessions on a Linux dev-desktop, accessed over SSH from my Mac. Each agent working a different repo:

  • Cerebro (Codex): A Go graph database engine. Removing a Snowflake dependency, implementing deployment profiles, wiring NATS change capture.
  • Platform (Codex): A Python/FastAPI evaluation platform. Fixing test suites, resolving merge conflicts, addressing RBAC security findings.
  • Maestro (Claude Code): A TypeScript agent framework. Implementing unified thinking-level abstractions, extension system groundwork.

A fourth session -- Hopper, our Next.js marketing site -- ran earlier and churned out 19 PRs of blog posts and documentation pages before I shut it down.

What the Orchestrator Actually Does

I wrote zero lines of code. My job was:

Directing via issues. Every task started as a GitHub issue with an implementation comment. Not "fix the tests" -- a specific breakdown: which files, which patterns to follow, which branch name. The agents read issue comments with gh issue view and work from there.

Checking in. Every few minutes: tmux capture-pane to read what each agent is doing. Are they stuck? Making progress? Burning context on a dead end? This is the core loop. Check, direct, queue, check.
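The check loop is scriptable. A minimal sketch of one sweep, assuming the tmux session names match the project names above:

```shell
# status_sweep: print the tail of each agent's tmux pane in one pass.
# Session names are assumptions from this setup; substitute your own.
status_sweep() {
  local sessions=("cerebro" "platform" "maestro")
  local s
  for s in "${sessions[@]}"; do
    printf '=== %s ===\n' "$s"
    # capture-pane -p prints to stdout; -t targets the session's active pane
    tmux capture-pane -pt "$s" | tail -n 10
  done
}
```

Ten lines per pane is usually enough to tell "making progress" from "stuck in a retry loop."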

Queue management. Codex supports Tab to queue follow-up prompts. After "fix the CI failures," I Tab-queue "then merge the dependabot PRs" and "then pick up issue #7521." This creates a pipeline of work that flows without me touching it.

Unblocking. When an agent hits a sandbox permission wall, a merge conflict it can't resolve, or a disk-full error -- I intervene. Fix the infrastructure problem, then let the agent continue.

Merging. I watch CI, merge green PRs with gh pr merge --squash --admin, and rebase branches that fall behind main. The agents create PRs; I decide when they ship.
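The merge step can be guarded the same way. A sketch using `gh pr checks --watch`, which waits for checks to finish and exits non-zero if any fail; `merge_green` is a hypothetical helper, not a gh command:

```shell
# merge_green: ship a PR only after its checks pass.
# Assumes gh is authenticated against the target repo.
merge_green() {
  local pr="$1"
  if gh pr checks "$pr" --watch; then
    gh pr merge "$pr" --squash --admin
  else
    echo "PR $pr has failing checks; leaving it open" >&2
    return 1
  fi
}
```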

The Control Plane

The work only held together because every agent action had somewhere to land.

A task ledger. GitHub issues were the durable source of intent. The issue said what to change, how to verify it, and what not to touch. The agent session was disposable. The task record was not.

Execution lanes. Each agent owned a repo, language, and branch. That reduced merge conflicts and made review easier because every PR had a clear operating context.

Evidence gates. A PR was not "done" because the agent said it was done. It needed tests, CI, review-thread resolution, and a human decision to merge.

Release authority. Agents could propose and repair. I kept the final production-changing decision. That boundary matters. It is the difference between autonomy and accountable delegation.

What Actually Works

Issue comments as coordination layer

The single most effective pattern: post detailed implementation guidance as a GitHub issue comment before the agent starts. This persists beyond the session. When an agent dies at 12% context and I restart it fresh, I just say "read issue #711" and it picks up exactly where the guidance says to start. No prompt reconstruction. No lost context.
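A brief like that can be posted with `gh issue comment`. A sketch with an illustrative body; the paths, branch name, and verify command are placeholders, not the real repo layout:

```shell
# brief_issue: post a durable implementation brief on the issue the
# agent will read. Everything in the body below is illustrative.
brief_issue() {
  local issue="$1"
  gh issue comment "$issue" --body "$(cat <<'EOF'
Implementation notes:
- Files: internal/store/ (follow the existing adapter pattern)
- Branch: feat/remove-snowflake
- Verify: go test ./internal/store/...
- Do not touch: migrations/
EOF
)"
}
```

Because the brief lives on the issue, it survives agent restarts for free.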

Backend-specific agents

Giving each agent a single repo and language worked far better than asking one agent to context-switch between Go, Python, and TypeScript. The agents internalize project conventions -- import patterns, test structures, commit message formats -- and stay consistent. Context-switching between codebases burns tokens on re-learning.

Tab-queued pipelines

Codex's Tab queue is underrated. A well-loaded queue means the agent transitions smoothly from "fix CI" to "merge PRs" to "start feature work" without me sending a new prompt each time. I front-loaded 3-4 queued tasks per session and checked in less frequently.

Kill without sentiment

A session at 14% context with three queued tasks needs a restart, not encouragement. Kill it, restart with --dangerously-bypass-approvals-and-sandbox, give it a one-paragraph summary of where things stand, and Tab-queue the remaining work. The new session at 100% context will outperform the struggling one in minutes.
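The restart ritual is mechanical enough to script. A sketch, with the session name and handoff summary as inputs; the codex flag is the one from the workflow above:

```shell
# restart_agent: replace an exhausted session with a fresh one and hand
# it a one-paragraph summary of where things stand.
restart_agent() {
  local session="$1" handoff="$2"
  tmux kill-session -t "$session" 2>/dev/null || true
  tmux new-session -d -s "$session" \
    "codex --dangerously-bypass-approvals-and-sandbox"
  # Type the handoff into the new pane, then submit it with Enter
  tmux send-keys -t "$session" "$handoff" Enter
}
```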

What Does Not Work

Agents do not proactively check for review feedback

This was the biggest operational gap. Cursor Bugbot posted high-severity findings on PRs -- RBAC permission downgrades, data race conditions, SQL injection patterns -- and the agents shipped and moved on. They never circled back. I had to manually audit every open PR for unresolved comments and then interrupt agents to address them.
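That manual audit can be automated. A sketch that counts unresolved review threads on a PR through GitHub's GraphQL API (assumes `gh` is authenticated and `jq` is installed):

```shell
# unresolved_count: number of unresolved review threads on a PR.
# Owner, repo, and PR number are inputs; caps at the first 100 threads.
unresolved_count() {
  local owner="$1" repo="$2" pr="$3"
  gh api graphql \
    -f query='
      query($owner: String!, $repo: String!, $pr: Int!) {
        repository(owner: $owner, name: $repo) {
          pullRequest(number: $pr) {
            reviewThreads(first: 100) { nodes { isResolved } }
          }
        }
      }' \
    -f owner="$owner" -f repo="$repo" -F pr="$pr" \
  | jq '[.data.repository.pullRequest.reviewThreads.nodes[]
         | select(.isResolved | not)] | length'
}
```

Run it across open PRs and anything non-zero goes back onto an agent's queue.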

Agents will take the path of least resistance on deployment

Without explicit GitOps instructions, agents will kubectl apply directly, push to main without a branch, or run terraform apply from their shell. You must state the deployment model in the first prompt and again as an issue comment. "ArgoCD watches k8s/, all changes via git" needs to be said every time.

Long reasoning phases look like hangs

Codex enters 5-minute "Working..." phases where it is planning but producing no visible output. The first few times I interrupted these. Wrong move -- the agent was actually making good decisions about how to structure a complex change. The tell: if the context percentage is dropping, it is working. If it is frozen, it is stuck.

Pre-commit hooks are the biggest time sink

More agent time was lost to pre-commit hook failures than to actual logic bugs. Git hooks that run linters, type checkers, OpenAPI validators, and UUID audits add 2-5 minutes per commit attempt. When the hook fails on something unrelated to the agent's changes, the agent enters a fix-hook, retry, fail, fix-hook loop. Bypassing with --no-verify and validating manually was always faster.
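The bypass-and-validate pattern, as a sketch, assuming the repos use the pre-commit framework:

```shell
# commit_then_validate: skip hooks on the commit itself, then run the
# full hook suite once, outside the agent's retry loop.
commit_then_validate() {
  git commit --no-verify -m "$1"
  pre-commit run --all-files
}
```

One hook run per batch of commits instead of one per attempt is where the time comes back.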

The Economics

In one session:

  • Cerebro: 8 PRs (7 merged), 8 issues addressed, 1 Snowflake dependency fully removed
  • Platform: 5 PRs merged, test suite fixed, 2 design documents posted (750+ lines total), merge conflicts resolved on a 45-file PR
  • Hopper: 19 PRs merged, 107 blog posts, 29 documentation pages
  • Maestro: 2 PRs, 3 issues worked, unified thinking-level abstraction shipped

The constraint was not agent capability. It was my orchestration bandwidth. Three agents turned out to be the sweet spot: I could run the check-direct-queue loop frequently enough that no agent sat idle or went off-track for long. With four, I started dropping context on what each was doing.

The Meta-Lesson

The value of AI coding agents is not in any individual output. A single PR from Codex is fine. It is roughly what a junior engineer would produce with clear direction.

The value is in parallelism with direction. Three agents working simultaneously, each with a clear issue, a queued pipeline, and periodic course correction, produce more in a few hours than a solo developer produces in a week. Not because the agents are better -- because they never context-switch to Slack, never take a coffee break, never lose motivation on the third test fix in a row.

The orchestrator role -- the person managing the agents -- is the new bottleneck. The skill is not prompt engineering. It is the same skill that makes a good engineering manager: clear task definition, fast feedback loops, knowing when to intervene and when to let things run, and never doing the work that can be delegated.

The agents are the easy part. The system around them is the hard part.

That is the real product surface for AI work: not a chat box, not a prettier IDE, not a clever prompt. The product is the control plane that lets a person safely delegate work, inspect the result, correct the system, and ship with confidence.
