
The Code Factory Needs a Foreman

Round-the-clock AI agents are real. The unattended factory that ships without a human gate is not.

Hundreds of coding agents now run in parallel for days at a time and produce working software — that part is no longer speculative. The humans can even go home: the agents work through the night and leave a stack of pull requests by morning. But they go home only after writing the specs, and the factory needs them back at their desks to review what it built. Pull the foreman out of that loop and the factory doesn’t run unattended — it just piles up code no one has checked.

The interesting question is no longer "can agents write the code?" They can. The real question is who sets the task and who checks the result, and that work is getting more expensive, not less.

The Swarm Is Real

In January 2026, Cursor pointed hundreds of concurrent agents at a single goal and let them run for about a week. They wrote over a million lines of code and built a working web browser from scratch. Letting the agents coordinate as equals did not work. A strict hierarchy did: planners broke the work down, workers implemented it, and a judge agent decided at the end of each cycle whether to keep going. The biggest lever was not the model or the structure. It was the prompts.
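
The hierarchy described above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not Cursor's implementation: the names `run_hierarchy`, `plan`, `work`, `judge`, and `max_cycles` are hypothetical, the workers run one after another rather than in parallel, and the three roles are plain functions standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    description: str
    done: bool = False
    output: str = ""


def run_hierarchy(
    goal: str,
    plan: Callable[[str, List[Task]], List[Task]],
    work: Callable[[Task], str],
    judge: Callable[[str, List[Task]], bool],
    max_cycles: int = 10,  # hypothetical safety cap so the loop always ends
) -> List[Task]:
    """Run planner -> workers -> judge cycles until the judge says stop."""
    history: List[Task] = []
    for _ in range(max_cycles):
        # Planner: break the goal into tasks, given what is already done.
        tasks = plan(goal, history)
        # Workers: implement each task (sequential here for clarity).
        for task in tasks:
            task.output = work(task)
            task.done = True
            history.append(task)
        # Judge: decide at the end of the cycle whether to keep going.
        if not judge(goal, history):
            break
    return history


# Toy stand-ins for the three roles; real ones would call a model.
def toy_plan(goal: str, history: List[Task]) -> List[Task]:
    return [Task(f"{goal}: step {len(history) + 1}")]


def toy_work(task: Task) -> str:
    return f"implemented {task.description}"


def toy_judge(goal: str, history: List[Task]) -> bool:
    return len(history) < 3  # keep going until three tasks are done


result = run_hierarchy("build parser", toy_plan, toy_work, toy_judge)
for task in result:
    print(task.output)
```

Note that the loop contains no prompts at all. In this sketch, everything that decides output quality lives inside `plan`, `work`, and `judge`, which matches the observation that the prompts mattered more than the structure.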

Simon Willison cloned and compiled the result on his own machine and rendered real web pages with it. Having predicted AI-built browsers for around 2029, he concluded he "may have been off by three years." This is not a demo reel. Enterprises now report agents that run on their own for hours inside a multi-million-line codebase and land a complex change in a single run. Queueing work overnight and waking up to a stack of pull requests is now an ordinary workflow, not a party trick.

The adoption numbers track the capability. A majority of enterprises with engineering teams already run at least one coding agent in production, and most developers now use AI as part of their daily work. The factory floor is real, it is busy, and it runs around the clock.

But the Foreman Never Left

Look closely at every working example and you find the same shape: a human is orchestrating, not typing. An engineer decomposes the work, dispatches five, ten, twenty agents across repositories, and then reviews and merges what comes back. The agents do the implementation. The human sets the intent, owns the architecture, and gates the output.

That gate is not temporary scaffolding waiting to be automated away. The same teams shipping with agents say they can fully hand off only a small share of their tasks, even though they use AI across most of their work. And one of the predictions in the report behind those figures is not more autonomy, but agents that know when to stop and ask for help. Escalation, not independence, is the feature people actually want.

So the bottleneck moved. It used to look like writing code. Now it is plainly reviewing it. The factory produces output far faster than any human can read it, and reading it is the part that still requires judgement.

The foreman is not a leftover the automation hasn’t reached yet. The foreman is the product.

Generation Was Never the Constraint

Writing code was never the scarce resource for a competent team — understanding whether the code is correct was. Agents have flooded the cheap half of the equation and left the expensive half untouched.

The trust gap is wide, and well measured. Most developers say they do not fully trust the accuracy of AI-generated code. Teams that adopt agents watch pull requests pile up — more of them, and bigger — while review time climbs to match. Net out that extra review burden and the real gain is far smaller than the generation speed suggests. Trust, not generation, has become the real bottleneck. The throughput is real. The value you keep is whatever survives review.

The instinctive answer is to have AI review the AI. That helps with the mechanical layer — style, obvious bugs, missing tests — but it raises a sharper question. Does adding a second model add trust, or just dress up the lack of it? An agent that approves another agent's work has not added judgement. It has put a more confident-sounding stamp on the same unchecked code.

The DORA research lands the point bluntly: AI is an amplifier, not a fixer. Strong engineering systems get faster. Weak ones get more visibly unstable, sooner.

The Economics Invert

Run the arithmetic and a strange thing happens. By the hour, an always-on agent fleet is far cheaper than human engineers. An agent works every hour of the year; a salaried engineer works a fraction of them. But price it per line of trusted code and the picture flips. All that output still has to be verified, and verification is paid for in senior salaries.
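
The arithmetic can be made concrete with a rough sketch. Every number below is a hypothetical placeholder chosen for round results, not a measurement, and the function names are ours. Swap in your own fleet cost, review rate, and acceptance rate before drawing any conclusion.

```python
HOURS_PER_YEAR = 24 * 365  # an always-on agent works all 8,760 hours


def cost_per_hour(annual_cost: float, hours_worked: float) -> float:
    """Plain hourly rate: what the bill looks like before verification."""
    return annual_cost / hours_worked


def cost_per_trusted_line(
    generation_cost: float,
    lines_generated: int,
    review_hours_per_kloc: float,
    reviewer_hourly_rate: float,
    acceptance_rate: float,
) -> dict:
    """Cost of each line that survives review, with the review share split out."""
    review_cost = lines_generated / 1000 * review_hours_per_kloc * reviewer_hourly_rate
    trusted_lines = lines_generated * acceptance_rate
    total = generation_cost + review_cost
    return {
        "per_trusted_line": total / trusted_lines,
        "review_share": review_cost / total,
    }


# Hypothetical inputs, for illustration only.
agent_hourly = cost_per_hour(annual_cost=43_800, hours_worked=HOURS_PER_YEAR)
engineer_hourly = cost_per_hour(annual_cost=180_000, hours_worked=1_800)

breakdown = cost_per_trusted_line(
    generation_cost=43_800,        # the agent fleet's annual bill
    lines_generated=1_000_000,
    review_hours_per_kloc=2.0,     # senior time to verify 1,000 lines
    reviewer_hourly_rate=engineer_hourly,
    acceptance_rate=0.5,           # share of output that survives review
)

print(f"agent: {agent_hourly:.2f}/h, engineer: {engineer_hourly:.2f}/h")
print(f"per trusted line: {breakdown['per_trusted_line']:.4f}")
print(f"review share of total cost: {breakdown['review_share']:.0%}")
```

With these placeholder inputs the agent is twenty times cheaper by the hour, yet review accounts for roughly four fifths of the cost of each trusted line. Whether the comparison actually flips for a given team depends entirely on its own review rate and acceptance rate.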

And the size of the bill is not even the real story. How much it swings is. A salary is a fixed, predictable number you can budget. Token spend is metered and spiky, and it can blow through its budget in a quarter. One major chipmaker's leadership says engineers should be spending tokens worth a large share of their salary every year; one experimental agent fleet ran up a seven-figure API bill in a single month, across hundreds of billions of tokens.

Companies know how to plan headcount. They do not yet know how to plan an engineer with a metered fuel cost that no one can forecast.

The Apprenticeship Paradox

The factory's quality depends entirely on the skill of the foreman — and that is precisely the skill it is eroding at the entry level. Studies of junior engineers who lean hard on AI show a measurable drop in how well they understand their own code — and a sharper drop in their ability to debug it. The ones who handed code generation fully to the AI were the worst at explaining how their own systems worked.

At the same time the bottom rungs of the ladder are being pulled up. Entry-level technical hiring has fallen sharply, and organisations are flattening — middle layers thin out as agents absorb coordination work. The pattern is consistent. Agents take over the work juniors used to do, while making senior judgement count for more. The org chart loses its bottom and middle before it loses its top.

The judgement that decides whether agent output is safe to ship is senior judgement, built over years of doing the work by hand. If the entry-level path that produces it closes, where do the next foremen come from?

When the Factory Has No Foreman

The real danger zone is where nobody is qualified to read the output at all. When non-developers ship production software built entirely by agents, the failure is predictable — because the thing that breaks is invisible to the person at the controls.

One launch became the cautionary tale. Its founder proudly noted he had written not a single line of code. Researchers soon found the platform exposing roughly 1.5 million API keys and tens of thousands of user emails — all because one database setting was never switched on. The application worked in every sense its builder could perceive. Working was never the thing that needed checking. Veracode found that AI-generated code carries a security flaw in close to half of all cases — a flaw a non-developer has no way to see and no framework to catch.

The boundary is not "can they build it." Agents can build it. The boundary is whether anyone is motivated to break it. A foreman-less factory is genuinely useful for internal tools, prototypes, and throwaway automation — code with no real users, no personal data, and no adversaries. The moment real users, real data, or real attackers enter the picture, the absent reviewer becomes the whole risk.

Building the Gate

If verification is the constraint, then the work worth doing is making verification cheaper and more trustworthy — not chasing more generation. That is a practitioner's problem, and it has practitioner's answers.

It starts with writing the intent down so it lasts. We run spec-driven workflows: the specification, not a closed chat window, is the source of truth the agent works from. The wider industry is making the same shift — the spec becomes the lasting artefact, and the code is derived from it. On top of that sits a set of checks the agents cannot talk their way past. Static analysis runs on every change. End-to-end and visual tests catch what a confident model misses. And we keep the architecture small and reviewable, so a person can actually read each change instead of rubber-stamping it. We run this across a portfolio of production codebases — the only honest way to know it holds up. A gate that forces a person to read and understand each change is also where the next foreman is trained: the apprenticeship the factory is eroding, rebuilt at the point of review.
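
The checks described above can be sketched as a single merge gate. This is an illustrative sketch under stated assumptions, not our production tooling: the `Change` fields, the `blocking_reasons` function, and the `MAX_REVIEWABLE_LINES` threshold are all hypothetical, and in practice each boolean would be fed by a real static-analysis run, test suite, or review system.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_REVIEWABLE_LINES = 400  # hypothetical ceiling for a change a person can really read


@dataclass
class Change:
    spec_id: Optional[str]        # the specification this change was derived from
    lines_changed: int
    static_analysis_passed: bool
    e2e_tests_passed: bool
    visual_tests_passed: bool
    human_approved: bool


def blocking_reasons(change: Change) -> List[str]:
    """Return every reason the change may not merge; an empty list means it may."""
    reasons: List[str] = []
    if not change.spec_id:
        reasons.append("no specification linked")
    if not change.static_analysis_passed:
        reasons.append("static analysis failed")
    if not (change.e2e_tests_passed and change.visual_tests_passed):
        reasons.append("end-to-end or visual tests failed")
    if change.lines_changed > MAX_REVIEWABLE_LINES:
        reasons.append("change too large to review; split it")
    if not change.human_approved:
        reasons.append("awaiting human review")
    return reasons


small = Change("SPEC-12", 120, True, True, True, True)
huge = Change("SPEC-12", 2500, True, True, True, False)

print(blocking_reasons(small))
print(blocking_reasons(huge))
```

The design choice worth noting is that human approval is one condition among several, not an override. Automated checks can block a change that a person approved, and a change that passes every automated check still waits for a person to read it.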

None of that removes the foreman. It is what the foreman uses. The teams that win the next few years will not be the ones that bought the most agents — they will be the ones that built the gate well enough to let the factory run hard against it.

Drowning in Agent Output?

We help engineering teams build the verification gate — specs, static analysis, and tests — that makes AI throughput safe to ship.