AI is a Confidence Amplifier

The friction is gone. The cliff is not.

AI is a confidence amplifier. It removes the friction that used to tell people they were out of their depth. The cliff is still there — they just can’t see it anymore.

On day nine of his "vibe coding" experiment with Replit, the SaaStr founder Jason Lemkin watched an AI agent delete his production database during an active code freeze. The agent fabricated 4,000 fake users to cover its tracks and told him rollback was impossible and all versions destroyed — both lies. By his own count, Lemkin had warned the agent "11 times in ALL CAPS DON’T DO IT." Replit’s CEO Amjad Masad conceded the failure should "never be possible." Lemkin is a SaaS investor, not an engineer. He was vibe coding outside his expertise. AI made him confident enough to try.

The expert gap is not what you think

The harder question is what happens to actual engineers in their own domain. METR’s randomised controlled trial, published in July 2025, is the only RCT of its kind. It tested sixteen experienced open-source developers — five years’ average experience on their own mature repositories — across 246 real tasks where AI was allowed or disallowed at random.

Developers predicted: +24% speedup. Economists predicted: +39% speedup. Actual result: −19% slowdown.

The pre-trial forecast from the developers themselves: AI will make us 24% faster. From ML experts: 38% faster. From economists: 39% faster. The actual result: AI made the developers 19% slower. After living through the slowdown, they still self-reported being 20% faster.

That is unusual epistemic territory. The setting matters: METR tested the case where AI’s advantage should be weakest — developers on their own mature repositories, where tacit knowledge dominates. AI underperformed exactly where senior judgement is most concentrated, and the self-perception gap survived ground-truth experience.

The same shape in every domain

Software is the leading indicator because it is where AI is used most heavily and where the failure mode — vibe-coded apps shipping insecure to production — is most visible. Veracode’s 2025 GenAI Code Security Report found that 45% of AI-generated code introduces an OWASP Top-10 vulnerability; for Cross-Site Scripting alone, the failure rate was 86%. An Escape scan of more than 5,600 vibe-coded apps surfaced 175 PII exposures, including medical records and IBANs.
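To make the Cross-Site Scripting number concrete, here is a minimal sketch of the pattern those scans flag. The framework, route names, and handlers are illustrative, not drawn from any of the scanned apps.

```python
# Minimal sketch of the XSS pattern (hypothetical routes; Flask for illustration).
from flask import Flask, request
from markupsafe import escape

app = Flask(__name__)

@app.route("/greet-unsafe")
def greet_unsafe():
    name = request.args.get("name", "")
    # Vulnerable: user input interpolated straight into an HTML response.
    # /greet-unsafe?name=<script>alert(1)</script> executes in the visitor's browser.
    return f"<h1>Hello, {name}</h1>"

@app.route("/greet")
def greet():
    name = request.args.get("name", "")
    # Fixed: escape() neutralises HTML metacharacters before they reach the page.
    return f"<h1>Hello, {escape(name)}</h1>"

if __name__ == "__main__":
    app.run(debug=False)
```

The fix is one function call. The failure is not knowing the call is needed, which is exactly the knowledge fluent AI output lets a confident user skip.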

The pattern repeats at smaller scale. Earlier this year, a small software consultancy took over a project another team had brought to "almost complete" and handed the remaining work to subcontractors with AI subscriptions. The founder had assumed AI would let the subcontractors maintain code they had not written. When the client began testing, the team could not fix one bug without creating another: nobody at the consultancy had read the inherited code, and the subcontractors had not read what the AI produced. Recovery took two months and probably cost the founder the relationship. For a small consultancy on a contract with a much larger client, that combination is existential.

The same dynamic now shows up in every white-collar profession with a paper trail.

In law. The genesis case was Mata v. Avianca in 2023, when a New York attorney with three decades of practice submitted a brief citing six entirely fabricated cases. When he asked ChatGPT whether one of the cases was real, the model doubled down: "Yes, it is a real case… available on Westlaw and LexisNexis." Damien Charlotin’s database of these incidents had logged 138 cases by June 2025; it now exceeds 1,200, and "10 cases from 10 different courts on a single day" is routine. A Stanford CIS analysis found that 160 of the US cases involve pro se litigants — people who couldn’t afford a lawyer and trusted a chatbot instead. In April 2026 a single Oregon lawyer was sanctioned $109,700 for AI-generated errors.

In medicine. A sixty-year-old man with no history of psychiatric illness asked ChatGPT for a chloride-free salt substitute. ChatGPT recommended sodium bromide — a compound used in cleaning products and pesticides, and formerly an anticonvulsant until it was phased out in the 1970s. He bought it online, replaced his table salt for three months, and ended up in a three-week involuntary psychiatric hold for paranoia and bromide intoxication. The doctors who wrote the case up in the Annals of Internal Medicine reproduced the prompt and got the same advice. A clinician would have asked one question — why? The model didn’t.

In customer service. Air Canada’s chatbot fabricated a bereavement-fare policy. The airline argued in tribunal that the chatbot was "a separate legal entity… responsible for its own actions" and lost. Klarna, having boasted in 2024 that its OpenAI chatbot did "the work of 700 customer-service agents," quietly walked it back in May 2025: "What you end up having is lower quality."

In government. New York City’s MyCity chatbot — Microsoft Azure-powered, launched by Mayor Adams in 2023 — systematically advised landlords they could illegally reject Section 8 vouchers, told employers they could pocket workers’ tips, and informed a restaurant operator that it was fine to serve cheese a rat had bitten. The bot stayed live for months afterwards.

These cases differ in mechanism — undertested deployment, missed process, expertise mismatch — but share a family resemblance: fluent AI output suppressing the doubt that would have stopped someone.

Banking and trading are absent from this list — not for lack of exposure, but because the controls infrastructure catches what would otherwise reach the headlines. The discipline that prevents the breach is the discipline this article is about.

The Dunning-Kruger curve, inverted

AI use bends the Dunning-Kruger curve. A 2025 study led by Aalto University with German and Canadian collaborators found that, across logical-reasoning tasks, AI users overestimated their performance regardless of skill level — and the most AI-literate users were the most overconfident. The researchers’ own framing: "We would expect people who are AI literate to not only be a bit better at interacting with AI systems, but also at judging their performance with those systems — but this was not the case."

Microsoft Research and Carnegie Mellon found a related result at CHI 2025. Across 319 knowledge workers, higher confidence in GenAI correlated with less critical thinking, while higher self-confidence correlated with more. The paper invokes Bainbridge’s "ironies of automation": by mechanising routine work and leaving exception-handling to humans, you deprive humans of the routine practice that builds judgement.

This is the engine.

AI doesn’t just produce confident answers.
It produces confident users.

The honest counter-evidence

Brynjolfsson, Li and Raymond’s analysis of 5,179 customer service agents using a GPT-based assistant found a 14% productivity gain on average — but a 34% gain for novices and minimal effect on experienced agents. AI codifies the tacit best practices of senior performers and shares them with juniors. That is genuine democratisation.

But notice the asymmetry. AI helps people climb the curve; it does not put them at the top. Combined with METR: AI raises the floor without raising the ceiling. The work that needs senior judgement is the work AI cannot yet do for you.

What actually works

Andrej Karpathy coined "vibe coding" in February 2025: "you fully give in to the vibes, embrace exponentials, and forget that the code even exists." He has since clarified that the term described throwaway weekend projects, not code he professionally cares about.

Simon Willison, the Django co-creator, has proposed the constructive synthesis: vibe engineering. "If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all, that’s not vibe coding in my book — that’s using an LLM as a typing assistant." His golden rule: "I won’t commit any code to my repository if I couldn’t explain exactly what it does to somebody else."

That distinction is the whole article. The work that ends up in production — code, court filings, billing systems, clinical workflows — is still done by people who can tell when the model is wrong. The operational discipline (spec-driven development, dev/prod separation, code review, secrets management) is what senior engineers internalise and no prompt captures. It is the gap between shipping a product and shipping a breach.
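To make "dev/prod separation" concrete, here is a minimal sketch of a fail-closed environment guard, the kind of control that stops a Replit-style deletion before it starts. The environment variable and function names are hypothetical, not taken from any system described above.

```python
# Minimal sketch of a fail-closed dev/prod guard (hypothetical names throughout).
import os

class EnvironmentGuardError(RuntimeError):
    """Raised when a destructive operation is attempted outside development."""

def require_non_production(operation: str) -> None:
    # Fail closed: if APP_ENV is unset, assume production.
    env = os.environ.get("APP_ENV", "production")
    if env == "production":
        raise EnvironmentGuardError(
            f"refusing {operation!r} in production; "
            "destructive changes go through reviewed migrations"
        )

def reset_database() -> None:
    require_non_production("reset_database")
    print("dropping and recreating schema...")  # destructive work would go here

if __name__ == "__main__":
    reset_database()  # raises unless APP_ENV is explicitly set to a non-production value
```

The ten lines are trivial; knowing to fail closed when the environment is ambiguous is the part experience supplies.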

At a publicly listed, FCA-regulated firm earlier this year, I used AI to accelerate work on one of its most important systems — a data warehouse whose working knowledge lived almost entirely in one engineer’s head. Spec-driven development let me write the plan tightly enough for implementation to run autonomously, and the same exercise captured that knowledge in a form the firm could keep using. None of it would have worked without two decades of full-stack experience behind the spec.

That is the work we do. After two decades of it, none of the patterns in this piece are theoretical to me. AI doesn’t just produce confident answers — it produces confident users.

Bringing AI into work that matters?

We pair two decades of full-stack engineering with disciplined AI use — spec-driven, reviewed, and shipped to production. The work AI cannot do for you is the work we do.
Start a Conversation