Last month, a developer friend told me he’d tripled his output with Copilot. “I’m mass-producing code now,” he said, genuinely proud. When I asked how he was measuring it, he paused. “I mean… it feels faster.” That pause is the entire story of AI productivity claims in 2025.

Here’s a number that should bother you: in the most rigorous independent study of AI coding tools to date, experienced open-source developers were 19% slower when using AI assistance. Not faster. Slower. But here’s the part that really stings — those same developers believed they were 20% faster.

That perception-reality gap isn’t a footnote. It’s the headline.

The study nobody wanted to talk about

In early 2025, METR (Model Evaluation & Threat Research) ran a randomized controlled trial with 16 experienced open-source developers across 246 real-world tasks in their own repositories. Not toy problems. Not “write a function that reverses a string.” Actual issues in codebases these developers knew intimately.

The setup was clean: tasks were randomly assigned to “AI allowed” or “AI not allowed” conditions. Before starting, the developers forecast how much faster AI would make them: a 24% speedup. The actual result? A 19% slowdown.

[Figure: bar chart comparing predicted versus actual AI productivity impact]

Let that sink in. The developers weren’t just wrong about the magnitude. They were wrong about the direction.

“Software engineering is not typing; it is thinking. When we apply this ‘typing accelerator’ to complex tasks, we create a ‘thinking decelerator.’” — METR study findings

The developers in the study accepted less than 44% of AI-generated code. They spent significant time reviewing, editing, and debugging AI output — time that often exceeded what it would’ve taken to just write the code themselves. (I’ve experienced this firsthand: spending 15 minutes wrestling with a suggestion that took me 3 minutes to write from scratch, because I kept thinking “it’s almost right, just one more tweak.”)

Follow the money

Now, you might be thinking: but what about all those studies showing massive productivity gains? GitHub’s famous Copilot study found developers were 55.8% faster. Google reported ~21% faster code completion. Those are real studies published in real venues. What gives?

Let me show you something. Here’s every major AI coding productivity study I could find, side by side:

| Study | Funded by | Finding |
| --- | --- | --- |
| GitHub Copilot RCT (2023) | GitHub / Microsoft | 55.8% faster |
| Google Internal RCT (2024) | Google | ~21% faster |
| Microsoft 3-Week Study (2024) | Microsoft | No significant change |
| METR RCT (2025) | Independent | 19% slower |
| Uplevel Data Labs (2024) | Independent | No gain, 41% more bugs |

See the pattern? The studies funded by companies selling AI tools show the biggest gains. The independent studies show no gain — or negative results. This isn’t a conspiracy theory; it’s a well-documented phenomenon in research called funding bias. Pharmaceutical companies figured this out decades ago: if you fund the study, you tend to get the answer you’re looking for.

The GitHub study, for instance, measured how fast developers completed a single, self-contained JavaScript task. Not a complex feature spanning multiple files. Not a debugging session in a legacy codebase. A single, clean task — exactly the kind of work where autocomplete shines. Google’s study measured code completion acceptance, not whether the code was correct or whether it introduced subtle bugs.

And then there’s the Microsoft 3-week study, which is particularly interesting because Microsoft makes Copilot. Even they couldn’t find a statistically significant productivity improvement when they studied actual developer workflows over three weeks instead of isolated tasks.

The bug problem nobody mentions

Uplevel Data Labs tracked real engineering teams over six months — before and after Copilot adoption. Their findings? No measurable change in throughput. But there was a measurable change in something else: bug rates went up by 41%.

Think about what that means. Even if you are producing code faster (and the data says you’re probably not), you’re producing buggier code. That 41% increase in defects has to be fixed eventually. By whom? By you, or by someone on your team, in a future debugging session that’s now harder because the code wasn’t fully understood when it was written.

The gains from faster code generation are often consumed by the cost of fixing what was generated.

This matches what I see in my own work. AI-generated code tends to be plausible-looking but subtly wrong — the kind of wrong that passes a quick review but breaks in production. Off-by-one errors, incorrect edge case handling, assumptions about APIs that are almost right but not quite. The code compiles. The tests pass (the ones that exist, anyway). The bug shows up three weeks later.
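To make that concrete, here is a contrived Python sketch of the pattern. It isn't output from any particular tool, just an invented example of the shape I keep seeing: a pagination helper that reads like something a competent developer would write, passes a casual spot check, and quietly drops the last item of every page because of an off-by-one in the slice.

```python
def get_page(items, page, page_size):
    """Return one page of results, 1-indexed. Looks fine at a glance."""
    start = (page - 1) * page_size
    # Subtle bug: the end index should be start + page_size. Python slices
    # already exclude the end, so the extra "- 1" drops each page's last item.
    end = start + page_size - 1
    return items[start:end]


# The kind of shallow check that "passes": the first items are where we
# expect them, so a quick review moves on.
assert get_page(list(range(10)), page=1, page_size=5)[:3] == [0, 1, 2]

# The defect only shows up when someone counts.
print(len(get_page(list(range(10)), page=1, page_size=5)))  # prints 4, not 5
```

Nothing about that code looks risky in a diff, and that is exactly the problem.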

The Atlassian Paradox

Here’s a question worth asking: even if AI did make coding faster, would it actually matter?

Atlassian’s 2025 Developer Experience Report found that developers spend only 16% of their time actually writing code. The rest goes to meetings, context switching, waiting on reviews, navigating organizational processes, and — my personal favorite — searching for documentation that may or may not exist.

So AI is optimizing 16% of the workday. Meanwhile, 50% of developers in the same report said they lose more than 10 hours per week to organizational friction — things like unclear requirements, blocked dependencies, and cross-team coordination.

The math doesn’t work. Even a miraculous 50% speedup on the coding portion saves you roughly 3.2 hours per week. But you’re losing 10+ hours to friction that AI doesn’t touch. The gains get absorbed before they ever reach your output.
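The arithmetic behind that claim is short enough to write out. A minimal sketch, assuming a 40-hour week (my assumption; the percentages come from the report):

```python
# Back-of-envelope version of the paragraph above.
# Assumption: a 40-hour work week (not a number from the Atlassian report).
hours_per_week = 40
coding_share = 0.16        # Atlassian: ~16% of time goes to writing code
friction_hours = 10        # half of developers report losing 10+ hours/week

coding_hours = hours_per_week * coding_share    # 6.4 hours of actual coding
best_case_savings = coding_hours * 0.5          # a generous 50% speedup

print(f"coding time:       {coding_hours:.1f} h/week")       # 6.4
print(f"best-case savings: {best_case_savings:.1f} h/week")  # 3.2
print(f"friction losses:   {friction_hours}+ h/week")
```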

It’s like putting a turbocharger on a car that’s stuck in traffic. The engine is faster, sure. You’re still not moving.

Ninety-nine percent of developers in the Atlassian survey reported time savings from AI tools. But when measured against actual project velocity, those savings vanished into the organizational machinery. The perception of speed, again, decoupled from the reality of delivery.

The trust is eroding

The Stack Overflow 2025 Developer Survey — one of the largest annual snapshots of how developers actually work — tells another side of this story. 84% of developers now use AI coding tools. Adoption is through the roof. But trust? Trust is cratering.

46% of developers don’t trust the accuracy of AI output, up from 31% the year before. Only 3% say they “highly trust” AI-generated code. The number one frustration, reported over and over: “AI solutions that are almost right, but not quite.”

That phrase — almost right, but not quite — is the most expensive phrase in software engineering. Code that’s obviously wrong gets caught and discarded. Code that’s almost right gets merged, deployed, and becomes someone else’s debugging nightmare.

(I’ve started calling this the “uncanny valley of code” — it looks correct at a glance, it reads like something a competent developer would write, and it’s just wrong enough to be dangerous.)

The Faros AI data

Faros AI, which aggregates engineering metrics across hundreds of teams, has been tracking AI tool impact on actual delivery metrics. Their data paints a nuanced picture: while individual task completion times sometimes decrease, overall cycle time — the time from starting work to shipping it — hasn’t meaningfully improved for most teams.

Why? Because the bottleneck was never typing speed. It was understanding the problem, making design decisions, coordinating with other humans, and verifying correctness. AI accelerates the one part of the process that was already the fastest.

What Cal Newport gets right

Cal Newport, the Georgetown computer science professor who’s built a career studying how knowledge workers actually work, wrote something in late 2025 that I keep coming back to:

“No one knows anything about AI.”

His point isn’t nihilistic. It’s that we’re in a period of such rapid change that confident proclamations — “AI will replace developers” or “AI makes everyone 10x more productive” — are almost certainly wrong. The honest answer is that the evidence is contradictory, the studies are compromised by funding bias, and the long-term effects are genuinely unknown.

The California Management Review published an analysis identifying seven myths about AI and productivity, concluding that most of the headline claims don’t survive contact with rigorous methodology. The productivity gains that do exist tend to be narrow: specific tasks, specific contexts, specific levels of developer experience. They don’t generalize the way the marketing suggests.

So what do we actually know?

After spending weeks reading every study I could find, here’s my honest summary:

  • AI tools help with boilerplate and simple, well-defined tasks. Writing a basic CRUD endpoint, generating test stubs, scaffolding configuration files — these get faster. (See the sketch after this list.)
  • AI tools don’t help (and may hurt) with complex, context-heavy work. Debugging, architecture decisions, working in unfamiliar codebases, anything requiring deep understanding of business logic.
  • The productivity gains that exist are much smaller than advertised and tend to be measured in ways that favor the tool vendor.
  • Bug rates appear to increase, which may offset or exceed any time savings.
  • Developer perception is unreliable. We consistently believe AI is helping more than it actually is.
  • The bottleneck in software delivery isn’t coding speed. It’s everything else.
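Here is what I mean by that first bullet. The snippet below is a hypothetical illustration of the boilerplate category (the names and fields are invented, not taken from any real project or tool output): a tiny in-memory CRUD store with no tricky logic, easy to specify and cheap to review.

```python
# Hypothetical example of "boilerplate that gets faster": a minimal
# in-memory CRUD store. Invented names; the point is the shape of the task.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class User:
    id: int
    email: str
    name: str


@dataclass
class UserStore:
    _users: Dict[int, User] = field(default_factory=dict)
    _next_id: int = 1

    def create(self, email: str, name: str) -> User:
        user = User(id=self._next_id, email=email, name=name)
        self._users[user.id] = user
        self._next_id += 1
        return user

    def get(self, user_id: int) -> Optional[User]:
        return self._users.get(user_id)

    def delete(self, user_id: int) -> bool:
        return self._users.pop(user_id, None) is not None
```

Nothing in there requires understanding the business logic, the architecture, or the history of a codebase, which is precisely why generating it is fast and verifying it is cheap.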

None of this means AI coding tools are useless. I use them daily. But the gap between what the headlines promise and what the evidence shows is enormous — and that gap is costing teams real money as they restructure workflows around a productivity gain that may not exist.

The first step toward actually benefiting from these tools is dropping the fantasy that they’ve made us all 10x engineers. They haven’t. And until we’re honest about that, we’ll keep optimizing for a metric that doesn’t matter while ignoring the ones that do.


Next in this series: Part 3 digs into what AI actually changes about the developer’s daily experience — the cognitive load, the skill atrophy, and the new kinds of bugs nobody warned us about.


Sources