Last November, I reviewed a pull request that looked immaculate. Clean variable names, reasonable structure, inline comments explaining the logic. It was a payment flow handler — the kind of code where getting it wrong means customers get charged twice or not at all. I almost approved it.
Then I noticed the retry logic. On failure, it retried the charge but never checked whether the first attempt had actually gone through. Under load, it would silently double-bill customers. The code was syntactically perfect and logically catastrophic. It had been generated by an AI tool, lightly edited, and submitted with a commit message that said “refactor payment handler.”
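To make the failure mode concrete, here's a minimal Python sketch of the pattern. The gateway object, the names, and the idempotency-key fix are hypothetical illustrations, not the code from that PR:

```python
import uuid

def charge_with_retry_unsafe(gateway, customer_id, amount_cents, attempts=3):
    """The shape of the bug: retry on failure, never ask what actually happened."""
    for _ in range(attempts):
        try:
            return gateway.charge(customer_id, amount_cents)
        except TimeoutError:
            # A timeout does not mean the charge failed. The gateway may have
            # processed it and only the response was lost. Retrying blindly
            # here is how customers get billed twice under load.
            continue
    raise RuntimeError("charge failed after retries")

def charge_with_retry(gateway, customer_id, amount_cents, attempts=3):
    """Safer shape: a client-generated idempotency key lets the gateway
    deduplicate repeated attempts at the same logical charge."""
    idempotency_key = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return gateway.charge(customer_id, amount_cents,
                                  idempotency_key=idempotency_key)
        except TimeoutError:
            continue
    raise RuntimeError("charge failed after retries")
```

Most real payment APIs expose something like that idempotency key precisely because retries and network timeouts are routine; the generated code simply never reached for it.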
That PR haunts me. Not because it was the worst code I’ve ever seen — it wasn’t. Because it was the most convincing bad code I’ve ever seen. And the data says this pattern isn’t an anecdote. It’s a trend.
The numbers nobody’s contesting
CodeRabbit, an AI code review platform, analyzed 470 pull requests — half human-written, half AI-generated — and published their findings in early 2025. The results aren’t subtle:
- AI-generated code contained 1.7x more issues overall
- 75% more logic errors
- Performance problems appeared 8x more often
- XSS vulnerabilities nearly tripled

These aren’t cherry-picked from some adversarial benchmark. These are real pull requests, reviewed by the same automated analysis, comparing real output from developers and AI tools in production contexts. And CodeRabbit isn’t an anti-AI company — they sell AI code review. Even they couldn’t spin the numbers.
The Uplevel study I covered in Part 2 found a 41% increase in bug rates after Copilot adoption. CodeRabbit’s data confirms it and adds granularity: AI doesn’t just produce more bugs, it produces more of every kind of bug. Logic errors. Performance issues. Security holes. The failure mode isn’t narrow — it’s systemic.
211 million lines don’t lie
If CodeRabbit gives us the close-up, GitClear gives us the aerial view. Their 2025 study analyzed 211 million lines of code across thousands of repositories — the largest code quality dataset available. What they found should alarm anyone responsible for a production codebase.
Refactoring collapsed. Before widespread AI adoption, refactoring accounted for roughly 25% of code changes. By 2025, it had sunk below 10%. For the first time in GitClear’s measurement history, copy-paste code surpassed refactoring as a proportion of all code changes.
Code churn doubled. Churn — code that gets written and then rewritten or deleted within a few weeks — is a leading indicator of poorly understood, poorly planned work. When developers write code themselves, they tend to think before they type. When AI generates it, there’s less upfront reasoning and more “generate, test, throw away, regenerate.”
Refactoring sank from 25% to less than 10%. Copy-paste surpassed refactoring for the first time ever. This is what a codebase on its way to becoming unmaintainable looks like.
Here’s what this means in practice. Refactoring is how codebases stay healthy. It’s the equivalent of regular maintenance on a building — not glamorous, not visible to end users, but absolutely essential to long-term structural integrity. When refactoring drops, technical debt accumulates. When copy-paste rises, that debt compounds. And when both happen simultaneously? You’re building on sand.
I’ve seen this in my own work. AI tools are phenomenal at generating new code but remarkably bad at improving existing code. They don’t understand the history of a codebase, the trade-offs that led to a particular design, or the implicit contracts between modules. So instead of refactoring, they add. And add. And add. The codebase grows, but it doesn’t improve.
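Here's a compressed, hypothetical example of what that looks like in a diff. The function names are invented; the shape of the change is the point:

```python
# What "add and add and add" tends to look like:
def export_users_csv(users):
    rows = ["id,email"]
    rows += [f"{u['id']},{u['email']}" for u in users]
    return "\n".join(rows)

def export_admins_csv(admins):  # copy-pasted later, one column added
    rows = ["id,email,role"]
    rows += [f"{a['id']},{a['email']},{a['role']}" for a in admins]
    return "\n".join(rows)

# The refactor the codebase actually needed: one function, explicit columns.
def export_csv(records, columns):
    rows = [",".join(columns)]
    rows += [",".join(str(r[col]) for col in columns) for r in records]
    return "\n".join(rows)
```

The second near-duplicate ships faster. The third version is what keeps the codebase maintainable, and it's exactly the kind of change the GitClear data says is disappearing.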
The silent failure problem
Here’s what scares me more than bugs that crash: bugs that don’t.
IEEE Spectrum published a warning in late 2025 about what they call “silent failures” in AI-generated code. These are cases where the code runs, produces output, passes basic validation — and the output is wrong. Not error-wrong. Subtly, quietly, undetectably wrong.
The specific patterns they documented are chilling. LLMs removing safety checks whose necessity they don’t “understand.” Generating output that matches the expected format without actually computing the correct result. Creating validation logic that looks rigorous but has gaps that only manifest with specific edge-case inputs.
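To make “looks rigorous but has gaps” concrete, here's a hypothetical Python sketch of the shape these validators tend to take. It illustrates the pattern; it's not code from the IEEE Spectrum piece:

```python
import re

def parse_discount_unsafe(value):
    """Looks careful: type check, regex, numeric conversion."""
    if not isinstance(value, str):
        return 0.0  # silently swallowing bad input is itself a silent failure
    if not re.fullmatch(r"\d{1,3}(\.\d+)?", value):
        return 0.0
    # The gap: "150" passes the regex, so a 150% discount sails through and
    # downstream code quietly computes a negative price. No exception, no log.
    return float(value)

def parse_discount(value):
    """Same checks, plus the range check and loud failures."""
    if not isinstance(value, str):
        raise TypeError(f"discount must be a string, got {type(value).__name__}")
    if not re.fullmatch(r"\d{1,3}(\.\d+)?", value):
        raise ValueError(f"malformed discount: {value!r}")
    percent = float(value)
    if not 0 <= percent <= 100:
        raise ValueError(f"discount out of range: {percent}")
    return percent
```

The first version passes a quick scan, passes happy-path tests, and fails only when a specific input shows up in production.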
One developer quoted in the IEEE Spectrum piece said something that stopped me cold:
“I am sometimes going back and using older versions of LLMs.”
Think about what that implies. The newer models are better at producing convincing-looking code, which makes them worse at producing code you can actually trust, because the subtle errors are harder to spot. The improvement in surface quality is masking a stagnation — or deterioration — in deep correctness.
This is worse than a crash. A crash is honest. It tells you something broke. A silent failure lies to you. It tells you everything’s fine while producing wrong results that propagate through your system, corrupt your data, and erode trust in ways that take months to diagnose. (I’ve spent more debugging hours tracking down silent failures than any other category of bug. They’re the ones that make you question your own sanity.)
The security nightmare
If code quality is the slow-burning crisis, security is the one with a fuse.
Veracode — one of the oldest names in application security — published a comprehensive study testing AI-generated code against standard security benchmarks. The headline number: 45% of AI-generated code fails OWASP Top 10 security tests. Nearly half. Of the most basic, well-documented, been-talking-about-them-for-twenty-years security vulnerabilities.
It gets worse by language and category:
- Java: 72% failure rate against OWASP Top 10
- Cross-Site Scripting (XSS): 86% failure rate
- Even the best-performing model (Claude Opus 4.5 Thinking) was only secure 56% of the time
Let me restate that last point. The best AI model available, running in its most careful reasoning mode, produces code with exploitable security vulnerabilities 44% of the time. That’s the ceiling, not the floor.
45% of AI-generated code fails OWASP Top 10 tests. These aren’t obscure vulnerabilities. They’re the basics. The ones we’ve been teaching developers to avoid for two decades.
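For the XSS number specifically, the failing pattern is usually the oldest one in the book: untrusted text dropped straight into markup. A hypothetical sketch, using nothing but the standard library:

```python
import html

def render_comment_unsafe(comment_text):
    # If comment_text is "<script>steal(document.cookie)</script>",
    # the script runs in every visitor's browser.
    return f"<div class='comment'>{comment_text}</div>"

def render_comment(comment_text):
    # Escape untrusted input before it touches markup, or use a template
    # engine with auto-escaping and never switch it off for "trusted" data.
    return f"<div class='comment'>{html.escape(comment_text)}</div>"
```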
The Lawfare Institute’s analysis goes further, documenting how AI models routinely generate code with SQL injection vulnerabilities, path traversal flaws, and insecure deserialization patterns — vulnerabilities that a junior developer in a decent security training program would know to avoid. The models aren’t just failing to follow best practices. They’re actively reproducing the worst practices from their training data.
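SQL injection is the easiest of those to show. A minimal sketch, using sqlite3 only because it ships with Python; the table and helper names are hypothetical:

```python
import sqlite3

def find_user_unsafe(conn, email):
    # String interpolation hands query structure to the attacker:
    # email = "x' OR '1'='1" returns every row; nastier payloads go further.
    query = f"SELECT id, email FROM users WHERE email = '{email}'"
    return conn.execute(query).fetchall()

def find_user(conn, email):
    # Parameter binding keeps the value a value; the input cannot
    # change the structure of the query.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()
```

Every mainstream database driver and ORM supports parameter binding. Reproducing the interpolated version in 2025 is a training-data problem, not a tooling gap.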
And here’s the kicker from the Veracode data: security performance isn’t improving even as syntax gets better. Newer models generate more readable, better-structured, more idiomatically correct code. But the security flaws persist at roughly the same rate. The models are getting better at looking right while remaining just as vulnerable.
The review bottleneck
“But we review all AI-generated code before merging it!” I hear this constantly. And it raises an uncomfortable question: how’s that going?
Faros AI, which aggregates engineering metrics across hundreds of development teams, has been tracking what happens to code review cycles as AI adoption increases. Their finding: PR review time increases by 91% on high-AI teams.
Read that again. The teams using AI the most — the ones generating code the fastest — are spending nearly double the time reviewing it. The bottleneck didn’t disappear. It moved. Instead of waiting for code to be written, teams are now waiting for code to be reviewed. And unlike code generation, review can’t be parallelized as easily or delegated to a machine (because the machine is the one that created the mess in the first place).
A developer quoted in The Register put it bluntly:
“1 out of 10 PRs created with AI is legitimate.”
One in ten. If that’s even close to representative, it means 90% of AI-generated pull requests are noise — code that shouldn’t have been submitted, code that needs substantial rework, code that wastes the reviewer’s time. We’ve traded a writing bottleneck for a review bottleneck and somehow convinced ourselves that’s progress.
The Forrester analysis covered by InfoQ connects this directly to technical debt: organizations adopting AI coding tools are accumulating debt faster than they can service it, because the review process can’t keep pace with the generation process. It’s like opening a fire hydrant into a garden hose — the volume has increased, but the capacity to handle it hasn’t.
The Osmani problem
Addy Osmani, an engineering lead at Google, described something he calls “the 80% problem” in agentic coding. AI tools can get you 80% of the way to a working solution remarkably fast. The remaining 20% — the edge cases, the error handling, the security hardening, the integration with existing systems — takes as long as it ever did. Sometimes longer, because now you’re debugging someone else’s approach rather than your own.
His observation cuts to the heart of the quality crisis:
“Each incremental % of AI progress requires exponentially more human oversight.”
This isn’t linear. Going from 80% to 90% doesn’t require 10% more human effort. It requires substantially more, because the final 20% is where all the complexity lives. It’s the error handling for conditions the AI didn’t anticipate. It’s the performance optimization for edge cases that don’t appear in benchmarks. It’s the security hardening that requires understanding the threat model, not just the syntax.
The Stack Overflow 2025 survey confirms this from the developer side: the number one frustration with AI tools remains “solutions that are almost right, but not quite.” That “almost” is doing heavy lifting. Almost right means it passes your initial scan. Almost right means it looks like it works in testing. Almost right means the bug ships to production because the gap between “almost” and “actually” is invisible at review speed.
The fundamental tension
Here’s what I keep coming back to after reading every study I can find on AI code quality: every study showing speed gains either doesn’t measure quality, or shows quality declining. No study has shown both speed improvements and quality improvements simultaneously.
Not one.
The Opsera 2025 benchmark report confirms a pattern that should trouble anyone building a team strategy around AI: teams report faster cycle times alongside increasing defect rates, growing security vulnerabilities, and expanding code churn. The speed is real. The quality cost is also real. And we’re treating them as separate conversations when they’re the same conversation.
This isn’t a conscious trade-off. That’s what worries me. A conscious trade-off means you’ve weighed the costs and benefits and decided the speed is worth the quality hit. What’s actually happening is closer to sleepwalking — teams adopt AI tools because the speed gains are visible and immediate, while the quality costs are diffuse and delayed. You see the pull request that was generated in ten minutes. You don’t see the production incident it causes three months later, or the security vulnerability it introduces, or the refactoring debt it adds to a codebase someone else will inherit.
What I actually do about this
I’m not going to tell you to stop using AI coding tools. I use them daily. But I’ve changed how I use them based on what the data shows, and I’d suggest you do the same.
I never let AI write security-critical code. Authentication, authorization, payment handling, data validation, encryption — I write these by hand. The 45% OWASP failure rate is disqualifying. Full stop.
I review AI code like I’d review a junior developer’s code — line by line, with skepticism, checking assumptions. Not a quick scan. An actual review. If that sounds slow, it is. But it’s faster than debugging a production incident.
I use AI for exploration, not production. When I’m prototyping an approach, sketching out a data structure, or trying to understand an unfamiliar API, AI tools are genuinely helpful. When I’m writing code that will ship, they’re a starting point, not a finished product.
I refactor aggressively. The GitClear data about declining refactoring rates is the most alarming number in this entire article, because it describes a slow death. If AI is encouraging your team to add new code instead of improving existing code, you need to consciously push back against that tendency.
This crisis isn’t about AI being bad. It’s about AI being convincing. The code looks right. The variable names are sensible. The structure is clean. And underneath that polished surface, the logic is wrong, the security is compromised, and the maintainability is eroding. The better AI gets at surface quality, the harder the real problems become to spot.
We’re not in a speed crisis. We’re in a verification crisis. And until the industry treats it that way, the numbers are only going to get worse.
Next in this series: Part 8 looks at what happens when entire organizations restructure around AI-generated output — the management decisions, the team dynamics, and the new failure modes nobody anticipated.
Sources
- CodeRabbit — State of AI vs. Human Code Generation Report
- Veracode — GenAI Code Security Report
- GitClear — AI Assistant Code Quality 2025 Research
- Uplevel — Gen AI for Coding
- Opsera — AI Coding Impact 2025 Benchmark Report
- IEEE Spectrum — AI Coding Degrades
- Forrester/InfoQ — AI Code and Technical Debt
- Addy Osmani — The 80% Problem in Agentic Coding
- Faros AI — AI in Software Engineering
- The Register — GitHub Kill Switch for AI Pull Requests
- Stack Overflow Developer Survey 2025 — AI Section
- Lawfare — When the Vibes Are Off: Security Risks of AI-Generated Code