We Reviewed 473 AI-Built Codebases. The Security Crisis Is Worse Than Anyone Is Reporting. -- ShipIt Clean

May 27, 2026

Research


TL;DR: Over four months, CodeForge analyzed 473 completed code reviews spanning web applications, APIs, SaaS platforms, and open-source projects. The system deployed up to 42 specialized AI security agents per review and consumed 13.5 billion tokens of analysis. The result: 210,798 total findings, of which 54,126 are security-specific. 86% of codebases contain at least one high-severity vulnerability, 73% contain at least one critical vulnerability, and 10% contain multi-step exploit chains -- the class of vulnerability that a single-pass scanner is structurally unable to detect. This is not a sample. This is our entire production dataset.

The Numbers

Codebases Reviewed: 473
Total Findings: 210,798
Security Findings: 54,126
Tokens Analyzed: 13.5B
Have High-Severity Vulns: 86%
Have Critical Vulns: 73%
Critical Findings: 4,417
High Findings: 45,368

Why This Report Exists

Several security firms have published reports quantifying the problem of AI-generated code. Sherlock Forensics found that 92% of AI-generated codebases contain critical vulnerabilities. Snyk, Veracode, and others have released similar findings. These reports are directionally correct and we applaud the work.

But they all share a structural limitation: they're built on single-pass analysis. One model, one scan, one set of eyes. That methodology can find injection flaws, hardcoded secrets, and missing rate limiters -- the vulnerabilities that exist in isolation within a single file. What it cannot find is the class of vulnerability that emerges from the interaction between components.

CodeForge was built specifically for this gap. We deploy up to 42 specialized security agents per review -- each focused on a different attack surface -- with a consensus pipeline that cross-validates findings and eliminates false positives. The result is a dataset that captures not just what's broken in AI-generated code, but how it breaks when the pieces are assembled.

The Security Breakdown

Category	Findings	Critical	High
General Security	37,791	1,510	25,325
Input Validation	2,619	5	769
Denial of Service	1,960	2	571
Error Handling & Info Leakage	1,900	0	111
Data Exfiltration	1,482	12	699
Credential Exposure	1,471	279	206
Tenant Isolation	1,458	219	703
Multi-Step Attack Chains	1,429	115	524
Authentication & Authorization	1,354	15	278
Edge-Case Security	1,340	3	153
Cryptographic Weakness	1,322	2	79

The credential exposure number deserves emphasis: 1,471 findings across 37 reviews. That's API keys, database passwords, JWT secrets, and cloud credentials either hardcoded in source or committed to version control. In AI-generated code, this happens because the model optimizes for "make it work" -- and hardcoding a secret is the fastest path to a working demo. The AI doesn't think about what happens when that code reaches a public repository.
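As a minimal sketch of the alternative to hardcoding, the snippet below resolves a secret at startup from a mapping that is filled in at deploy time and fails loudly when the value is missing. The names `load_secret` and `MissingSecretError` are hypothetical, and the mapping is only a stand-in for whatever secret store a team actually uses.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret has not been provided."""


def load_secret(name, source=None):
    """Return the secret called `name`, or fail loudly if it is unset.

    `source` is any mapping of secret names to values. It defaults to the
    process environment purely for illustration.
    """
    values = os.environ if source is None else source
    value = values.get(name, "").strip()
    if not value:
        # Failing at startup beats silently falling back to a baked-in default.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value


# Anti-pattern described above: JWT_SECRET = "dev-secret-123" in source code.
# Hedged alternative: look the value up when the process starts.
if __name__ == "__main__":
    demo_source = {"JWT_SECRET": "value-injected-at-deploy-time"}
    print(len(load_secret("JWT_SECRET", demo_source)))
```

The point of the design is that nothing secret lives in the repository, so a leaked or public repository exposes the variable name and not the value.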

The Problem Single-Pass Scanners Can't See

1,429 of our findings are classified as multi-step attack chains -- vulnerabilities that only become exploitable when code from multiple files or modules interacts in a specific sequence. No individual file contains the vulnerability. The vulnerability is the interaction.

We proved this architecture against Mozilla Firefox itself. Our 42-agent scan of Firefox's full source processed 44 million tokens of code and found 8,475 cross-interaction exploit paths, of which 2,712 survived adversarial verification. These include chains where an unsanitized input in one module feeds into a privileged execution path in another -- patterns that are invisible to any tool that reads code one file at a time.

Why this matters for AI-generated code specifically: When a human developer builds a system, they carry architectural context across files. They know that the auth middleware protects the admin route. AI code assistants generate each file in relative isolation. They produce correct components that compose into insecure systems. The vulnerability isn't in the code the AI wrote -- it's in the assumptions the AI didn't make.
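To make the "correct components, insecure system" idea concrete, here is a small sketch that audits a route table for sensitive paths with no auth guard attached. The `Route` type, the `require_auth` guard name, and the example paths are all assumptions made for illustration; they do not come from any reviewed codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    path: str
    handler: str
    guards: list = field(default_factory=list)  # middleware names applied to this route


def unguarded_routes(routes, sensitive_prefixes=("/admin",), required_guard="require_auth"):
    """Return paths under a sensitive prefix that lack the required guard."""
    prefixes = tuple(sensitive_prefixes)
    return [
        route.path
        for route in routes
        if route.path.startswith(prefixes) and required_guard not in route.guards
    ]


# Hypothetical application: the auth middleware exists and is used once,
# but the export endpoint was generated without it.
ROUTES = [
    Route("/login", "auth.login"),
    Route("/admin/users", "admin.list_users", guards=["require_auth"]),
    Route("/admin/export", "admin.export_all"),
]

if __name__ == "__main__":
    print(unguarded_routes(ROUTES))
```

Each handler here can be individually correct. The gap only shows up when the route table is checked as a whole.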

What the AI Gets Wrong

After reviewing 473 codebases, clear patterns emerge in how AI-generated code fails at security:

1. Authentication is implemented but not enforced. AI models are excellent at generating login flows, JWT signing, and session management. They routinely forget to apply those mechanisms to the routes that need them. We see beautifully implemented auth middleware sitting next to unprotected admin endpoints in the same codebase.

2. Input validation exists at the front door but nowhere else. Requests are validated at the API layer, but raw user input still flows into database queries, file paths, and shell commands deeper in the call stack. The AI validates where it's obvious and trusts everything downstream.

3. Secrets are treated as configuration. API keys in .env files that get committed to version control. Database passwords in config files that ship in Docker images. The AI follows the pattern it was trained on -- and its training data is full of tutorials that hardcode credentials for simplicity.

4. Error handlers leak internal state. Stack traces, database schemas, file paths, and dependency versions returned in production error responses. The AI generates helpful error messages. Helpful to attackers.

5. Multi-tenant boundaries don't exist. 1,458 tenant isolation findings -- user A's data accessible to user B through predictable IDs, missing ownership checks, or shared resource pools without access control. AI models generate single-user applications. Adding multi-tenancy after the fact is where the critical vulnerabilities live.
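Patterns 2 and 5 can be illustrated together. The sketch below uses Python's standard `sqlite3` module with a hypothetical `invoices` table: placeholders keep user input out of the SQL text, and an `owner_id` predicate scopes every lookup to the requesting user. The table, columns, and function names are assumptions for the example.

```python
import sqlite3


def get_invoice(conn, invoice_id, current_user_id):
    """Fetch an invoice only if it belongs to the requesting user."""
    row = conn.execute(
        # Placeholders keep user input out of the SQL text (pattern 2);
        # the owner_id predicate enforces the tenant boundary (pattern 5).
        "SELECT id, owner_id, amount FROM invoices WHERE id = ? AND owner_id = ?",
        (invoice_id, current_user_id),
    ).fetchone()
    return row  # None when the invoice is missing or owned by someone else


def make_demo_db():
    """Build an in-memory database with two invoices owned by different users."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (id INTEGER PRIMARY KEY, owner_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?)",
        [(1, 10, 99.0), (2, 20, 15.5)],
    )
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(get_invoice(db, 1, 10))  # owner asks for their own invoice
    print(get_invoice(db, 1, 20))  # another user guesses the same ID
```

Returning the same "not found" result for a missing record and for someone else's record also avoids confirming that a guessed ID exists.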

How CodeForge Works -- And Why It Finds What Others Miss

Single-Pass Analysis

One model reads the code
Findings from one perspective
File-level granularity
No cross-validation
High false-positive rate
Cannot detect interaction vulnerabilities

CodeForge Multi-Agent Review

Up to 42 specialized agents per review
Each agent owns a different attack surface
Cross-file and cross-module analysis
Consensus pipeline eliminates false positives
Agent memory tracks patterns across chunks
CrossForge engine maps exploit chains
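One way to picture exploit-chain mapping is as reachability over a call graph: start at functions that receive untrusted input and search for paths that reach a privileged operation without passing through validation. The sketch below is only an illustration of that idea with a hypothetical graph; it is not a description of how the CrossForge engine is implemented.

```python
from collections import deque

# Hypothetical call graph: "module.function" -> functions it passes data to.
CALL_GRAPH = {
    "api.upload_handler": ["storage.save_file"],
    "storage.save_file": ["jobs.enqueue_convert"],
    "jobs.enqueue_convert": ["worker.run_shell"],
    "api.health": [],
}
UNTRUSTED_SOURCES = {"api.upload_handler"}
PRIVILEGED_SINKS = {"worker.run_shell"}


def find_chains(graph, sources, sinks, sanitizers=frozenset()):
    """Breadth-first search for paths from untrusted sources to privileged sinks."""
    chains = []
    for source in sorted(sources):
        queue = deque([[source]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in sinks:
                chains.append(path)
                continue
            for nxt in graph.get(node, []):
                if nxt in path or nxt in sanitizers:
                    continue  # skip cycles and paths that pass through validation
                queue.append(path + [nxt])
    return chains


if __name__ == "__main__":
    for chain in find_chains(CALL_GRAPH, UNTRUSTED_SOURCES, PRIVILEGED_SINKS):
        print(" -> ".join(chain))
```

No single function in the example graph is a finding on its own. The chain only exists once the four modules are read together, which is the property a file-at-a-time pass cannot see.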

The consensus pipeline is the critical differentiator. When Agent Razor flags an SQL injection and Agent Permit independently flags missing authorization on the same endpoint, and Agent Schema confirms the database schema allows the escalation -- that's a corroborated finding with a verified attack path. When only one agent sees it and no others can reproduce the concern, it gets downgraded or eliminated. Our 99.8% false-positive elimination rate in the Firefox scan came from this architecture, not from a better model.
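A rough sketch of the corroboration rule described above: findings on the same target that come from at least two distinct agents keep their severity, and uncorroborated findings are downgraded. The two-agent threshold, the one-step downgrade, and the `Finding` fields are assumptions chosen for illustration, not the production pipeline's actual parameters.

```python
from collections import defaultdict
from dataclasses import dataclass, replace

SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


@dataclass(frozen=True)
class Finding:
    agent: str     # which agent reported it
    target: str    # endpoint or code location the finding refers to
    issue: str     # short label for the concern
    severity: str  # one of SEVERITY_ORDER


def downgrade(severity):
    """Drop a severity by one level, never below 'info'."""
    idx = SEVERITY_ORDER.index(severity)
    return SEVERITY_ORDER[max(idx - 1, 0)]


def apply_consensus(findings, min_agents=2):
    """Keep corroborated findings as-is; downgrade the uncorroborated ones."""
    agents_by_target = defaultdict(set)
    for finding in findings:
        agents_by_target[finding.target].add(finding.agent)

    reviewed = []
    for finding in findings:
        if len(agents_by_target[finding.target]) >= min_agents:
            reviewed.append(finding)
        else:
            reviewed.append(replace(finding, severity=downgrade(finding.severity)))
    return reviewed


if __name__ == "__main__":
    sample = [
        Finding("Razor", "/admin/export", "sql-injection", "critical"),
        Finding("Permit", "/admin/export", "missing-authorization", "high"),
        Finding("Schema", "/admin/export", "schema-allows-escalation", "high"),
        Finding("Razor", "/health", "verbose-banner", "medium"),
    ]
    for item in apply_consensus(sample):
        print(item.target, item.agent, item.severity)
```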

Methodology

Dataset

473 completed reviews (Jan-May 2026). 373 direct code submissions, 49 full repository scans, 38 ZIP uploads, 13 pull request reviews. Languages: primarily Python, JavaScript/TypeScript, Go, Rust. Applications range from pre-launch MVPs to production systems.

Analysis Engine

CodeForge multi-agent hostile review. Up to 42 agents per scan across 5 LLM tiers (Claude, Kimi, DeepSeek, GPT-4o, local Qwen 9B). 13.5 billion tokens of total analysis. Consensus pipeline with adversarial verification on all findings rated High or above.

Classification

Findings mapped to OWASP Top 10 2021, CWE taxonomy, and internal CodeForge severity matrix. Severity levels: Critical (exploitable with system-level impact), High (exploitable with significant impact), Medium (exploitable under specific conditions), Low (defense-in-depth concern), Info (observation).
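The five severity levels can be written down as an ordered type, which makes rules such as "High or above" a simple comparison. This is a sketch only: the numeric values and the helper name `requires_adversarial_review` are assumptions, while the level descriptions are taken from the definitions above.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0      # observation
    LOW = 1       # defense-in-depth concern
    MEDIUM = 2    # exploitable under specific conditions
    HIGH = 3      # exploitable with significant impact
    CRITICAL = 4  # exploitable with system-level impact


def requires_adversarial_review(severity):
    """True for findings rated High or above."""
    return severity >= Severity.HIGH


if __name__ == "__main__":
    for level in Severity:
        print(level.name, requires_adversarial_review(level))
```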

Validation

Multi-agent consensus required for High and Critical findings. Adversarial verifier agent challenges each finding with counter-arguments. Findings that don't survive adversarial review are downgraded. PoC testing performed on select critical chains. Survival rate tracked per agent for quality calibration.
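Per-agent survival rate is a simple ratio: findings that survived adversarial review divided by findings the agent raised. The sketch below computes it from (agent, survived) pairs; the input shape and the sample verdicts are hypothetical.

```python
from collections import Counter


def survival_rates(verdicts):
    """Map each agent to the fraction of its findings that survived review.

    `verdicts` is an iterable of (agent_name, survived) pairs.
    """
    raised = Counter()
    survived = Counter()
    for agent, ok in verdicts:
        raised[agent] += 1
        if ok:
            survived[agent] += 1
    return {agent: survived[agent] / raised[agent] for agent in raised}


if __name__ == "__main__":
    sample = [
        ("Razor", True), ("Razor", False), ("Razor", True), ("Razor", False),
        ("Permit", True),
    ]
    print(survival_rates(sample))
```

A persistently low rate for one agent would suggest its prompts or thresholds need recalibrating, which is presumably the purpose of tracking it.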

Recommendations

AI code generation is not going away, and it shouldn't. The productivity gains are real. But the security gap is equally real, and it requires a deliberate response:

Treat AI-generated code as untrusted input. The same way you wouldn't trust user-submitted form data without validation, don't trust AI-generated code without security review. It's fast, it's functional, and it's probably insecure.
Multi-perspective review is not optional. A single-pass scan will find the obvious flaws. The dangerous vulnerabilities -- the ones that lead to breaches -- live in the interactions between components. You need multiple specialized reviewers examining different attack surfaces.
Scan before you ship, not after you're breached. 73% of the codebases in our dataset contain critical vulnerabilities. If you're shipping AI-generated code to production without hostile review, the question isn't whether you have vulnerabilities -- it's whether anyone has found them yet.
Fix credential hygiene at the source. Environment variables are not secrets management. Use a vault. Scan every commit for high-entropy strings that look like keys or tokens. Rotate anything that's been committed, even to a private repository.
Test the composition, not just the components. Unit tests prove individual functions work. Integration security testing proves they work safely together. AI code passes unit tests and fails at composition. Test accordingly.
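The commit-scanning recommendation above can be sketched with Shannon entropy: long tokens whose characters are close to uniformly distributed are more likely to be keys than identifiers. The 20-character minimum and the 4.0-bit threshold below are assumptions picked for the example, and a real scanner would need tuning and an allowlist to keep false positives down.

```python
import math
import re
from collections import Counter

# Candidate tokens: runs of 20+ characters drawn from a base64-like alphabet.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(text):
    """Entropy in bits per character of a non-empty string."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_high_entropy_tokens(line, threshold=4.0):
    """Return tokens in `line` whose entropy meets or exceeds `threshold`."""
    return [
        token
        for token in TOKEN_RE.findall(line)
        if shannon_entropy(token) >= threshold
    ]


if __name__ == "__main__":
    # Made-up value for demonstration; not a real credential.
    suspicious = 'API_KEY = "q7Zk2Lx9Vb4Nw8Rt5Ym1Hc3Jd6Fg0Ps"'
    harmless = 'padding = "' + "a" * 24 + '"'
    print(find_high_entropy_tokens(suspicious))
    print(find_high_entropy_tokens(harmless))
```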

About This Report

This report is based on the complete production dataset from ShipIt Clean's CodeForge engine -- 473 completed reviews, 210,798 findings, and 13.5 billion tokens of multi-agent analysis conducted between January and May 2026. No sampling, no extrapolation, no synthetic data. Every number in this report comes from real code reviewed by real agents.

CodeForge is available now. Submit your codebase for review and find out what your AI wrote that you haven't seen yet.

Disclosure: ShipIt Clean is a commercial code review platform. We have an obvious interest in people scanning their code. We also have the largest real-world dataset on AI code security in production. Both things are true. Draw your own conclusions from the data.