We Scanned All of Firefox with 42 AI Security Agents. Here's What We Found -- Including What Anthropic's Mythos Can't See.
ShipItClean.com scanned the entire Mozilla Firefox repository. 44 million tokens of source code, scanned by 42 specialized security agents simultaneously -- that's over 1.8 billion tokens of analysis. And we did it twice.
The first time, we used our smallest model. A 9 billion parameter model fine-tuned for security review, running on a single NVIDIA 3090, with 36 agents -- about 1.6 billion tokens of analysis. The second time, we brought in the heavy artillery -- 42 domain-specific reviewers running on a 40 billion parameter model, scanning from scratch with a completely redesigned pipeline. 1.85 billion tokens.
The scan is complete. Here's what we found.
Why We Can See What Others Can't
Every AI security scanner on the market -- including Anthropic's Mythos -- has the same fundamental limitation: context window. Even Mythos, with its reported 1 million token context window, can hold roughly 750K words of text at once, and fewer lines of code than that. Firefox has millions of lines. So every scanner, including Mythos, has to analyze code in chunks. File by file. Function by function.
The problem is obvious: vulnerabilities don't care about file boundaries.
A harmless-looking input handler in one module. A data transform that strips validation in another. An eval three directories away that executes the result. Each file passes every security review on its own. The vulnerability exists only in the path between them.
We solved this with two things: Atlas and CrossForge.
Atlas is our deterministic RAG (Retrieval-Augmented Generation) system -- as far as we are aware, the first truly deterministic RAG system on the market. Built on SAIQL, our semantic AI query language, Atlas gives every reviewer access to the full codebase context regardless of their native context window. No fuzzy middle. No lost-in-the-middle degradation. Every reviewer can reference any file in the repository at any point during their analysis.
We've all experienced those moments where Claude or ChatGPT seems to go stupid mid-conversation. That's fuzzy context -- models read the beginning and end of their window fine, but the middle 50% degrades. To reliably analyze Firefox without that problem at our level, you'd need roughly 3.7 billion tokens of clean context -- double the actual content to account for the degraded middle. We effectively have that. Atlas holds it all, deterministically.
Update (2026): The "lost in the middle" problem has been largely solved by current-generation models -- Claude 4.x and GPT-4o show near-perfect recall across their full context windows. It still surfaces occasionally in very long sessions, but the 50% degradation described above was a 2023-2024 era limitation. Atlas remains valuable regardless: context windows still have hard ceilings, and deterministic retrieval from persistent memory solves a different problem than attention across a single window.
CrossForge (our cross-interaction analysis engine) then traces data flow, import chains, function calls, and configuration paths between findings across the entire codebase. It takes the per-file findings from our 42 reviewers and asks: do any of these connect? Does a medium-severity finding in module A become critical when combined with a high-severity finding in module B?
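To make the idea concrete, here is a minimal sketch of chaining per-file findings along data-flow edges. It is not CrossForge's actual implementation: the file names, the `FINDINGS` and `EDGES` structures, and the `source`/`sink` roles are all hypothetical, chosen only to illustrate the technique (a breadth-first search from untrusted sources to dangerous sinks).

```python
from collections import deque

# Hypothetical findings: location -> (module, severity, role in the data flow)
FINDINGS = {
    "forms/input.js:handleInput": ("forms", "medium", "source"),
    "utils/transform.js:normalize": ("utils", "medium", "passthrough"),
    "scripting/run.js:execute": ("scripting", "high", "sink"),
}

# Hypothetical data-flow edges: data leaving the key reaches each listed location.
EDGES = {
    "forms/input.js:handleInput": ["utils/transform.js:normalize"],
    "utils/transform.js:normalize": ["scripting/run.js:execute"],
}


def find_chains(findings, edges):
    """Breadth-first search from every source finding to every reachable sink."""
    chains = []
    sources = [loc for loc, (_, _, role) in findings.items() if role == "source"]
    for start in sources:
        queue = deque([[start]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if findings[node][2] == "sink":
                chains.append(path)
                continue
            for nxt in edges.get(node, []):
                # Only follow edges into known findings, and never revisit a node.
                if nxt in findings and nxt not in path:
                    queue.append(path + [nxt])
    return chains


def spans_modules(chain, findings):
    """A chain is cross-module when its findings live in more than one module."""
    return len({findings[loc][0] for loc in chain}) > 1


chains = [c for c in find_chains(FINDINGS, EDGES) if spans_modules(c, FINDINGS)]
```

In this toy graph, each of the three findings would pass review alone; only the path that links them is interesting.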
The First Scan: 9B Model, 4 Chains
Our first scan used Roasty -- our fine-tuned Qwen 3.5 9B model running locally. Even at that scale, CrossForge found 4 cross-interaction vulnerability chains that no single-file scanner could have detected.
One of them is a cross-module injection chain in a browser subsystem. Unsanitized input accepted in one component -- looking harmless in isolation -- flows through to a privileged execution context in a completely separate module. Each component passes individual review. The vulnerability only exists in the path between them.
We PoC-tested this chain. The injection path is confirmed present. Firefox's process isolation mitigates full escalation in current builds, so we rated it High rather than Critical. Details will be disclosed to Mozilla through responsible disclosure before any public release.
We held off submitting to Mozilla. If the 9B found 4 chains, what would a larger model find?
The Second Scan: 42 Reviewers, 86,965 Findings
For the second scan, we redesigned the pipeline and brought in 42 specialized reviewers running on a 40 billion parameter model -- each prompted for a specific vulnerability domain. Injection specialists, memory safety analyzers, cryptography auditors, authentication reviewers, race condition detectors, and more. Each reviewer independently scans every chunk of the codebase with no inter-reviewer communication.
Stage 1: Multi-Agent Scan
42 reviewers x 623 code chunks = 26,166 reviewer-chunk passes, which produced 86,965 consensus findings after deduplication and merge.
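A deduplication-and-merge step of this kind might look roughly like the sketch below. The record fields (`file`, `line`, `category`, `severity`, `reviewer`) and the merge key are assumptions made for illustration, not the pipeline's real schema.

```python
from collections import defaultdict

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def merge_findings(raw):
    """Collapse findings that share file, line, and category into one record."""
    grouped = defaultdict(list)
    for finding in raw:
        key = (finding["file"], finding["line"], finding["category"])
        grouped[key].append(finding)

    merged = []
    for (file, line, category), group in grouped.items():
        merged.append({
            "file": file,
            "line": line,
            "category": category,
            # Keep the highest severity any reviewer assigned.
            "severity": max((g["severity"] for g in group), key=SEVERITY_RANK.__getitem__),
            # Record which reviewers agreed on this location.
            "reviewers": sorted({g["reviewer"] for g in group}),
        })
    return merged
```

Keeping the list of agreeing reviewers on each merged record makes it possible to weigh findings by how many independent reviewers reported them.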
Stage 2: Source Indexing
2,811 source files indexed from the local Firefox clone. 20MB of source context built for downstream analysis.
Stage 3: CrossForge
This is where the magic happens. CrossForge treated all critical and high-severity findings as anchors, then evaluated medium-severity findings in the same directories as interaction candidates.
327 batches. 5 hours of processing. 8,475 cross-file interactions identified.
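The anchor-and-candidate grouping can be sketched as follows. This is an illustration under assumed field names (`file`, `severity`), not the real CrossForge batching logic.

```python
import posixpath
from collections import defaultdict

ANCHOR_SEVERITIES = {"critical", "high"}


def group_by_directory(findings):
    """Pair critical/high anchors with medium candidates from the same directory."""
    groups = defaultdict(lambda: {"anchors": [], "candidates": []})
    for finding in findings:
        directory = posixpath.dirname(finding["file"])
        if finding["severity"] in ANCHOR_SEVERITIES:
            groups[directory]["anchors"].append(finding)
        elif finding["severity"] == "medium":
            groups[directory]["candidates"].append(finding)
    # A directory without an anchor has nothing to chain candidates onto.
    return {d: g for d, g in groups.items() if g["anchors"]}
```

Restricting candidates to the anchor's directory keeps the number of pairs to evaluate manageable, at the cost of missing interactions between distant directories unless a later pass widens the search.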
8,475 potential exploit chains spanning multiple files and modules. Paths that no per-file scanner -- Mythos included -- would see.
Stage 4: False Positive Filter
At this point the pipeline is working with 86,965 original findings plus 8,475 CrossForge interactions -- 95,590 total. A fast CPU-only pass removes obvious junk -- findings where the scanner found the word "vulnerability" in a string literal, or flagged security scanning code as insecure. This removed about 150 findings and demoted some severity levels, leaving 95,440 findings.
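A cheap filter along these lines could be sketched as below. The field names (`snippet`, `keyword`, `file`) and the path hints are hypothetical; the real filter's rules are not shown in this post.

```python
import re

# Matches single- or double-quoted string literals, allowing escaped characters.
STRING_LITERAL = re.compile(r"""("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')""")

# Hypothetical path fragments that mark security-scanning code.
SCANNER_PATH_HINTS = ("tools/security-scanner/",)


def keyword_only_in_string(line, keyword):
    """True if the keyword appears on the line, but only inside string literals."""
    if keyword not in line.lower():
        return False
    without_strings = STRING_LITERAL.sub('""', line)
    return keyword not in without_strings.lower()


def is_obvious_junk(finding):
    """Flag findings triggered by string contents or by scanner code itself."""
    if keyword_only_in_string(finding["snippet"], finding["keyword"]):
        return True
    return any(hint in finding["file"] for hint in SCANNER_PATH_HINTS)
```

Because this is plain string matching, it runs on CPU in a fraction of the time an LLM pass would take, which is why it makes sense as the first filter.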
Stage 5: Adversarial Verification
This is where we get honest about the numbers. 95,440 findings is a raw count. A significant portion of those are false positives. The question is: which ones are real?
Our first pipeline used an LLM refinement step that asked the model to classify each finding. The prompt said "be ruthlessly accurate, remove false positives." The result? It marked 99.8% as false positive. Only 16 findings survived from over 24,000 processed. That was garbage.
We threw it out. Completely redesigned the approach.
Pipeline v2 flips the burden of proof. Instead of asking an AI "is this real?", we ask it to prove each finding is NOT real. The AI acts as a defense attorney for the code. It must build a concrete argument -- citing specific mitigations, upstream validation, or framework protections -- for why each finding is not exploitable. If it can't build that defense, the finding stands as verified.
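The flipped burden of proof can be expressed as a small decision rule. The sketch below is an assumption about how such a rule might be coded, not our verifier's source: `defender` stands in for the model call, and the `kind`/`evidence` fields and the accepted defense kinds are hypothetical names.

```python
# Hypothetical defense kinds that count as a concrete argument.
ACCEPTED_DEFENSES = {
    "mitigation",
    "upstream_validation",
    "framework_protection",
    "not_applicable",
}


def adjudicate(finding, defender):
    """Return a verdict; the finding stands unless a concrete defense is made."""
    defense = defender(finding)  # stand-in for the model acting as defense attorney
    if defense is None:
        return "verified"  # no defense could be built
    if defense.get("kind") in ACCEPTED_DEFENSES and defense.get("evidence"):
        return "disproved"  # a specific, cited defense
    return "inconclusive"  # a defense was offered but not backed by evidence
```

The key design choice is the default: when the defender produces nothing, the verdict is "verified", the opposite of a prompt that defaults to removal.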
The adversarial verifier processed all 95,440 findings in 397 batches. Final results:
- 2,712 verified -- the adversary could not disprove these
- 45,196 disproved -- concrete defenses identified (e.g., all "tenant isolation" findings correctly identified as non-applicable to a desktop browser)
- 5,027 inconclusive -- sent to a tiebreaker pass that resolved 3,381 as false positives, leaving 2,585 requiring further analysis
The final verified count is a fraction of the raw 95,440. That's the point. We'd rather submit 2,712 findings we can prove than 95,000 we can't.
What We Had to Fix
We believe in showing the work, including the parts that didn't work.
The refinement prompt disaster. Our first attempt at LLM-based false positive removal was fundamentally broken. "Be ruthlessly accurate" sounds reasonable. In practice, it tells an LLM to default to "not real" -- which it did, for 99.8% of findings. Lesson learned: how you frame the question determines the answer. "Remove false positives" produces removal. "Prove it's not real" produces analysis.
Pipeline restart vulnerability. During our first complete run, a second process accidentally launched the pipeline, killing the first instance and overwriting 8 hours of logs and checkpoints. Root cause: no PID lockfile, and the log file used truncate mode (>) instead of append (>>). Fixed with a PID lockfile that refuses to start if another instance is running, append-mode logging, and checkpoints at every stage boundary.
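A minimal Python sketch of that fix is shown below; the function names are illustrative and our pipeline's actual code may differ. It relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the lockfile already exists, and on opening the log in append mode. It deliberately does not handle stale lockfiles left behind by a crashed process.

```python
import os


class AlreadyRunning(RuntimeError):
    """Raised when another pipeline instance holds the lockfile."""


def acquire_lock(path):
    """Create the PID lockfile atomically, refusing to start if it exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(f"lockfile {path} exists; is another instance running?")
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


def release_lock(path):
    """Remove the lockfile when the pipeline exits cleanly."""
    os.remove(path)


def append_log(path, message):
    """Append to the log; mode 'a' never truncates existing content."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(message + "\n")
```

Wrapping the pipeline body in `try`/`finally` around `acquire_lock` and `release_lock` ensures the lock is released even when a stage raises.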
JSON response truncation. When we increased our adversarial verifier batch size to 150 findings per batch, the model's responses exceeded its output token limit. The JSON was truncated mid-array -- no closing bracket, parse failure, 100% error rate. Fixed by switching to compact JSON responses (verdict only, no explanation text in the JSON).
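As an illustration of the compact format, here is one possible encoding; the single-letter verdict codes and the `[id, code]` pair layout are hypothetical, not the exact schema we use. The parser returns `None` on a truncated response so the caller can retry the batch instead of crashing.

```python
import json

# Hypothetical one-letter codes for each verdict.
CODES = {"verified": "v", "disproved": "d", "inconclusive": "i"}
LABELS = {code: label for label, code in CODES.items()}


def encode_compact(verdicts):
    """Serialize {finding_id: verdict} as a whitespace-free list of [id, code] pairs."""
    pairs = [[finding_id, CODES[label]] for finding_id, label in verdicts.items()]
    return json.dumps(pairs, separators=(",", ":"))


def parse_compact(text):
    """Parse a compact response; return None if the JSON is truncated or invalid."""
    try:
        pairs = json.loads(text)
    except json.JSONDecodeError:
        return None
    return {finding_id: LABELS[code] for finding_id, code in pairs}
```

With explanations moved out of the JSON, each verdict costs only a handful of tokens, so a large batch fits comfortably under the output limit.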
These aren't embarrassing. We're building something that's never been built -- it's advanced and growing more sophisticated with every iteration. We're going to fall down sometimes, but we get back up, find what we did wrong, and fix it. The pipeline that works is the one that survived its own failures.
Mythos Found 271 Bugs. We Found 8,475 Paths.
Let's talk about the elephant in the room.
Anthropic announced Project Glasswing -- $100 million, 11 tech giants, an unreleased model called Mythos that they say can "surpass all but the most skilled humans" at finding vulnerabilities. BBC covered it. Finance ministers discussed it at the IMF. It's serious work.
Mythos found 271 bugs in Firefox in a single evaluation pass. That's real, validated work -- Mozilla had over 100 engineers fixing them. Credit where it's due.
But let's look at what those 271 actually are. The 271 announced bugs were part of 423 total security fixes in April alone, rolled into three CVEs: CVE-2026-6784 (154 bugs), CVE-2026-6785 (55 bugs), and CVE-2026-6786 (107 bugs). 180 rated high severity, 80 moderate, 11 low. Some of these are genuinely sophisticated -- Mozilla's own blog describes a NaN-over-IPC sandbox escape and a race condition in IndexedDB enabling use-after-free in the parent process. Those are real, hard-to-find vulnerabilities.
But every one of them was found by pointing an agentic harness at a specific file or subsystem. The harness writes a test case, runs it, confirms the bug. That's powerful -- but it's still targeted, per-component analysis. The harness doesn't trace how a medium-severity input handler in module A feeds data into a high-severity eval in module B three directories away.
CrossForge found 8,475 cross-interaction paths in the same codebase. Not bugs in files -- chains between files. Data flows that trace untrusted input through module boundaries, across security domains, into exploitable sinks. The kind of paths that actually get exploited in the wild, because they're the kind that no individual code review -- human or AI -- catches when looking at one file at a time.
We're not claiming 8,475 confirmed vulnerabilities. After adversarial verification and tiebreaker analysis, 2,712 findings survived as verified -- with 2,585 more still inconclusive. But every verified finding comes with something Mythos doesn't provide: the exploit path. Not "this function is vulnerable," but "here's the chain from untrusted input to exploitable output, across these three modules, and here's the proof."
Mythos found bugs -- some of them genuinely sophisticated. CrossForge finds the roads between them that no one is looking for.
What's Next
The adversarial verifier is complete. Surviving findings will go through Hyrex -- our deterministic rules engine that cross-validates findings independently of the AI reviewers. If both our AI reviewers and Hyrex flag the same code location, that's two fundamentally different analysis methods agreeing.
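The agreement check itself is simple to sketch. The snippet below assumes both the AI reviewers and the rules engine emit records with `file` and `line` fields; that schema is an assumption for illustration, and Hyrex's real output format is not described here.

```python
def agreeing_locations(ai_findings, rule_findings):
    """Return the (file, line) locations flagged by both analysis methods."""
    ai_locations = {(f["file"], f["line"]) for f in ai_findings}
    rule_locations = {(f["file"], f["line"]) for f in rule_findings}
    return sorted(ai_locations & rule_locations)
```

Exact line matching is the strictest possible definition of "same code location"; a looser version might match on function or on a small line window.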
After that, we plan to scan Firefox again with a fine-tuned model built specifically for security review -- combining domain-specific training data with the reasoning depth of the 40B. If the 9B found 4 cross-interaction chains and the current scan found 8,475 interaction candidates, a purpose-built model should find significantly more verified chains.
Verified findings will be submitted to Mozilla through responsible disclosure. Every submission will include the cross-module exploit path, PoC test results, severity assessment, and affected file locations.
A note on scale: the 40B is the largest model we can run locally on our current hardware. We haven't officially started reaching out to investors yet. If we had the budget for a pair of Blackwell 6000s -- 192GB of operating VRAM -- we could scan with significantly larger models and likely find even more cross-interaction chains. For now, we're doing this with what we have. And what we have is already finding things that $100 million programs aren't looking for.
Anthropic gave Mythos to 11 companies behind closed doors. We're building the tool that anyone can use -- with cross-interaction analysis that sees what Mythos can't.
The scan is complete. The Firefox findings speak for themselves.