David vs. Goliath: 9B vs 10T on Firefox

May 5, 2026 | 5:00AM CST — Updated


David vs. Goliath: Not a Fight, but Who Can Outperform the Other

In early 2026, Anthropic's Mythos — widely rumoured as a 10-trillion parameter model built for deep code auditing — scanned the massive Mozilla Firefox repository and flagged 271 security issues. The top 3 were serious enough to become published CVEs, patched immediately by Mozilla. The remaining 268 findings remain under responsible disclosure and have not been made public.

We wanted to know: can a 9-billion parameter model, running on a single consumer GPU, find what a model roughly 1,000x its size found?

The Battle of Firefox — a 9B parameter model faces a 10 trillion parameter titan

Our tool, Roasty, is a hostile code review engine built on Qwen 3.5 9B (Q8_1 quantized), running on a single NVIDIA RTX 3090. This is deliberately our smallest model on our weakest GPU: we chose the pairing to create the widest possible gap between the two contenders. If a 9B model can compete with a 10-trillion parameter model, the argument is over before it starts. Roasty dispatches every code chunk through a panel of specialized security reviewers, each one trained to hunt a different class of vulnerability. The scan covers the same Firefox codebase Mythos reviewed.
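The dispatch pattern described above can be sketched roughly as follows. This is an illustration only, not Roasty's implementation: the `Reviewer` and `Finding` types, the reviewer names, the file name, and the toy string checks standing in for model calls are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Finding:
    reviewer: str   # which specialist raised the finding
    file: str
    line: int
    message: str


@dataclass
class Reviewer:
    name: str
    # Hypothetical stand-in for an LLM call: takes (file, code), returns findings.
    check: Callable[[str, str], List[Finding]]


def flag_strcpy(file: str, code: str) -> List[Finding]:
    # Toy heuristic in place of a model: flag every line that calls strcpy.
    return [
        Finding("memory", file, n, "unbounded strcpy")
        for n, text in enumerate(code.splitlines(), start=1)
        if "strcpy(" in text
    ]


def flag_system(file: str, code: str) -> List[Finding]:
    # Toy heuristic in place of a model: flag every line that calls system.
    return [
        Finding("injection", file, n, "shell command built from input")
        for n, text in enumerate(code.splitlines(), start=1)
        if "system(" in text
    ]


def dispatch(file: str, code: str, panel: List[Reviewer]) -> List[Finding]:
    """Send one chunk to every reviewer and pool whatever they report."""
    findings: List[Finding] = []
    for reviewer in panel:
        findings.extend(reviewer.check(file, code))
    return findings


panel = [Reviewer("memory", flag_strcpy), Reviewer("injection", flag_system)]
chunk = "char buf[8];\nstrcpy(buf, input);\nreturn 0;\n"
results = dispatch("example.c", chunk, panel)
for f in results:
    print(f"{f.file}:{f.line} [{f.reviewer}] {f.message}")
```

Each reviewer sees the same chunk independently, so a reviewer that finds nothing simply contributes an empty list to the pool.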

The scan launched on April 24, 2026. As of 10:00 AM CST on May 4, 2026, 444 of 455 chunks have been processed (97% complete). The scan is expected to finish today. Once complete, findings enter a multi-phase post-scan pipeline (adversarial verification, static analysis cross-validation, cross-file deduplication, and proof-of-concept generation) before the final report is available. This process takes additional time beyond the scan itself.

Early Findings by Reviewer

Each reviewer specializes in a different attack surface. Here is what they have flagged so far:

Reviewer	Findings	Chunks

Severity Breakdown

Raw, Pre-Consensus

Severity	Count
Critical	0
High	0
Medium	0
Low	0
Informational	0

That is 0 findings from 0 of the 36 LLM-powered reviewers, plus 235 findings from Hyrex (our custom rules engine), for a running total of 235 findings at 0% completion. The remaining 20 reviewers are also scanning across chunks but have not yet flagged anything in their specialized attack surfaces. They are returning clean results in their lanes, which is expected for a C/C++ browser codebase where certain vulnerability classes (container escape, prompt injection, wallet drain, weight poisoning) are less likely to appear.

These are raw numbers, and they will come down. Two things inflate the count by design:

Duplicates. Multiple reviewers can flag the same underlying issue from different angles. A buffer overflow might get caught by Compliance as a standards violation, by Chaos as a crash vector, and by Exploit as an attack surface — all pointing to the same line of code. That is not three bugs. It is one bug that three specialists independently confirmed. Deduplication collapses these into a single finding with multiple corroborating sources, which actually strengthens the finding rather than inflating the count.
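A minimal sketch of that collapse, assuming findings can be keyed by file and line. The reviewer names come from the paragraph above; the file names, line numbers, and the choice of key are hypothetical, and a real cross-file deduplicator would need a richer notion of "same issue".

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical raw findings: (reviewer, file, line, description).
raw = [
    ("Compliance", "parser.cpp", 120, "standards violation"),
    ("Chaos", "parser.cpp", 120, "crash vector"),
    ("Exploit", "parser.cpp", 120, "attack surface"),
    ("Exploit", "net.cpp", 42, "unchecked length"),
]


def deduplicate(findings) -> Dict[Tuple[str, int], List[str]]:
    """Group findings by (file, line); each value lists the corroborating reviewers."""
    merged: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for reviewer, file, line, _description in findings:
        if reviewer not in merged[(file, line)]:
            merged[(file, line)].append(reviewer)
    return dict(merged)


merged = deduplicate(raw)
for (file, line), reviewers in merged.items():
    print(f"{file}:{line} corroborated by {len(reviewers)} reviewer(s): {', '.join(reviewers)}")
```

Four raw findings become two, and the first carries three corroborating sources instead of counting as three bugs.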

False positives. A reviewer might flag code that looks vulnerable in isolation but is safe in context — guarded by a check upstream, constrained by the caller, or dead code that never executes. These are expected in any automated scan. The verification pipeline exists specifically to eliminate them: an adversarial agent tries to prove the finding wrong, and if it succeeds, the finding is discarded. Only what survives gets counted.

What matters is not the raw total but how many survive verification — and whether any of them overlap with what Mythos found.

What Happens Next

Once the scan completes, each finding goes through a four-phase verification pipeline:

Adversarial verification -- an adversarial agent attempts to disprove each finding. If it succeeds, the finding is discarded.
Static analysis cross-validation -- static analyzers validate from a different angle, independent of the LLM reviewers.
Deduplication -- findings flagged by multiple reviewers or across overlapping chunks are collapsed into single corroborated findings.
Proof-of-concept generation -- surviving findings get a PoC built in a sandboxed environment.

Only findings that pass all four phases count. The goal is not to generate the most findings -- it is to generate findings that are real, reproducible, and defensible.
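The four phases can be pictured as a chain of filters in which a finding must survive every stage. The sketch below is a simplification under stated assumptions: the boolean fields (`disproved`, `static_analyzer_agrees`, `poc_built`) are hypothetical placeholders for work that a real pipeline would actually perform, and the deduplication step here only drops repeats rather than merging corroborating sources.

```python
from typing import Callable, Dict, List

Finding = Dict[str, object]
Phase = Callable[[List[Finding]], List[Finding]]


def adversarial_verification(findings: List[Finding]) -> List[Finding]:
    # Keep only findings the (hypothetical) adversarial agent failed to disprove.
    return [f for f in findings if not f["disproved"]]


def static_cross_validation(findings: List[Finding]) -> List[Finding]:
    # Keep only findings a (hypothetical) static analyzer also supports.
    return [f for f in findings if f["static_analyzer_agrees"]]


def deduplication(findings: List[Finding]) -> List[Finding]:
    # Collapse repeats that point at the same file and line.
    seen, unique = set(), []
    for f in findings:
        key = (f["file"], f["line"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


def poc_generation(findings: List[Finding]) -> List[Finding]:
    # Keep only findings for which a proof of concept could be built.
    return [f for f in findings if f["poc_built"]]


PIPELINE: List[Phase] = [
    adversarial_verification,
    static_cross_validation,
    deduplication,
    poc_generation,
]


def run_pipeline(findings: List[Finding]) -> List[Finding]:
    for phase in PIPELINE:
        findings = phase(findings)
    return findings


raw_findings = [
    {"file": "a.cpp", "line": 10, "disproved": False, "static_analyzer_agrees": True, "poc_built": True},
    {"file": "a.cpp", "line": 10, "disproved": False, "static_analyzer_agrees": True, "poc_built": True},
    {"file": "b.cpp", "line": 5, "disproved": True, "static_analyzer_agrees": True, "poc_built": False},
    {"file": "c.cpp", "line": 77, "disproved": False, "static_analyzer_agrees": False, "poc_built": False},
]
verified = run_pipeline(raw_findings)
print(f"{len(raw_findings)} raw findings, {len(verified)} verified")
```

Because every phase can only remove findings, the verified count can never exceed the raw count.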

Responsible Disclosure

We will publish aggregate statistics — total verified findings, severity breakdown (critical, high, medium, low, informational) — but nothing that reveals specific vulnerabilities or shows anyone how to exploit Firefox. Verified findings are reported to Mozilla under responsible disclosure. If Mozilla determines the issues are valid, they patch them and release them publicly as CVEs on their own timeline. Until then, the details stay between us and Mozilla.

If a 9B model on a $1,500 GPU can independently rediscover even a fraction of what a 10-trillion parameter model found, that changes the conversation about who gets to do serious security research.

If Roasty comes in 30% under Mythos, it still changes the landscape. A 9B model on a single consumer GPU producing even 70% of what a 10-trillion parameter model found means the barrier to entry for serious security auditing just collapsed. No million-dollar compute budget required.

If Roasty outperforms Mythos, then the entire premise that you need frontier-scale models for deep code security falls apart. It means architecture and specialization beat raw parameter count.

Roasty's slingshot?

A tiered memory system that offloads context into Atlas, the first truly deterministic RAG engine in the industry. No probabilistic retrieval. No context loss. That is how a 9B model processes 1.6 billion tokens (roughly 44M tokens each x 36 of 137 agents) in a single scan with pinpoint retrieval accuracy.
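The token figure can be checked with simple arithmetic, reading the parenthetical as about 44 million tokens for each of the 36 active agents. That reading is an interpretation; the text does not spell out the unit.

```python
tokens_per_agent = 44_000_000   # "44M" from the text, read as tokens per agent (assumption)
active_agents = 36              # 36 of the 137 agents

total_tokens = tokens_per_agent * active_agents
print(f"{total_tokens:,} tokens")               # 1,584,000,000 tokens
print(f"{total_tokens / 1e9:.1f} billion")      # 1.6 billion
```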

Why do we call it the first? Because what the industry calls "deterministic RAG" today is not deterministic at all.

Some systems, like AvocadoDB, market themselves as deterministic because the same vector query returns the same nearest neighbors, verified by SHA-256 hash. That is repeatable retrieval — not deterministic retrieval. The underlying mechanism is still cosine similarity against an embedding space. You are still asking "what is closest?" and getting a fuzzy answer. The fact that you get the same fuzzy answer twice and hash it does not make it precise.
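The distinction between repeatable and exact can be seen in a toy example. The sketch below is a generic nearest-neighbour search with a hash over the result list, not AvocadoDB's actual implementation; the document ids and vectors are made up for illustration.

```python
import hashlib
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embedding store (hypothetical ids and vectors).
store = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.7, 0.3, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}


def top_k(query, k=2):
    """Rank every document by similarity to the query and keep the k closest."""
    ranked = sorted(store, key=lambda doc_id: cosine(query, store[doc_id]), reverse=True)
    return ranked[:k]


def result_hash(doc_ids):
    return hashlib.sha256(",".join(doc_ids).encode("utf-8")).hexdigest()


query = [1.0, 0.0, 0.0]
first = top_k(query)
second = top_k(query)

print(first)                                       # ['doc_a', 'doc_b']
print(result_hash(first) == result_hash(second))   # True
```

The hashes match because the ranking is repeatable, yet neither returned document equals the query: the search only reports which stored vectors are closest.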

Others achieve the appearance of determinism through hidden prompt engineering — re-ranking layers, constrained decoding, and system-level instructions that the user never sees. The retrieval is still probabilistic, but the output is forced into a consistent shape by a second layer of LLM processing sitting between the retrieval and the response. It looks deterministic from the outside. Underneath, it is a probabilistic retrieval system with a prompt wrapper bolted on top to mask the inconsistency. The moment the underlying model changes, the prompt tuning changes, or the re-ranker drifts — the "determinism" breaks.

A third variant calls itself "rule-based retrieval" — metadata filters applied on top of a vector store. Documents still get chunked, embedded, and stored in a vector database. The "rules" are just filename and page number filters that narrow which vectors get searched. The core retrieval is still cosine similarity against an embedding space, just scoped to a smaller subset. That is a WHERE clause on a fuzzy query — not determinism. Change the embedding model, re-chunk the documents, or rephrase the query, and you get different results. Filtering which vectors to compare against does not change the fact that you are still asking "what is closest?"
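A small sketch of that "WHERE clause on a fuzzy query" pattern, using made-up file names, page numbers, and two-dimensional vectors. It does not model any particular product; it only shows that the filter narrows the candidates while the final choice is still made by similarity.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical chunk store: metadata plus an embedding per chunk.
chunks = [
    {"file": "manual.pdf", "page": 3, "vec": [0.9, 0.1]},
    {"file": "manual.pdf", "page": 7, "vec": [0.2, 0.8]},
    {"file": "faq.pdf", "page": 1, "vec": [1.0, 0.0]},
]


def filtered_search(query_vec, file):
    # The "rule": a metadata filter that narrows which vectors get compared.
    candidates = [c for c in chunks if c["file"] == file]
    # The retrieval itself is still a nearest-neighbour pick.
    return max(candidates, key=lambda c: cosine(query_vec, c["vec"]))


best = filtered_search([1.0, 0.0], "manual.pdf")
shifted = filtered_search([0.0, 1.0], "manual.pdf")
print(best["page"], shifted["page"])   # 3 7
```

The same filter with a different query vector returns a different page, which is the behaviour the paragraph above describes.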

Atlas does none of these. It executes structured semantic LoreToken queries against a governed index — no embeddings, no vector distance, no hidden prompt layers, no "close enough." When Atlas returns a result, it is the correct result. Not the nearest neighbor. Not the highest-probability match. The result with pinpoint accuracy. That distinction is the difference between a search engine and a database — and it is why a 9B model can process 1.6 billion tokens without losing a single finding to retrieval drift.
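The search engine versus database contrast can be illustrated with a plain exact-key lookup. This is not Atlas and not LoreToken query syntax, neither of which is described here; the index contents and key shape are hypothetical and serve only to show what "exact match or nothing" looks like.

```python
from typing import Optional

# Hypothetical structured index: each exact key maps to exactly one record.
index = {
    ("parser.cpp", "ParseHeader"): "bounds check missing on header length",
    ("net.cpp", "ReadPacket"): "length field trusted without validation",
}


def exact_lookup(file: str, symbol: str) -> Optional[str]:
    """Return the record stored under this exact key, or None. Never a 'closest' record."""
    return index.get((file, symbol))


print(exact_lookup("parser.cpp", "ParseHeader"))   # bounds check missing on header length
print(exact_lookup("parser.cpp", "ParseHeadr"))    # None
```

A misspelled key returns nothing rather than the nearest entry, which is the trade-off an exact lookup makes in exchange for never returning an approximate answer.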

From Mozilla Themselves

A user claiming to be a Mozilla employee posted the following on Reddit:

"Hi, Mozilla employee here...For bugs found internally, Mozilla doesn't issue one CVE per bug but instead internally found bugs go into so called 'roll-up' advisories with a link to the bug list covered. For this effort specifically, all of the Mythos bugs were found internally and are part of the following three roll-up advisories:

The number of actual bugs can be seen through the amount of bug ids in Bugzilla link that is part of each advisory. Hope this helps!"

— Beginning-Reach3215 on r/singularity

This is significant context. The three CVEs are not three isolated bugs -- they are roll-up advisories, each covering multiple Bugzilla entries. The actual number of individual bugs patched under those three CVEs is higher than three. This aligns with the 271 total findings Mythos flagged, and clarifies how Mozilla's internal disclosure process bundles AI-discovered issues into consolidated advisories rather than issuing one CVE per finding.

Mozilla also confirmed the numbers directly. In a blog post by Bobby Holley (April 21, 2026), Mozilla stated that Firefox 150 includes "fixes for 271 vulnerabilities identified during this initial evaluation" of Mythos Preview, plus 22 additional bugs found earlier by Opus 4.6 and fixed in Firefox 148.

Those 271 individual findings were bundled into 42 CVEs through Mozilla's roll-up advisory process. The advisory (mfsa2026-30) rates the CVEs -- not the individual findings -- at these severity levels:

CVE Severity	Count
Critical	0
High	0
Moderate	19
Low	0

271 findings → 42 roll-up CVEs. Zero rated critical. Source: mfsa2026-30

Independent Research: "The Jagged Frontier"

On April 7, 2026, Stanislav Fort — Chief Scientist at AISLE, former Google DeepMind researcher and former Anthropic alignment team member — published "AI Cybersecurity After Mythos: The Jagged Frontier", a paper that directly challenges the premise that frontier-scale models are required for serious security work.

Fort tested eight different models — including one with only 3.6 billion active parameters — against the same vulnerabilities Mythos flagged. All eight detected the flagship FreeBSD NFS exploit. A 5.1 billion parameter open model recovered the full analysis chain of a 27-year-old OpenBSD bug requiring mathematical reasoning about signed integer overflow. His conclusion: "The moat is the system, not the model."

AISLE has independently discovered over 180 CVEs across 30+ projects using model-agnostic approaches, including 12 of 12 OpenSSL zero-days in a single release — some hiding for decades. Their track record is the strongest independent evidence that architecture and orchestration matter more than parameter count.

Fort's paper acknowledged one gap: his tests used pre-scoped code with contextual hints, not full-repo unsupervised scans. Our Firefox scan is that test — a 9B model scanning the full repository with no pre-scoping, no hints, and no hand-holding. Two independent teams, different approaches, same thesis.