David vs. Goliath: Not a Fight, but a Contest of Who Outperforms Whom
In early 2026, Anthropic's Mythos, a 10-trillion parameter model built for deep code auditing, scanned the massive Mozilla Firefox repository and flagged 271 security issues. The top three were serious enough to become published CVEs, which Mozilla patched immediately. The other 268 are still under responsible disclosure and have not been made public.
We wanted to know: can a 9-billion parameter model, running on a single consumer GPU, find what a model 1,000x its size found?
Our tool, Roasty, is a hostile code review engine built on Qwen 3.5 9B (Q8_1 quantized), running on a single NVIDIA RTX 3090. It dispatches every code chunk through a panel of specialized security reviewers, each trained to hunt a different class of vulnerability, and it scans the same Firefox codebase Mythos reviewed. This is deliberately our smallest and weakest model; we chose it to create the widest possible gap between the two. If a 9B model can compete with a 10-trillion parameter model, the argument is over before it starts.
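Roasty's internals are not published in this post, so here is a minimal sketch of what that fan-out could look like. The reviewer names come from the findings list below; `run_local_model`, the prompt wording, and the `Finding` shape are illustrative assumptions, not Roasty's actual API.

```python
from dataclasses import dataclass

# Reviewer panel, named after the reviewers reported in the findings below.
REVIEWERS = [
    "Compliance", "Supply", "Chaos", "Trace", "Pedant", "Lockdown",
    "Provenance", "Fuse", "Razor", "Entropy", "Exploit", "Mirage",
]

@dataclass
class Finding:
    reviewer: str
    chunk_id: int
    detail: str

def run_local_model(prompt: str) -> list[str]:
    """Placeholder for a call into the local Qwen 3.5 9B instance;
    a real implementation would parse findings out of the response."""
    return []

def review_chunk(chunk_id: int, code: str) -> list[Finding]:
    """Fan one chunk out to every reviewer persona, collect all findings."""
    findings: list[Finding] = []
    for reviewer in REVIEWERS:
        prompt = f"You are {reviewer}, a hostile security reviewer. Audit:\n{code}"
        findings += [Finding(reviewer, chunk_id, d) for d in run_local_model(prompt)]
    return findings
```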
The scan launched on April 24, 2026 and will not complete until approximately May 3rd. But here is where things stand as of 1:00 PM CST on April 28, 2026, with 196 of 455 chunks processed (43% complete).
Early Findings by Reviewer
Each reviewer specializes in a different attack surface. Here is what they have flagged so far:
- Compliance: 50 findings across 9 chunks (most active reviewer by far)
- Supply: 31 findings across 6 chunks
- Chaos: 14 findings across 2 chunks
- Trace: 11 findings across 3 chunks
- Pedant: 8 findings across 4 chunks
- Lockdown: 7 findings across 2 chunks
- Provenance: 6 findings across 2 chunks
- Fuse: 6 findings in 1 chunk
- Razor: 5 findings in 1 chunk
- Entropy: 2 findings in 1 chunk
- Exploit: 1 finding in 1 chunk
- Mirage: 1 finding in 1 chunk
That is 142 findings from LLM-powered reviewers, plus 235 Hyrex findings (our custom rules engine), for a running total of 377 findings at 43% completion.
These are raw numbers, and they will come down. Two things inflate the count by design:
Duplicates. Multiple reviewers can flag the same underlying issue from different angles. A buffer overflow might get caught by Compliance as a standards violation, by Chaos as a crash vector, and by Exploit as an attack surface — all pointing to the same line of code. That is not three bugs. It is one bug that three specialists independently confirmed. Deduplication collapses these into a single finding with multiple corroborating sources, which actually strengthens the finding rather than inflating the count.
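A minimal sketch of that collapse, assuming each raw finding carries a (file, line) location it can be keyed on; the field names and the example file are invented for illustration:

```python
from collections import defaultdict

def deduplicate(findings: list[dict]) -> list[dict]:
    """Collapse findings that point at the same location into one record
    that keeps every corroborating reviewer as a source."""
    by_location = defaultdict(list)
    for f in findings:
        by_location[(f["file"], f["line"])].append(f)
    return [
        {
            "file": path,
            "line": line,
            "detail": group[0]["detail"],
            # one bug, independently confirmed by several specialists
            "sources": sorted({f["reviewer"] for f in group}),
        }
        for (path, line), group in by_location.items()
    ]

raw = [
    {"file": "nsBuffer.cpp", "line": 88, "reviewer": "Compliance", "detail": "overflow"},
    {"file": "nsBuffer.cpp", "line": 88, "reviewer": "Chaos", "detail": "overflow"},
    {"file": "nsBuffer.cpp", "line": 88, "reviewer": "Exploit", "detail": "overflow"},
]
print(deduplicate(raw))  # one finding, three corroborating sources
```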
False positives. A reviewer might flag code that looks vulnerable in isolation but is safe in context — guarded by a check upstream, constrained by the caller, or dead code that never executes. These are expected in any automated scan. The verification pipeline exists specifically to eliminate them: an adversarial agent tries to prove the finding wrong, and if it succeeds, the finding is discarded. Only what survives gets counted.
What matters is not the raw total but how many survive verification — and whether any of them overlap with what Mythos found.
What Happens Next
Once the scan completes, each finding goes through a three-phase verification pipeline:
- An adversarial verifier agent attempts to disprove each finding
- Static analyzers cross-validate from a different angle
- Surviving findings get a proof-of-concept built in a sandboxed environment
Only findings that pass all three phases count. The goal is not to generate the most findings — it is to generate findings that are real, reproducible, and defensible.
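In code terms, the gate is a conjunction of three predicates applied in order. The stand-in functions below always pass, which real phases mostly will not; this is a sketch of the control flow, not of the verifiers themselves.

```python
def adversarial_verifier(finding: dict) -> bool:
    # A real agent hunts for a counterexample (an upstream guard, a
    # constrained caller, dead code) and returns False if it finds one.
    return True

def static_cross_check(finding: dict) -> bool:
    # Cross-validation from a different angle, e.g. a static analyzer run.
    return True

def sandboxed_poc(finding: dict) -> bool:
    # Build a proof of concept in a sandbox; only a working PoC passes.
    return True

def verify(findings: list[dict]) -> list[dict]:
    """Only findings that survive all three phases, in order, count."""
    phases = (adversarial_verifier, static_cross_check, sandboxed_poc)
    return [f for f in findings if all(phase(f) for phase in phases)]
```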
Responsible Disclosure
We will publish aggregate statistics — total verified findings, severity breakdown (critical, high, medium, low, informational) — but nothing that reveals specific vulnerabilities or shows anyone how to exploit Firefox. Verified findings are reported to Mozilla under responsible disclosure. If Mozilla determines the issues are valid, they patch them and release them publicly as CVEs on their own timeline. Until then, the details stay between us and Mozilla.
If a 9B model on a $1,500 GPU can independently rediscover even a fraction of what a 10-trillion parameter model found, that changes the conversation about who gets to do serious security research.
If Roasty comes in 30% under Mythos, it still changes the landscape. A 9B model on a single consumer GPU producing even 70% of what a 10-trillion parameter model found means the barrier to entry for serious security auditing just collapsed. No million-dollar compute budget required.
If Roasty outperforms Mythos, then the entire premise that you need frontier-scale models for deep code security falls apart. It means architecture and specialization beat raw parameter count.
The Architecture Behind Roasty
A tiered memory system spreads context across hot, warm, and cold layers and offloads it into Atlas, the first deterministic RAG engine in the industry. No probabilistic retrieval. No context loss. That is how a 9B model processes 1.6 billion tokens (44M tokens x 36 agents) in a single scan with pinpoint retrieval accuracy.
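The post does not spell out the tiering mechanics, so the sketch below is one plausible reading: hot means in-process memory, warm means a larger host-RAM layer, cold means disk, and recently touched context gets promoted upward. The promotion and eviction policy here is an invented illustration, not Atlas's.

```python
import shelve

class TieredStore:
    """Naive hot/warm/cold store: bounded hot tier, host-RAM warm tier,
    on-disk cold tier, promote-on-access."""

    def __init__(self, hot_capacity: int, cold_path: str = "atlas_cold.db"):
        self.hot: dict[str, str] = {}        # fastest tier, strictly bounded
        self.warm: dict[str, str] = {}       # larger, still in memory
        self.cold = shelve.open(cold_path)   # spillover on disk
        self.hot_capacity = hot_capacity

    def get(self, key: str) -> str | None:
        if key in self.hot:
            return self.hot[key]
        for tier in (self.warm, self.cold):
            if key in tier:
                value = tier[key]
                del tier[key]                # drop the stale lower copy
                self.put(key, value)         # recently used context moves up
                return value
        return None

    def put(self, key: str, value: str) -> None:
        if len(self.hot) >= self.hot_capacity:
            demoted_key, demoted_value = self.hot.popitem()  # make room
            self.warm[demoted_key] = demoted_value
        self.hot[key] = value
```

The tiering explains the capacity story; the determinism claim is a separate property of the index itself.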
Why do we call it the first? Because what the industry calls "deterministic RAG" today is not deterministic at all.
Some systems, like AvocadoDB, market themselves as deterministic because the same vector query returns the same nearest neighbors, verified by SHA-256 hash. That is repeatable retrieval — not deterministic retrieval. The underlying mechanism is still cosine similarity against an embedding space. You are still asking "what is closest?" and getting a fuzzy answer. The fact that you get the same fuzzy answer twice and hash it does not make it precise.
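A toy illustration of that difference, with no real vector database involved: the same query hashes to the same answer every run, yet the answer is still only "whatever is closest", and a query near the boundary flips it.

```python
import hashlib
import math

corpus = {"doc_a": (1.0, 0.1), "doc_b": (0.1, 1.0)}

def cosine(a: tuple, b: tuple) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top1(query: tuple) -> str:
    # Nearest neighbor by cosine similarity: a fuzzy "what is closest?"
    return max(corpus, key=lambda name: cosine(query, corpus[name]))

h1 = hashlib.sha256(top1((0.9, 0.2)).encode()).hexdigest()
h2 = hashlib.sha256(top1((0.9, 0.2)).encode()).hexdigest()
assert h1 == h2                  # identical hashes every run: "repeatable"

print(top1((0.9, 0.2)))          # doc_a
print(top1((0.55, 0.56)))        # doc_b: near the midpoint the winner flips,
                                 # because it was never more than a guess
```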
Others achieve the appearance of determinism through hidden prompt engineering — re-ranking layers, constrained decoding, and system-level instructions that the user never sees. The retrieval is still probabilistic, but the output is forced into a consistent shape by a second layer of LLM processing sitting between the retrieval and the response. It looks deterministic from the outside. Underneath, it is a probabilistic retrieval system with a prompt wrapper bolted on top to mask the inconsistency. The moment the underlying model changes, the prompt tuning changes, or the re-ranker drifts — the "determinism" breaks.
Atlas does neither. It executes structured semantic LoreToken queries against a governed index using three-lane scoring — no embeddings, no vector distance, no hidden prompt layers, no "close enough." When Atlas returns a result, it is the result. Not the nearest neighbor. Not the highest-probability match. The result. That distinction is the difference between a search engine and a database — and it is why a 9B model can process 1.6 billion tokens without losing a single finding to retrieval drift.
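Since LoreToken structure and the three scoring lanes are Atlas internals this post does not document, the sketch below captures only the property being claimed: a structured query against a governed index returns exactly the keyed record or fails loudly, with no distance metric anywhere. The token fields are invented for illustration.

```python
class MissError(KeyError):
    """A structured query that matches nothing fails loudly
    instead of degrading to a 'closest' answer."""

class GovernedIndex:
    def __init__(self) -> None:
        self._index: dict[tuple[str, str, str], str] = {}

    def insert(self, domain: str, subject: str, facet: str, record: str) -> None:
        self._index[(domain, subject, facet)] = record

    def query(self, domain: str, subject: str, facet: str) -> str:
        key = (domain, subject, facet)
        if key not in self._index:
            raise MissError(key)    # never "close enough"
        return self._index[key]     # the result, not the nearest neighbor

idx = GovernedIndex()
idx.insert("firefox", "nsBuffer.cpp:88", "finding", "heap overflow, corroborated x3")
print(idx.query("firefox", "nsBuffer.cpp:88", "finding"))  # exact hit or exception
```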