975,394 Tokens. Every Cross-Reference. One Local Model.

June 7, 2026 | Research

Nobody Fully Understands a 3,000-Page Congressional Bill.
Not Human, Not Even AI. That Is About to Change.

975,394 tokens cross-referenced by Atlas

We scanned the entire National Defense Authorization Act for Fiscal Year 2026 with a 14B model on consumer hardware. Over 3,000 pages. 975,394 tokens. Every section cross-referenced against every other section -- without context loss. The downloadable results are at the bottom of this article. No foundation model on earth can do this in a single pass.

3,000+ pages (1,260 condensed)
975,394 tokens processed
336 sections analyzed
14B model parameters

Why no AI model can do this in a single pass

The biggest models on the planet advertise context windows of up to 1M tokens. This bill is just under 1M tokens. So in theory, a top-tier model could fit it. In practice, it cannot do what we did. That includes the newly released Claude Mythos 5 / Fable 5.

Context windows have a dirty secret called "lost in the middle." Models pay strong attention to the beginning of their context and the end. Everything in between degrades. Researchers have documented this extensively. A provision on page 30 and a contradicting clause on page 1,200 are so far apart in the context that attention mechanisms fail to connect them reliably. You can fit a million tokens in the window and still miss the interactions between them.

Fitting the text is not the same as understanding it.

Why chunking fails

The standard workaround is chunking -- break the document into pieces, analyze each piece separately. Every AI document tool does this. The problem is obvious: once you chunk, each piece becomes an island. The model processes chunk 1, forgets it, moves to chunk 2.

A provision buried on page 47 that quietly guts a protection established on page 1,147? Those chunks never meet. No model connects them. The very structure that makes a 3,000-page bill effective at hiding things is the same structure that defeats every chunking approach.
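A minimal sketch of that separation, assuming fixed-size chunks measured in pages. The 50-page chunk size and the helper name chunk_index are illustrative assumptions; real tools usually split by tokens, but the effect is the same.

```python
def chunk_index(page: int, pages_per_chunk: int) -> int:
    """Return the zero-based index of the chunk that contains a given page."""
    return (page - 1) // pages_per_chunk


# Hypothetical chunk size, chosen only for illustration.
PAGES_PER_CHUNK = 50

amending_page = 47       # provision that alters a protection
protection_page = 1147   # page where the protection is established

a = chunk_index(amending_page, PAGES_PER_CHUNK)
b = chunk_index(protection_page, PAGES_PER_CHUNK)

print(f"amending provision -> chunk {a}")
print(f"protection         -> chunk {b}")
print("analyzed together:", a == b)
```

Under these assumptions the two provisions land in different chunks, so a model that analyzes one chunk at a time never sees them side by side.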

Why RAG does not solve this

Retrieval-Augmented Generation embeds chunks into vectors and runs similarity search to pull "relevant" context. It is probabilistic -- it guesses which chunks might be related based on how similar they look in embedding space.

Legal language is adversarial to this approach. "Section 1043(b)(2)" and the paragraph it amends share zero semantic similarity in vector space. They use completely different terminology. They describe different things. The only connection between them is a section number reference -- and embedding similarity will never find it. RAG will not retrieve one when processing the other.
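The mismatch can be illustrated with a toy stand-in for embedding similarity: cosine similarity over word counts. Real embedding models behave differently from word counts, and all three sentences below are invented for illustration, so treat this as a sketch of the failure mode rather than a measurement.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example text (not taken from the NDAA).
citing = "Section 1043(b)(2) is amended by striking 'shall' and inserting 'may'."
target = "The Secretary shall submit an annual report on depot maintenance backlogs."
distractor = "Section 2201 is amended by striking 'March 1' and inserting 'June 1'."

sim_target = cosine(bag_of_words(citing), bag_of_words(target))
sim_distractor = cosine(bag_of_words(citing), bag_of_words(distractor))

print(f"citing vs. text it actually amends: {sim_target:.2f}")
print(f"citing vs. unrelated amendment:     {sim_distractor:.2f}")
```

In this toy setup the unrelated amendment scores higher than the paragraph actually being amended, because the two amendments share boilerplate wording while the amended paragraph does not.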

How Atlas solves this

Atlas is not RAG. It is a deterministic retrieval engine.

Every chunk of the bill gets ingested into Atlas as it is parsed. When chunk 280 references "Section 1043," Atlas performs an exact lookup and returns the actual text of Section 1043 from wherever it lives in the document. No embedding similarity. No probabilistic guessing. The referenced text is retrieved and handed to the model alongside the current chunk. Every time. With zero degradation regardless of how far apart the sections are in the original document.
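Lookup by reference can be sketched in a few lines. This is not Atlas's implementation, which the article does not publish; the section texts, the regular expression, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical mini-bill: section number -> section text (invented content).
SECTIONS = {
    "101": "Funds are authorized subject to the limitation in Section 1043.",
    "1043": "Not more than 75 percent of the funds may be obligated "
            "until the report is submitted.",
    "2201": "This title may be cited as the Military Construction Authorization Act.",
}

# Matches "Section 1043" and the leading number of "Section 1043(b)(2)".
SECTION_REF = re.compile(r"\b[Ss]ection\s+(\d+)")


def referenced_sections(text: str) -> list[str]:
    """Return cited section numbers in order of first appearance, without duplicates."""
    seen: list[str] = []
    for number in SECTION_REF.findall(text):
        if number not in seen:
            seen.append(number)
    return seen


def build_context(section_id: str, sections: dict[str, str]) -> str:
    """Join a section with the exact text of every section it cites."""
    parts = [f"[Section {section_id}]\n{sections[section_id]}"]
    for ref in referenced_sections(sections[section_id]):
        if ref in sections and ref != section_id:
            parts.append(f"[Referenced: Section {ref}]\n{sections[ref]}")
    return "\n\n".join(parts)


print(build_context("101", SECTIONS))
```

The lookup is a dictionary access keyed on the section number, so the same input always yields the same context, and the distance between the two sections in the original document plays no role.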

The model doing the analysis is Qwen2.5-14B running locally on consumer hardware with a 64K context window. It sees roughly 6.5% of the bill at a time. But it does not need to see more. Atlas provides the cross-references on demand. The model analyzes what is in front of it. Atlas tells it what is connected to it.
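The share of the bill visible at once follows from simple arithmetic. The sketch below reads "64K" as 64,000 tokens, which is an assumption; reading it as 65,536 gives a slightly larger share. It also assumes the whole window holds bill text, whereas in practice the prompt and retrieved references take up part of it.

```python
import math

BILL_TOKENS = 975_394        # token count reported for the scan
WINDOW_TOKENS = 64 * 1000    # "64K" read as 64,000 tokens (assumption)

# Fraction of the bill that fits in one full window.
share = WINDOW_TOKENS / BILL_TOKENS

# Minimum number of non-overlapping windows needed to cover the whole bill.
min_windows = math.ceil(BILL_TOKENS / WINDOW_TOKENS)

print(f"share of bill per window: {share:.1%}")
print(f"minimum windows to cover the bill: {min_windows}")
```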

Together, Atlas and a 14B model on consumer hardware cover nearly a million tokens with cross-referencing that even the most expensive models with the largest context windows cannot match -- because retrieval beats attention at scale.

Why the NDAA is 3,000 pages

The NDAA is not long because defense policy is complicated. It is long because length is the strategy.

Provisions are buried. Funding is authorized in one section and the constraints on that funding appear 800 pages later. Exemptions reference subsections of amendments to previous years' acts. Corporate subsidies are wrapped in national security language. Spending caps are established early and then quietly raised in later sections that few people read.

The bill is structurally designed so that no single reader -- human or AI -- can hold the full picture at once. That is not a limitation of the reader. That is the point.

Atlas does not hold the full picture at once either. It does not need to. It holds every piece, knows where every piece is, and retrieves exactly the right pieces for whatever is being analyzed right now. Deterministic. Auditable. Same query, same data, same answer, every time.

How a bill actually works (and why nobody reads them)

Most people assume a bill reads like a document -- that if you want to understand a provision, you read the section that describes it. That is not how bills work.

A single provision in the NDAA might be spread across five or more completely separate parts of the bill. The description of what the program does is in one section. The authorization of funding is in a different title, sometimes hundreds of pages later. The constraints on how that money can be spent are in yet another section. Who is eligible, who is excluded, and who oversees it are scattered across different subsections that reference amendments to previous years' acts. The expiration date or sunset clause might be buried in a miscellaneous provisions section at the very end.

To understand one thing the bill does, you need to find and connect all of those pieces. Multiply that by hundreds of provisions, and you begin to see why nobody -- not congressional staffers, not lobbyists, not journalists -- reads the whole bill. They read the parts they already know to look for and miss everything else. That is the design.

Our scan reassembles the full picture. For every provision, Atlas traces the cross-references across the entire bill and brings them together so the analysis includes not just what a section says, but what it actually does when connected to its funding, constraints, exemptions, and interactions with other provisions.
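Tracing a provision through its funding, constraints, and sunset clause amounts to following citations transitively. A minimal sketch, assuming the references have already been extracted into a graph; the section numbers and the function name trace are hypothetical, and this is not presented as Atlas's own algorithm.

```python
from collections import deque

# Hypothetical reference graph: section -> sections it cites directly.
REFERENCES = {
    "301": ["4301"],    # program description cites its funding table
    "4301": ["1043"],   # funding table cites a spending limitation
    "1043": ["5902"],   # limitation cites a sunset clause
    "5902": [],
    "2201": [],         # unrelated section
}


def trace(start: str, references: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk of every section reachable from start via citations."""
    order: list[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        order.append(current)
        for ref in references.get(current, []):
            if ref not in seen:   # guards against circular references
                seen.add(ref)
                queue.append(ref)
    return order


print(trace("301", REFERENCES))
```

Starting from the program description, the walk collects the funding table, the limitation, and the sunset clause, and leaves the unrelated section out.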

What the scan produces

A section-by-section breakdown of the entire bill in plain English. What each provision actually does. Who benefits. Who pays. How provisions scattered across 3,000+ pages interact with each other. Every finding is color-coded by impact on the average citizen:

Green -- directly helps citizens
Red -- directly hurts citizens
Yellow -- mixed or uncertain
Blue -- worth knowing
Gray -- procedural

These are cross-references that a human analyst would need weeks to map, and that no existing AI tool can make, because the tool either cannot fit the document or cannot connect the pieces once it chunks them.

This is not the first time

Before the NDAA, we used Atlas to scan the entire Mozilla Firefox source repository -- 44 million tokens of C, C++, JavaScript, and Rust -- with a 9B model on a single GPU. Atlas gave that 9B model total recall across the entire codebase. It found 72 confirmed vulnerabilities, including 4 multi-step exploit chains that spanned multiple files and directories. Cross-file interactions that no model could see within a single context window.

The congressional scanner is the same architecture applied to a different problem. Code security and legislative analysis have nothing in common on the surface. But the underlying challenge is identical: a document too large for any context window, where the important interactions happen between pieces that are far apart. Atlas does not care what the content is. It retrieves by reference, not by topic.

The numbers

975,394 tokens. One 14B model. One retrieval engine. Consumer GPUs. No cloud. No API costs. No context window large enough to matter -- and it did not need one.

The model did not need to be bigger. It needed better infrastructure around it.

What this changes

Until now, nobody could read a 3,000-page bill and trace every cross-reference. Not a citizen. Not a politician. Not a team of congressional staffers. Not a law firm billing $800 an hour. The bills were unreadable by design, and everyone just accepted that.

This changes that. A citizen can upload the bill their representative voted on and find out what it actually does -- all of it, not just the press release version. A politician can upload the bill they are being asked to vote on and see what is buried on page 1,100 before they sign on page 1. Deterministic cross-referencing across every section, on every page, in plain English.

The arms race that does not work

Congress is already drafting legislation that will regulate AI, reshape technology policy, and affect every company and citizen in this space. Those bills will not just be as long and complex as what came before -- they will be worse. Legislators will use AI to help write them. AI-assisted legislation will be longer, more intricately cross-referenced, and more deliberately layered than anything a human team could produce alone. GPT-35, Claude 20, whatever comes next -- the drafting tools will keep getting more powerful, and the bills they produce will keep getting harder for humans to read.

That does not help the drafters. Atlas retrieves by reference. The more cross-references they add, the more Atlas traces. A bill drafted by GPT-35 is just more input. The complexity of the drafting tool does not outpace the scanning tool -- they scale together, except one side is public.

That capability did not exist before. Not from any lab, at any price, at any scale.

A note to investors

If you read this and do not understand what you are looking at, that is fine. Keep walking. Let another one pass you by.

Atlas is not a research project. It is in production right now. On ShipItClean.com, Atlas powers a codebase security scanner that ingests entire GitHub repositories and ZIP archives of any size -- millions of tokens of source code -- and cross-references every file against every other file to find vulnerabilities, exploit chains, and logic errors that span multiple files and directories. That scanner uses API-hosted models. The NDAA scan above used a local 14B model. Atlas does not care what model sits behind it or where that model runs. It gives any model the same capability: unbounded, deterministic, cross-referenced memory.

We are building a separate document scanning service on top of Atlas. It already handles:

Congressional bills -- full cross-referenced analysis from any perspective (citizen, business, politician, minority, government)
Contracts -- hidden clauses, liability traps, termination gotchas, party-perspective analysis
Insurance policies -- coverage gaps, sublimits, deductible traps, exclusion mapping
Tax returns -- missed deductions, audit red flags, filing consistency checks
Patent applications -- claim breadth, prior art exposure, prosecution vulnerabilities
Legal discovery -- cross-referenced depositions, interrogatories, contradiction surfacing
Regulatory filings -- SEC, FDA, EPA, FCC compliance gap analysis
Academic papers -- methodology rigor, statistical validity, reproducibility assessment
Religious texts -- multi-perspective analysis across scripture (scholar, believer, skeptic, historian, ethicist)
Resumes, credit reports, leases, medical bills, terms of service, HOA bylaws -- consumer document scanning
Traffic tickets -- jurisdiction-specific dismissal strategies and procedural defect analysis
HR professional tools -- employment contract scanning, employee handbook analysis, job description matching, bulk resume processing, talent pool management

Every scanner uses Atlas. Every scanner cross-references deterministically. The same engine that traced 975,394 tokens of legislative text traces insurance exclusions across a 200-page policy, contradictions across depositions in a legal case, and vulnerability chains across a million-line codebase. The domain changes. The architecture does not.

If you understand what a deterministic retrieval engine that gives any model unbounded cross-referenced memory means -- for legal, for compliance, for intelligence, for legislation, for code security, for any domain where documents are too large and too interconnected for any context window -- then my signature below is a direct line. This is worth a conversation.

📄 Download the full NDAA FY2026 scan results (DOCX)

Apollo Raines | ShipItClean.com | SAIQL.ai
