What If the Model Was Built to Think in Compressed Meaning?
This is a theoretical architecture paper, not a product announcement. We have not built this model. We have not tested it. What we have is a compression format -- LoreTokens -- that achieves 30-80x semantic compression in production, and a question that won't leave us alone: what happens if you train a model from scratch to reason directly over compressed meaning instead of natural language?
The Problem Everyone Is Solving the Wrong Way
Every major AI lab is racing to build larger context windows. Claude is at 1M tokens. Gemini claims 2M. The assumption is that bigger windows mean better understanding, and the race is to make them as large as possible.
The problem is the math. Standard self-attention, the mechanism at the heart of every transformer, costs O(n^2) in sequence length -- every token attends to every other token. Double the context length, quadruple the attention compute. A 500B token context window would require more memory than exists in any single system on Earth, and more attention operations per layer than could be computed in any reasonable time.
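A minimal sketch of that scaling, counting query-key pairs per head per layer. The 2-byte score size and the helper names are our assumptions for illustration; fused attention kernels avoid storing the full score matrix, so the memory figure is an upper bound on a naive implementation rather than a measurement of any real system.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored per head per layer under full self-attention."""
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold one full score matrix (assumes 2-byte scores)."""
    return attention_pairs(n_tokens) * bytes_per_score


for n in (4_000, 8_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_pairs(n):.3e} pairs, "
          f"{score_matrix_bytes(n) / 1e9:.3f} GB per head per layer")
```

Going from 4,000 to 8,000 tokens doubles the length and multiplies the pair count by four, which is the whole problem in one line.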
The industry response has been engineering workarounds: sparse attention, sliding windows, linear attention approximations, ring attention across GPU clusters. These work. They've taken us from 4K to 2M tokens in three years. But each technique trades something -- precision, recall quality, or the ability to attend equally to distant tokens. And they're all approaching the same wall from the same direction: make the window bigger.
We're proposing a different direction: make the content smaller.
The Architecture: Dual-Context LoreToken Transformer
The idea is structurally simple. The implications are not.
The flow: the user speaks English. A helper model compiles that input to LoreTokens. The main model reasons over the LoreTokens and responds in the same format. The helper decompresses the response to English. The user never sees the compressed format. The main model never sees the verbose original.
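The flow can be sketched as three functions composed in order. Everything here is hypothetical: the `Compile`, `Reason`, and `Decompress` interfaces are our own naming, and the toy stand-ins perform no real compression or reasoning. They exist only so the sketch runs.

```python
from typing import Callable

# Hypothetical interfaces: each stage is a function from text to text.
Compile = Callable[[str], str]      # English -> compressed form
Reason = Callable[[str], str]       # compressed form -> compressed reply
Decompress = Callable[[str], str]   # compressed reply -> English


def answer(user_text: str, compile_: Compile, reason: Reason,
           decompress: Decompress) -> str:
    compressed_in = compile_(user_text)     # helper: main model never sees the original
    compressed_out = reason(compressed_in)  # main model: works only on the compressed form
    return decompress(compressed_out)       # helper: user never sees the compressed format


# Toy stand-ins, NOT real LoreToken compression or model inference.
def toy_compile(text: str) -> str:
    return "|".join(word.upper() for word in text.split())


def toy_reason(compressed: str) -> str:
    return compressed + "|ACK"


def toy_decompress(compressed: str) -> str:
    return " ".join(part.lower() for part in compressed.split("|"))


print(answer("hello there", toy_compile, toy_reason, toy_decompress))
```

Keeping the stages as plain callables means the helper could be a small model or a deterministic script without the caller changing.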
Why This Is Not RAG
This sounds superficially like retrieval-augmented generation -- an external system feeding context into a model. It is not. The distinction matters.
RAG retrieves chunks of natural language and injects them into a standard context window. The model processes them the same way it processes any other text. The retrieval is external, but the reasoning happens on uncompressed natural language inside a normal attention mechanism.
This architecture is different in a fundamental way: the model is trained to think in compressed semantic representations. LoreTokens are not retrieved chunks pasted into the window. They are the native input format. The model's tokenizer is built for them. The embedding layer maps them. The attention patterns learned during training are shaped around their structure.
The difference is the difference between a bilingual person reading a translation and a native speaker reading their own language. The information content is the same. The processing efficiency is not.
The Math That Makes It Interesting
LoreTokens achieve 30-80x compression on semantic content in production. We have measured this across hundreds of records in our Atlas memory system. A fact that takes 200 tokens to express in English takes 3-6 tokens in LoreToken format, a ratio of roughly 33-67x -- with no loss of meaning that matters for reasoning.
Apply that compression to a context window:
- A 1M token window with 50x average compression holds the semantic equivalent of 50M tokens of natural language
- A 128K window -- cheap, fast, runs on consumer hardware -- holds the equivalent of 6.4M tokens
- A 32K window on a tiny model holds the equivalent of 1.6M tokens -- more than Claude's full context window
And because the attention mechanism operates on the compressed representation, the O(n^2) cost is computed on the smaller token count. A 1M-equivalent context at 50x compression occupies 20K tokens, so it costs the same attention compute as a 20K natural-language context: roughly 2,500x fewer attention operations than the uncompressed 1M. The model sees more while computing less.
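The arithmetic above, as a short check. The function names are ours, "128K" is taken as 128,000 tokens to match the figures in the text, and the cost ratio covers only the quadratic attention term. The parts of a transformer that scale linearly with length would shrink by the compression ratio itself, not its square.

```python
def effective_tokens(window_tokens: int, compression_ratio: int) -> int:
    """Natural-language tokens represented by a window of compressed tokens."""
    return window_tokens * compression_ratio


def attention_cost_ratio(compression_ratio: int) -> int:
    """Factor by which full-attention pair count drops (n^2 scaling)."""
    return compression_ratio ** 2


RATIO = 50  # the average compression assumed in the text
for window in (1_000_000, 128_000, 32_000):
    print(f"{window:>9} compressed -> {effective_tokens(window, RATIO):>10} equivalent")

natural_equivalent = 1_000_000
compressed_length = natural_equivalent // RATIO
print(compressed_length, attention_cost_ratio(RATIO))
```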
A 10B Model That Punches Above Its Weight
This is why we don't need to start with a massive foundation build. A 10B parameter model trained on LoreTokens would be the right first test -- and it would likely be competitive with models far larger than itself on long-context tasks.
Consider: a 10B model with a 128K context window, trained on LoreToken input, would have the effective context reach of 6.4M tokens. That's more than any model in production today. On a single consumer GPU. At inference costs measured in electricity, not API bills.
The model is small, but it sees everything -- because everything has been compressed into a format that preserves meaning while eliminating verbosity. The reasoning capacity is 10B parameters. The information density per token is 50x higher than what those 10B parameters would normally see.
Standard 10B Model
- 128K token context window
- Reads natural language
- 128K tokens of information
- Runs on a single GPU
- Competitive with other 10B models
LoreToken 10B Model (Theoretical)
- 128K token context window
- Reads compressed LoreTokens
- ~6.4M tokens equivalent of information
- Runs on a single GPU
- Effective context reach exceeding all current models
The question is not whether a 10B model can be trained on LoreTokens. The question is whether its reasoning quality on compressed input matches or exceeds its reasoning quality on natural language. If the compression preserves the semantics that matter for reasoning -- and our production data suggests it does -- then the architecture works at any scale. If it doesn't, we'll know from the 10B test.
What Would Need to Be Built
- A LoreToken training corpus. Take a large, diverse text dataset -- Wikipedia, code, books, technical documentation -- and convert it to LoreToken format using an existing model as the helper/compiler. This is the bootstrap step: use a natural-language model to generate the training data for a model that will no longer need natural language.
- A LoreToken-native tokenizer. Standard tokenizers (BPE, SentencePiece) are optimized for English subwords. LoreTokens have a different structure -- key-value pairs, category markers, importance scores, semantic facts. The tokenizer needs to understand that structure, not fight it.
- Standard transformer training on the compressed corpus. The architecture itself doesn't need to be exotic -- standard transformer layers, standard attention. The novelty is in the input format, not the model architecture.
- A helper model or script for real-time compression/decompression at inference time. This could be a small fine-tuned model (1-3B) or, for structured inputs, a deterministic conversion script.
- Evaluation against natural-language models of the same size on the same reasoning tasks, measuring both quality and effective context utilization.
None of these steps require novel research. They require engineering, compute, and the willingness to train a model on something other than raw web text.
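To make the tokenizer step concrete, here is a sketch of structure-aware tokenization. The record syntax `[CATEGORY] key=value @importance` is invented for this example and is not the actual LoreToken format. It only illustrates emitting one typed token per structural field instead of splitting on English subwords.

```python
import re
from typing import List, Tuple

# Hypothetical record syntax, invented for this sketch only:
#   [CATEGORY] key=value @importance
RECORD = re.compile(
    r"^\[(?P<cat>[A-Z]+)\]\s+(?P<key>[\w.]+)=(?P<val>.+?)\s+@(?P<imp>\d(?:\.\d+)?)$"
)


def tokenize_record(line: str) -> List[Tuple[str, str]]:
    """Split one record into (token_type, text) pairs, one per structural field."""
    match = RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"not a record: {line!r}")
    return [
        ("CAT", match["cat"]),   # category marker
        ("KEY", match["key"]),   # key of the key-value pair
        ("VAL", match["val"]),   # value, may contain spaces
        ("IMP", match["imp"]),   # importance score, kept as text
    ]


print(tokenize_record("[PREF] editor=vim @0.8"))
print(tokenize_record("[FACT] user.city=New York @1"))
```

A real tokenizer would also need a vocabulary for values and a fallback for free text. The sketch covers only the structural split.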
What We Don't Know
Honesty matters more than hype. Here is what we don't know:
- Does reasoning quality degrade on compressed input? LoreTokens preserve semantic meaning. They discard phrasing, redundancy, and stylistic variation. It's possible that some reasoning tasks benefit from the redundancy that compression removes -- the way a human sometimes needs to hear something three different ways before they understand it. We won't know until we test.
- How good does the helper model need to be? If the compression is lossy in ways that matter, the main model inherits those losses. The helper is a bottleneck -- its compression quality is the ceiling for the main model's understanding. A bad helper makes the architecture worthless regardless of how good the main model is.
- Does the model generalize? A model trained on LoreToken-compressed Wikipedia might reason well about encyclopedic knowledge. Does it also reason well about code? About conversation? About tasks its training corpus didn't cover? Natural-language models generalize well because natural language is the universal format. LoreTokens are more structured -- that structure might help or hurt generalization.
- What's the optimal compression ratio? 50x compression is aggressive. Maybe 10x preserves more reasoning-relevant detail and still delivers 10x the effective context. The right ratio is an empirical question, not a theoretical one.
Variation: The Embedded Compressor
There's a cleaner version of this architecture that eliminates the separate helper model entirely: build the compression stage directly into the main model as a read-only encoder.
Instead of two models -- a helper that compresses and a main model that reasons -- you'd have one model with two stages. The first few layers (or a dedicated sub-network) handle compression: natural language comes in, dense semantic representations come out. The remaining layers reason over the compressed output. Once training is complete, the compression layers are frozen -- read-only. They don't update, they don't drift, and the same input always compresses to the same output.
The advantages are significant:
- Zero latency between compression and reasoning -- one forward pass, not two model calls
- No mismatched bottleneck -- the compressor learns during training exactly what the reasoning layers need to keep, because they're trained together. The compression is shaped by the reasoner's gradient signal.
- Single deployment -- one model file, one inference call, no orchestration between separate systems
- The compression is optimized for this specific reasoner -- not a general-purpose compression, but compression tuned to preserve exactly the features this model's attention patterns rely on
The trade-off is a larger model file -- perhaps 2-3B parameters for the compression stage on top of a 10B reasoning core, making a 12-13B total. But you're still on a single GPU, and the context window savings dwarf the parameter overhead. This is similar in spirit to DeepMind's Perceiver IO architecture, where a small latent bottleneck cross-attends to a much larger input. The difference is that Perceiver compresses into a learned black-box latent space. This compresses into a structured semantic format with known properties -- keys, values, categories, importance scores. The compressed representation is interpretable, not opaque.
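The latent-bottleneck mechanism referenced above can be sketched in a few lines: a small set of query vectors cross-attends to a longer input and returns one output vector per query. This is a simplified illustration under our own assumptions. It has no learned key/value projections and uses fixed queries and toy sizes, and its output is an opaque latent rather than the structured format proposed here. It shows only how the sequence length shrinks.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def compress(inputs: List[Vector], latents: List[Vector]) -> List[Vector]:
    """Cross-attention: each latent query reads all inputs; returns len(latents) vectors."""
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in latents:
        weights = softmax([dot(query, x) * scale for x in inputs])
        # Weighted average of the inputs (inputs act as both keys and values here).
        out.append([sum(w * x[d] for w, x in zip(weights, inputs)) for d in range(dim)])
    return out


inputs = [[float(i % 3), float(i % 5)] for i in range(12)]  # 12 toy input vectors
latents = [[1.0, 0.0], [0.0, 1.0]]                          # 2 queries (learned in practice)
compressed = compress(inputs, latents)
print(len(inputs), "->", len(compressed))
```

The compressor scores `len(latents) * len(inputs)` pairs, and the layers after it only ever see `len(latents)` positions.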
The Ability Nobody Is Aware Of: Multiple Context Windows
There's an assumption buried so deep in the current AI discourse that almost nobody examines it: that a model has one context window.
Why?
When you design a transformer from scratch, the context window is an architectural parameter. You set it during design. You train the model to use it. There is nothing in the mathematics of self-attention that says you can only have one. The single-window design is a convention, not a constraint. It's the way every major model has been built because it's the way the first major models were built.
To be clear: researchers have explored pieces of this territory. Longformer introduced mixed local and global attention patterns. Google's Memorizing Transformers attached a separate kNN memory bank. Compressive Transformers added a compressed memory tier. Perceiver IO uses cross-attention through a latent bottleneck. These are real contributions and they demonstrate that tiered attention works.
But none of them did the obvious next thing: give the model multiple distinct context windows at the architectural level, each with its own size, attention budget, and input format. They added memory mechanisms to a single-window model. That's a patch on the existing design, not a rethinking of the design itself.
Consider what happens if you stop patching and start over:
- A hot window for the current conversation -- small, fast, full attention
- A warm window for project context and accumulated knowledge -- larger, compressed, lower-priority attention
- A cold window for reference material -- very large, heavily compressed, sparse attention, queried only when the hot or warm context references something it contains
This is roughly how human cognition works. You're not holding your entire life's knowledge in working memory while you read this sentence. You have immediate focus (what you're reading right now), background context (what this article is about, why you're reading it), and deep storage (everything you've ever learned, accessible when triggered but not actively loaded). Three tiers. Different access patterns. Different costs.
Every model in production today crams all of this into one flat context window and hopes that attention patterns will sort out what matters. They mostly do -- but at enormous computational cost, because every token in the window attends to every other token regardless of whether it's the current sentence or a reference document loaded 50,000 tokens ago.
Multi-window attention would let the model allocate compute where it matters. Hot context gets full O(n^2) attention. Warm context gets cross-attention from the hot window -- the model can look things up in it, but the warm tokens don't attend to each other at full cost. Cold context gets sparse, triggered attention -- only accessed when a query from the hot or warm window activates a retrieval.
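One way to picture the hot and warm tiers is as an attention mask. The sketch below is an assumption-laden illustration, not a tested design. It ignores causal masking, treats "warm tokens don't attend to each other at full cost" as "warm tokens attend only to themselves", and leaves the cold tier out because it is reached by retrieval rather than by a mask.

```python
from typing import List


def build_mask(n_hot: int, n_warm: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Positions 0..n_hot-1 are hot; the remaining n_warm positions are warm.
    """
    n = n_hot + n_warm
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_hot:
                mask[q][k] = True       # hot queries see every hot and warm key
            else:
                mask[q][k] = (q == k)   # assumption: warm positions see only themselves
    return mask


mask = build_mask(n_hot=4, n_warm=6)
allowed = sum(cell for row in mask for cell in row)
print(allowed, "of", len(mask) ** 2, "query-key pairs are scored")
```

With 4 hot and 6 warm positions, 46 of the 100 possible pairs are scored. The saving grows as the warm tier gets larger relative to the hot one.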
Now combine this with LoreToken compression -- and this is the part that hasn't been explored. The hot window is small and fast -- maybe 32K tokens of natural language. The warm window is 128K tokens of LoreToken-compressed project context -- equivalent to 6.4M tokens of natural language at 50x compression. The cold window is persistent storage, compressed even further, holding the accumulated knowledge the model has been given -- bounded by disk, not by attention. Three tiers, three compression levels, three attention budgets. One model.
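A rough cost comparison using the window sizes named above, counting query-key pairs per head per layer. The accounting is our own simplification: full attention in the hot window, hot-to-warm cross-attention only, and cold entries scored only when retrieved. `cold_hits` is a hypothetical parameter for the number of retrieved cold entries.

```python
def tiered_pairs(n_hot: int, n_warm: int, cold_hits: int = 0) -> int:
    hot_self = n_hot * n_hot          # full attention inside the hot window
    hot_to_warm = n_hot * n_warm      # hot queries reading warm keys
    hot_to_cold = n_hot * cold_hits   # assumption: only retrieved cold entries are scored
    return hot_self + hot_to_warm + hot_to_cold


def flat_pairs(n_tokens: int) -> int:
    """Pairs scored when everything sits in one flat full-attention window."""
    return n_tokens * n_tokens


HOT, WARM = 32_000, 128_000
tiered = tiered_pairs(HOT, WARM)
flat = flat_pairs(HOT + WARM)
print(f"tiered: {tiered:.3e}  flat: {flat:.3e}  ratio: {flat / tiered:.1f}x")
```

Under these assumptions the tiered layout scores a fifth of the pairs of a flat 160K window holding the same tokens. That is before counting that the warm tokens are themselves compressed.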
The existing research proved that tiered attention mechanisms work. What it didn't explore is combining true multi-window architecture with structured semantic compression as the native format for the lower tiers. That combination -- multiple windows where each tier receives increasingly compressed, increasingly structured input -- is what turns a memory augmentation technique into a fundamentally different kind of model.
We don't think of this as thinking outside the box. We don't give recognition to boxes. The single-context-window convention isn't a law of physics -- it's an architectural choice that was never revisited as compression technology matured. The question isn't whether multi-window transformers are possible. The question is why nobody has combined them with structured semantic compression to build a model that reasons natively over meaning at multiple scales.
Why We're Publishing This Before Building It
Because the idea is more valuable shared than hoarded.
We have the compression format. LoreTokens are in production, achieving measured compression ratios across real workloads. We have the retrieval engine -- Atlas -- that already proves deterministic retrieval over compressed semantic content works at scale. What we don't have, yet, is the compute budget to train a foundation model from scratch.
A 10B model training run is feasible -- under $1M in compute based on what Qwen, Mistral, and others have demonstrated at this scale. We hope to be able to test-build soon. When we do, the results -- positive or negative -- will be published here.
In the meantime, the architecture is open for discussion. If you're a researcher, a lab, or a funder who reads this and thinks "that's testable," we agree. That's why we wrote it down.
About this article: This is a theoretical architecture proposal from the team behind SAIQL, Atlas, and LoreTokens. The compression format is real and in production. The model architecture is theoretical and untested. We are not claiming this works -- we are claiming it's worth testing, and explaining why.