The atlas is the garden’s serious map of the problem space. It is meant to answer a harder question than “what pages exist?”: given the actual challenge rules, where is the pressure coming from, which research lanes matter most, how do they interact, and what still looks unresolved?
If the challenge section explains the game, this page maps the battlefield.
1. The challenge in one research sentence
Parameter Golf asks for the strongest language model that can survive a self-contained 16 MB artifact budget and still score well on the real evaluation path. That means the frontier is defined by quality per stored byte, not by parameter count alone.
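The metric framing can be made concrete. A minimal sketch, assuming the score charges the model’s total cross-entropy (in bits) against the UTF-8 byte length of the evaluated text; the loss values below are hypothetical:

```python
import math

def bits_per_byte(token_nats: list[float], text: str) -> float:
    """Convert per-token cross-entropy (in nats) into bits per UTF-8 byte.

    Dividing by byte count, not token count, means a tokenizer that
    emits fewer tokens does not automatically look better: the same
    total uncertainty is spread over the same number of bytes.
    """
    total_bits = sum(token_nats) / math.log(2)  # nats -> bits
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes

# Hypothetical per-token losses over a short string.
losses = [2.1, 1.7, 0.9, 1.3]
score = bits_per_byte(losses, "compression rules")
```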
2. Primary forces shaping the research map
| Constraint or pressure | Why it matters | Research lanes it activates |
|---|---|---|
| 16 MB total artifact cap | Forces explicit byte allocation across weights, code, tokenizer assets, and metadata | Recursive sharing, Tokenizer and vocabulary, Quantization and outliers |
| Score is taken on the actual artifact path | Makes post-export robustness a first-class modeling objective | Quantization and outliers, RMSNorm stabilized scaling, Sparse outlier preservation |
| Evaluation-time budget exists but is not zero | Makes compute-for-bytes trades viable instead of purely theoretical | Evaluation-time compute, Recursive sharing |
| Training budget is tight | Rewards ideas that converge quickly and avoid long fragile optimization stories | Training economics, Plan Early!, SLM compute bottlenecks |
| Metric is bits per byte rather than bits per token | Brings tokenizer choice, output-side cost, and sequence length back into the core problem | Tokenizer and vocabulary, ReTok, Vocabulary Compression |
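The first row of the table is worth making concrete. A sketch of explicit byte accounting, with every component size hypothetical and the cap assumed to be binary megabytes:

```python
# Sketch of byte allocation under the 16 MB artifact cap.
# All component names and sizes are hypothetical; the point is that
# weights are not the only line item.
CAP = 16 * 1024 * 1024  # 16 MiB, assuming the cap is binary megabytes

budget = {
    "weights_int4": 13_500_000,   # quantized parameter payload
    "outliers_fp16": 400_000,     # sparse high-precision outliers
    "tokenizer": 900_000,         # vocabulary and merge rules
    "code": 120_000,              # inference script
    "metadata": 30_000,           # shapes, scales, config
}

used = sum(budget.values())
slack = CAP - used
assert used <= CAP, "over the artifact cap"
```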
3. Public-record orientation
Before going deep into mechanisms, it helps to know what the public challenge record already made obvious.
What the baseline era taught
The public baseline established that:
- the real target is the post-roundtrip artifact, not the floating-point model
- byte accounting has to include more than just raw weights
- export degradation is large enough to reshape model design
- straightforward small dense models are only the starting point, not the endgame
What the visible frontier already seems to cover
The open conversation around the challenge has already made these areas central:
- shared depth and parameter tying
- low-bit robustness and compression-aware training
- output-side and vocabulary discipline
- basic forms of evaluation-side adaptation
So the atlas treats those as core lanes, not speculative side notes.
4. The main research lanes
Lane A: Recursive sharing and effective depth
Core question: can we buy much more effective model capacity by reusing a small amount of structure many times?
Why this lane is central:
- stored uniqueness is expensive
- recurrence converts compute into depth
- shared blocks may let width grow under the same byte budget
Important pages:
- Recursive and shared-parameter architectures
- Recursive width scaling
- Recurrent wide architecture
- Relaxed Recursive Transformers
- MoEUT
- Fine-grained Parameter Sharing
Main open risk:
- extreme sharing can underfit or become too rigid unless the saved bytes are reinvested wisely
What would count as a decisive advance:
- a recurrent or shared-depth design that beats a similarly budgeted dense design after the real artifact path, not just before export
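The core trade in this lane can be sketched in a few lines, with a toy block standing in for a real transformer layer (widths and depths here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64       # model width (hypothetical)
depth = 12   # effective depth after unrolling

# Unique-depth baseline: one weight matrix stored per layer.
unique_layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

# Shared-depth variant: one block stored once, reused `depth` times.
shared_block = rng.standard_normal((d, d)) / np.sqrt(d)

def forward(x, layers):
    for W in layers:
        x = np.tanh(x @ W)  # toy nonlinearity standing in for a full block
    return x

x = rng.standard_normal(d)
y_unique = forward(x, unique_layers)
y_shared = forward(x, [shared_block] * depth)

# Same effective depth; 12x fewer stored parameters in the shared variant.
stored_unique = depth * d * d
stored_shared = d * d
```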
Lane B: Quantization, outliers, and artifact-aware robustness
Core question: how do we train models whose quality survives the exact export and decompression path the challenge cares about?
Why this lane is central:
- post-export quality is the true bottleneck for many otherwise-good models
- small activation-scale or outlier problems can dominate roundtrip degradation
- the lane often yields improvements without changing the broad architecture at all
Important pages:
- Quantization and outliers
- RMSNorm stabilized scaling
- Sparse outlier preservation
- Extra RMSNorm
- QuEST
- pQuant
- ClusComp
- MicroScopiQ
Main open risk:
- some fixes improve robustness locally but consume too many bytes, too much runtime, or too much implementation complexity to remain challenge-optimal
What would count as a decisive advance:
- a method that materially shrinks the pre→post score gap while keeping byte overhead negligible or clearly worthwhile
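A minimal sketch of the outlier-preservation idea, assuming plain symmetric round-to-nearest quantization rather than any specific paper’s method; the bit width and outlier fraction are illustrative:

```python
import numpy as np

def quantize_with_outliers(w, bits=4, outlier_frac=0.01):
    """Symmetric round-to-nearest quantization that keeps the
    largest-magnitude fraction of weights in full precision.

    The point: a few extreme values otherwise force a coarse
    quantization step on everything else.
    """
    w = np.asarray(w, dtype=np.float64)
    k = int(outlier_frac * w.size)
    mask = np.zeros(w.size, dtype=bool)
    if k:
        mask[np.argsort(np.abs(w.ravel()))[-k:]] = True

    body = w.ravel().copy()
    body[mask] = 0.0                       # exclude outliers from the scale
    scale = np.abs(body).max() / (2 ** (bits - 1) - 1) or 1.0
    q = np.round(body / scale) * scale     # round-to-nearest on the body
    q[mask] = w.ravel()[mask]              # restore outliers exactly
    return q.reshape(w.shape)

# One extreme weight among 999 small ones.
w = np.concatenate([np.random.default_rng(1).normal(0, 0.02, 999), [3.0]])
err_plain = np.abs(quantize_with_outliers(w, outlier_frac=0.0) - w).mean()
err_kept = np.abs(quantize_with_outliers(w, outlier_frac=0.001) - w).mean()
```

Keeping even one outlier lets the quantization grid track the small-weight body, which is exactly the roundtrip-degradation mechanism the lane worries about.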
Lane C: Tokenizer, vocabulary, and output-side efficiency
Core question: can we reduce the cost of representing text without paying too much in artifact bytes or sequence-length pain?
Why this lane is central:
- the challenge metric is bits per byte
- output layers and vocabulary assets are expensive in small models
- tokenization choices reshape both compute and artifact cost
Important pages:
- Tokenizer and vocabulary efficiency
- Tokenizer efficiency
- ReTok
- Vocabulary Compression
- Tokenizer Evaluation Across Scales
Main open risk:
- a tokenizer idea can look elegant but lose once artifact size, sequence length, and engineering overhead are all counted honestly
What would count as a decisive advance:
- a tokenizer or output-head change that improves final score per byte instead of merely reducing token count in isolation
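The risk above is easy to make concrete with arithmetic. In this hypothetical sketch, a tokenizer that halves token count while doubling per-token loss changes nothing in bits per byte:

```python
import math

def total_bits(n_tokens, nats_per_token):
    # Total cross-entropy in bits for a run of tokens.
    return n_tokens * nats_per_token / math.log(2)

text_bytes = 1000  # hypothetical evaluation text length in UTF-8 bytes

# Tokenizer A: many short tokens, each cheap to predict.
bpb_a = total_bits(n_tokens=250, nats_per_token=2.0) / text_bytes
# Tokenizer B: half as many, longer tokens, each twice as hard to predict.
bpb_b = total_bits(n_tokens=125, nats_per_token=4.0) / text_bytes

# Identical bits per byte: halving token count buys nothing unless
# per-token loss grows by less than a factor of two.
```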
Lane D: Evaluation-time compute
Core question: what capability is cheaper to recompute during evaluation than to store permanently in the artifact?
Why this lane is central:
- the challenge allows nontrivial evaluation behavior within a time cap
- recurrent models become much more interesting if extra unrolling is useful
- tiny models may benefit more from carefully budgeted extra passes than large models do
Main open risk:
- evaluation-time tricks can win on paper but fail the runtime budget or become too brittle to trust
What would count as a decisive advance:
- a compact model whose score improves reliably as evaluation compute increases, with a clear best point still inside the budget
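A toy illustration of the compute-for-bytes trade, using a linear fixed-point iteration as a stand-in for a recurrent model refining its latent state; the dimensions and contraction factor are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32

# A shared block whose repeated application converges to a fixed point.
# Scaling the operator norm below 1 makes the map a contraction.
A = rng.standard_normal((d, d))
A *= 0.5 / np.linalg.norm(A, 2)
b = rng.standard_normal(d)
x_star = np.linalg.solve(np.eye(d) - A, b)  # exact fixed point of x = Ax + b

def unroll(steps):
    # Error after `steps` applications of the shared block from a cold start.
    x = np.zeros(d)
    for _ in range(steps):
        x = A @ x + b
    return np.linalg.norm(x - x_star)

errors = [unroll(s) for s in (2, 8, 32)]
# More evaluation-time steps -> smaller error, until the time budget
# (not the artifact budget) says stop.
```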
Lane E: Training economics and small-model bottlenecks
Core question: which ideas are actually compatible with the challenge’s limited optimization budget?
Why this lane matters:
- not every theoretically good compression method can be trained fast enough
- small models have their own optimization pathologies
- research time should prioritize ideas that converge under the real constraints
Important pages:
- Training economics and bottlenecks
- Computational Bottlenecks of Training SLMs
- Need a Small Specialized LM? Plan Early!
Main open risk:
- spending months on methods whose best-case version is still too slow, too unstable, or too code-heavy for the challenge
What would count as a decisive advance:
- a method that is not only byte-efficient but also easy to train into the relevant quality regime within the contest budget
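A back-of-envelope feasibility check, using the common ~6·N·D FLOPs-per-token approximation for dense transformers; every budget number here is hypothetical:

```python
# Rough training-cost estimate: ~6 FLOPs per parameter per token
# for a dense transformer. All numbers below are hypothetical.
params = 30_000_000          # ~15 MB at 4 bits/param, inside a 16 MB cap
tokens = 10_000_000_000      # training tokens
flops = 6 * params * tokens  # the 6·N·D approximation

gpu_flops_per_s = 1.0e14     # sustained throughput of one hypothetical GPU
hours = flops / gpu_flops_per_s / 3600
# If `hours` exceeds the contest's optimization budget, the idea is
# strategically dead regardless of its byte efficiency.
```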
5. The highest-value interactions between lanes
The challenge is unlikely to be won by a single isolated trick. The most important interactions are:
Recursive sharing × evaluation-time compute
Shared blocks become much more valuable if extra unrolling at evaluation time can recover some of the capacity lost by removing unique depth.
Quantization robustness × selective precision
Outlier control and better normalization set up the model for cheaper storage; selective precision then spends expensive bytes only on the truly fragile parts.
Tokenizer efficiency × output-head efficiency
Smarter text representation can reduce both sequence costs and output-side storage pressure, which can free bytes for the core model.
Training economics × every other lane
A beautiful idea that cannot be optimized quickly enough is strategically weaker than a slightly less elegant idea that reliably lands in the good region under the real budget.
6. The most important unresolved questions
These are the questions that seem to matter most at the current frontier:
- How far can structure sharing go before the loss in specialization outweighs the byte savings?
- Is the best compression path still “dense low-bit plus safeguards,” or does the challenge want a more radical artifact format?
- Are tokenizer changes underexploited, or are they mostly a distraction once full artifact costs are counted?
- Can evaluation-time latent refinement deliver real score gains without blowing the time cap?
- Which small training stabilizers actually survive the final artifact path and not just the floating-point checkpoint?
7. Reading routes through the garden
Route 1: Understand the challenge first
- Challenge overview
- Constraints and scoring
- History and public runs
- How to read the leaderboard and public records
Route 2: Understand the byte-saving architecture case
- Recursive and shared-parameter architectures
- Recursive width scaling
- Recurrent wide architecture
- Relaxed Recursive Transformers
- MoEUT
Route 3: Understand the artifact-robustness case
- Quantization and outliers
- RMSNorm stabilized scaling
- Sparse outlier preservation
- Extra RMSNorm
- pQuant
- QuEST
Route 4: Understand the underexplored frontier
- Public research directions
- Research frontiers
- Research ideas
- Moonshots
- Tokenizer and vocabulary efficiency
- Evaluation-time compute
- Vocabulary Compression
- Inference Scaling Laws
Route 5: Understand what this repository has already tried locally
- Local experiment history
- Local research lineage
- Kept runs and turning points
- Dead ends and failed directions
8. Bottom line
The Parameter Golf research landscape is best understood as a constrained trade study between:
- stored structure
- recoverable structure
- evaluation compute
- artifact robustness
- training efficiency
The central strategic question is always the same:
Which combination of structure sharing, compression robustness, vocabulary design, and evaluation behavior buys the most real score per byte under the actual challenge rules?