The atlas is the garden’s serious map of the problem space. It is meant to answer a harder question than “what pages exist?”

Given the actual challenge rules, where is the pressure coming from, which research lanes matter most, how do they interact, and what still looks unresolved?

If the challenge section explains the game, this page maps the battlefield.

1. The challenge in one research sentence

Parameter Golf asks for the strongest language model that can survive a self-contained 16 MB artifact budget and still score well on the real evaluation path. That means the frontier is defined by quality per stored byte, not by parameter count alone.
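To make "quality per stored byte" concrete, here is a minimal sketch of the metric conversion, assuming the score is derived from ordinary cross-entropy. The specific loss and bytes-per-token figures below are illustrative, not taken from the challenge:

```python
import math

def bits_per_byte(nats_per_token: float, bytes_per_token: float) -> float:
    """Convert a per-token cross-entropy (in nats) into bits per byte of raw text."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# Illustrative numbers only: a loss of 1.8 nats/token with a tokenizer
# that covers about 4.2 bytes of raw text per token.
score = bits_per_byte(nats_per_token=1.8, bytes_per_token=4.2)
```

The denominator is why the metric pulls tokenizer design into the core problem: a tokenizer that covers more bytes per token improves the score directly, before the model itself changes at all.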

2. Primary forces shaping the research map

Each entry below pairs a constraint with why it matters and the research lanes it activates:

  • 16 MB total artifact cap. Forces explicit byte allocation across weights, code, tokenizer assets, and metadata. Lanes: Recursive sharing, Tokenizer and vocabulary, Quantization and outliers.
  • Score is taken on the actual artifact path. Makes post-export robustness a first-class modeling objective. Lanes: Quantization and outliers, RMSNorm stabilized scaling, Sparse outlier preservation.
  • Evaluation-time budget exists but is not zero. Makes compute-for-bytes trades viable instead of purely theoretical. Lanes: Evaluation-time compute, Recursive sharing.
  • Training budget is tight. Rewards ideas that converge quickly and avoid long fragile optimization stories. Lanes: Training economics, Plan Early!, SLM compute bottlenecks.
  • Metric is bits per byte rather than bits per token. Brings tokenizer choice, output-side cost, and sequence length back into the core problem. Lanes: Tokenizer and vocabulary, ReTok, Vocabulary Compression.

3. Public-record orientation

Before going deep into mechanisms, it helps to know what the public challenge record already made obvious.

What the baseline era taught

The public baseline established that:

  • the real target is the post-roundtrip artifact, not the floating-point model
  • byte accounting has to include more than just raw weights
  • export degradation is large enough to reshape model design
  • straightforward small dense models are only the starting point, not the endgame

Start here:

What the visible frontier already seems to cover

The open conversation around the challenge has already made these areas central:

  • shared depth and parameter tying
  • low-bit robustness and compression-aware training
  • output-side and vocabulary discipline
  • basic forms of evaluation-side adaptation

So the atlas treats those as core lanes, not speculative side notes.

4. The main research lanes

Lane A: Recursive sharing and effective depth

Core question: can we buy much more effective model capacity by reusing a small amount of structure many times?

Why this lane is central:

  • stored uniqueness is expensive
  • recurrence converts compute into depth
  • shared blocks may let width grow under the same byte budget
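A back-of-envelope parameter count shows why reuse is attractive. The 12·width² per-layer figure is the usual rough transformer estimate (4·width² for attention, 8·width² for the MLP), and every width and depth below is hypothetical:

```python
import math

def stored_params(width: int, layers: int, shared: bool) -> int:
    """Rough transformer storage: ~12*width^2 params per layer
    (4*width^2 attention + 8*width^2 MLP), embeddings ignored."""
    per_layer = 12 * width * width
    return per_layer * (1 if shared else layers)

unique = stored_params(width=512, layers=12, shared=False)  # 12 unique blocks
tied = stored_params(width=512, layers=12, shared=True)     # 1 block, unrolled 12x

# Reinvesting the saved bytes: the width a single shared block could
# afford under the same storage budget as the 12 unique blocks.
wider = int(512 * math.sqrt(12))
```

The tied model stores 12x fewer block parameters at the same effective depth; whether the recovered width (or extra unrolling) compensates for the lost per-layer specialization is exactly this lane's open question.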

Important pages:

Main open risk:

  • extreme sharing can underfit or become too rigid unless the saved bytes are reinvested wisely

What would count as a decisive advance:

  • a recurrent or shared-depth design that beats a similarly budgeted dense design after the real artifact path, not just before export

Lane B: Quantization, outliers, and artifact-aware robustness

Core question: how do we train models whose quality survives the exact export and decompression path the challenge cares about?

Why this lane is central:

  • post-export quality is the true bottleneck for many otherwise-good models
  • small activation-scale or outlier problems can dominate roundtrip degradation
  • the lane often yields improvements without changing the broad architecture at all
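As a toy illustration of why outliers dominate roundtrip damage, here is a minimal sketch of symmetric uniform quantization with the largest-magnitude weights kept exact. It is a stand-in for sparse outlier preservation, not the challenge's actual export path, and the bit width and outlier fraction are assumptions:

```python
def quantize_roundtrip(weights, bits=4, outlier_frac=0.01):
    """Symmetric uniform quantization, with the top outlier_frac of weights
    by magnitude kept at full precision (toy sparse outlier preservation)."""
    n_out = max(1, int(len(weights) * outlier_frac))
    by_magnitude = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    outliers = set(by_magnitude[:n_out])
    # The quantization grid is set by the inlier range, not by the outliers.
    inlier_max = max(abs(w) for i, w in enumerate(weights) if i not in outliers)
    scale = inlier_max / (2 ** (bits - 1) - 1) or 1.0
    return [w if i in outliers else round(w / scale) * scale
            for i, w in enumerate(weights)]
```

The point of the sketch: one huge weight no longer stretches the grid for everything else, and the few preserved outliers cost only a handful of extra bytes, which is why this family of fixes can help without touching the broad architecture.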

Important pages:

Main open risk:

  • some fixes improve robustness locally but consume too many bytes, too much runtime, or too much implementation complexity to remain challenge-optimal

What would count as a decisive advance:

  • a method that materially shrinks the pre→post score gap while keeping byte overhead negligible or clearly worthwhile

Lane C: Tokenizer, vocabulary, and output-side efficiency

Core question: can we reduce the cost of representing text without paying too much in artifact bytes or sequence-length pain?

Why this lane is central:

  • the challenge metric is bits per byte
  • output layers and vocabulary assets are expensive in small models
  • tokenization choices reshape both compute and artifact cost
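To see why vocabulary assets are expensive at this scale, a quick byte-accounting sketch; the vocabulary sizes, width, and per-parameter storage cost are all hypothetical:

```python
def vocab_matrix_bytes(vocab_size: int, width: int, bytes_per_param: float,
                       tied: bool = True) -> float:
    """Bytes the vocabulary costs inside the artifact: the embedding matrix,
    doubled if the output head is untied from the embeddings."""
    return (1 if tied else 2) * vocab_size * width * bytes_per_param

# Hypothetical: width 512, 4-bit storage (0.5 bytes/param).
big = vocab_matrix_bytes(32_000, 512, 0.5)   # ~8.2 MB: half of a 16 MB cap
small = vocab_matrix_bytes(8_000, 512, 0.5)  # ~2.0 MB
```

Shrinking the vocabulary frees bytes for the core model, but every dropped merge makes tokens cover fewer bytes of text, which worsens the bits-per-byte denominator; that is why this lane must be judged on final score, not token count.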

Important pages:

Main open risk:

  • a tokenizer idea can look elegant but lose once artifact size, sequence length, and engineering overhead are all counted honestly

What would count as a decisive advance:

  • a tokenizer or output-head change that improves final score per byte instead of merely reducing token count in isolation

Lane D: Evaluation-time compute

Core question: what capability is cheaper to recompute during evaluation than to store permanently in the artifact?

Why this lane is central:

  • the challenge allows nontrivial evaluation behavior within a time cap
  • recurrent models become much more interesting if extra unrolling is useful
  • tiny models may benefit more from carefully budgeted extra passes than large models do
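A minimal sketch of the budgeting logic, assuming a quality-vs-unrolls curve has been measured offline; the curve, per-step cost, and budget below are invented for illustration:

```python
def best_unroll(quality, step_cost, budget):
    """Pick the unroll count with the best measured quality that still
    fits the evaluation-time budget (0 if nothing fits)."""
    affordable = [(q, n) for n, q in enumerate(quality, start=1)
                  if n * step_cost <= budget]
    return max(affordable)[1] if affordable else 0

# Hypothetical curve with diminishing returns; only 4 unrolls fit the budget.
curve = [0.50, 0.62, 0.68, 0.70, 0.71, 0.71]
steps = best_unroll(curve, step_cost=1.5, budget=6.0)
```

The interesting case is when the curve keeps rising right up to the cap; a flat curve means the recurrence is not actually converting extra compute into quality.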

Important pages:

Main open risk:

  • evaluation-time tricks can win on paper but fail the runtime budget or become too brittle to trust

What would count as a decisive advance:

  • a compact model whose score improves reliably as evaluation compute increases, with a clear best point still inside the budget

Lane E: Training economics and small-model bottlenecks

Core question: which ideas are actually compatible with the challenge’s limited optimization budget?

Why this lane matters:

  • not every theoretically good compression method can be trained fast enough
  • small models have their own optimization pathologies
  • research time should prioritize ideas that converge under the real constraints
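A first-pass feasibility filter can be as simple as the standard 6·N·D rule of thumb for dense-transformer training FLOPs. The budget below is hypothetical, and note that recurrent or shared-depth models pay compute per unrolled layer, so their effective N is larger than their stored N:

```python
def trainable_tokens(flops_budget: float, effective_params: float) -> float:
    """Rule of thumb: dense transformer training costs ~6 * params * tokens
    FLOPs, so a compute budget bounds the affordable training tokens."""
    return flops_budget / (6 * effective_params)

# Hypothetical: ~1e18 training FLOPs for a 20M-effective-param model.
tokens = trainable_tokens(1e18, 20e6)  # roughly 8.3e9 tokens
```

If an idea needs far more tokens than this bound to reach its interesting regime, it probably fails the lane's test regardless of how byte-efficient the final artifact would be.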

Important pages:

Main open risk:

  • spending months on methods whose best-case version is still too slow, too unstable, or too code-heavy for the challenge

What would count as a decisive advance:

  • a method that is not only byte-efficient but also easy to train into the relevant quality regime within the contest budget

5. The highest-value interactions between lanes

The challenge is unlikely to be won by a single isolated trick. The most important interactions are:

Recursive sharing × evaluation-time compute

Shared blocks become much more valuable if extra unrolling at evaluation time can recover some of the capacity lost by removing unique depth.

Quantization robustness × selective precision

Outlier control and better normalization set up the model for cheaper storage; selective precision then spends expensive bytes only on the truly fragile parts.
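The byte arithmetic of this pairing is easy to sketch. The parameter count, fragile fraction, and bit widths below are assumptions, and the index overhead for locating the high-precision entries is ignored:

```python
def mixed_precision_bytes(n_params: int, frac_hi: float,
                          bits_lo: int = 4, bits_hi: int = 16) -> float:
    """Artifact bytes when a small fragile fraction of weights stays
    high precision and everything else is stored low-bit."""
    hi = n_params * frac_hi
    lo = n_params - hi
    return (lo * bits_lo + hi * bits_hi) / 8

# Hypothetical 20M-param model, 0.5% of weights kept at 16-bit:
total = mixed_precision_bytes(20_000_000, 0.005)  # 10.15 MB, inside a 16 MB cap
```

Keeping even 0.5% of weights at 16-bit costs only ~0.15 MB over a pure 4-bit artifact here; selective precision stays cheap exactly when normalization keeps the fragile set small.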

Tokenizer efficiency × output-head efficiency

Smarter text representation can reduce both sequence costs and output-side storage pressure, which can free bytes for the core model.

Training economics × every other lane

A beautiful idea that cannot be optimized quickly enough is strategically weaker than a slightly less elegant idea that reliably lands in the good region under the real budget.

6. The most important unresolved questions

These are the questions that seem to matter most at the current frontier:

  1. How far can structure sharing go before the loss in specialization outweighs the byte savings?
  2. Is the best compression path still “dense low-bit plus safeguards,” or does the challenge want a more radical artifact format?
  3. Are tokenizer changes underexploited, or are they mostly a distraction once full artifact costs are counted?
  4. Can evaluation-time latent refinement deliver real score gains without blowing the time cap?
  5. Which small training stabilizers actually survive the final artifact path and not just the floating-point checkpoint?

7. Reading routes through the garden

Route 1: Understand the challenge first

Route 2: Understand the byte-saving architecture case

Route 3: Understand the artifact-robustness case

Route 4: Understand the underexplored frontier

Route 5: Understand what this repository has already tried locally

8. Bottom line

The Parameter Golf research landscape is best understood as a constrained trade study between:

  • stored structure
  • recoverable structure
  • evaluation compute
  • artifact robustness
  • training efficiency

The central strategic question is always the same:

Which combination of structure sharing, compression robustness, vocabulary design, and evaluation behavior buys the most real score per byte under the actual challenge rules?
