The atlas is the garden’s serious map of the problem space. It is meant to answer a harder question than “what pages exist?”: given the actual challenge rules, where is the pressure coming from, which research lanes matter most, how do they interact, and what still looks unresolved?
If the challenge section explains the game, this page maps the battlefield.
1. The challenge in one research sentence
Parameter Golf asks for the strongest language model that can survive a self-contained 16 MB artifact budget and still score well on the real evaluation path. That means the frontier is defined by quality per stored byte, not by parameter count alone.
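The metric framing can be made concrete. A minimal sketch, assuming the score charges the model’s total cross-entropy (in bits) against the UTF-8 byte length of the evaluated text; the loss values below are hypothetical:

```python
import math

def bits_per_byte(token_nats: list[float], text: str) -> float:
    """Convert per-token cross-entropy (in nats) into bits per UTF-8 byte.

    Dividing by byte count, not token count, means a tokenizer that
    emits fewer tokens does not automatically look better: the same
    total uncertainty is spread over the same number of bytes.
    """
    total_bits = sum(token_nats) / math.log(2)  # nats -> bits
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes

# Hypothetical per-token losses over a short string.
losses = [2.1, 1.7, 0.9, 1.3]
score = bits_per_byte(losses, "compression rules")
```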
2. Primary forces shaping the research map
| Constraint or pressure | Why it matters | Research lanes it activates |
|---|---|---|
| 16 MB total artifact cap | Forces explicit byte allocation across weights, code, tokenizer assets, and metadata | Recursive sharing, Tokenizer and vocabulary, Quantization and outliers |
| Score is taken on the actual artifact path | Makes post-export robustness a first-class modeling objective | Quantization and outliers, RMSNorm stabilized scaling, Sparse outlier preservation |
| Evaluation-time budget exists but is not zero | Makes compute-for-bytes trades viable instead of purely theoretical | Evaluation-time compute, Recursive sharing |
| Training budget is tight | Rewards ideas that converge quickly and avoid long fragile optimization stories | Training economics, Plan Early!, SLM compute bottlenecks |
| Metric is bits per byte rather than bits per token | Brings tokenizer choice, output-side cost, and sequence length back into the core problem | Tokenizer and vocabulary, ReTok, Vocabulary Compression |
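The first row of the table is worth making concrete. A sketch of explicit byte accounting, with every component size hypothetical and the cap assumed to be binary megabytes:

```python
# Sketch of byte allocation under the 16 MB artifact cap.
# All component names and sizes are hypothetical; the point is that
# weights are not the only line item.
CAP = 16 * 1024 * 1024  # 16 MiB, assuming the cap is binary megabytes

budget = {
    "weights_int4": 13_500_000,   # quantized parameter payload
    "outliers_fp16": 400_000,     # sparse high-precision outliers
    "tokenizer": 900_000,         # vocabulary and merge rules
    "code": 120_000,              # inference script
    "metadata": 30_000,           # shapes, scales, config
}

used = sum(budget.values())
slack = CAP - used
assert used <= CAP, "over the artifact cap"
```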
3. Public-record orientation
Before going deep into mechanisms, it helps to know what the public challenge record already made obvious.
What the baseline era taught
The public baseline established that:
- the real target is the post-roundtrip artifact, not the floating-point model
- byte accounting has to include more than just raw weights
- export degradation is large enough to reshape model design
- straightforward small dense models are only the starting point, not the endgame
What the visible frontier already seems to cover
The open conversation around the challenge has already made these areas central:
- shared depth and parameter tying
- low-bit robustness and compression-aware training
- output-side and vocabulary discipline
- basic forms of evaluation-side adaptation
So the atlas treats those as core lanes, not speculative side notes.
4. The main research lanes
Lane A: Recursive sharing and effective depth
Core question: can we buy much more effective model capacity by reusing a small amount of structure many times?
Why this lane is central:
- stored uniqueness is expensive
- recurrence converts compute into depth
- shared blocks may let width grow under the same byte budget
Important pages:
- Recursive and shared-parameter architectures
- Recursive width scaling
- Recurrent wide architecture
- Relaxed Recursive Transformers
- MoEUT
- Fine-grained Parameter Sharing
Main open risk:
- extreme sharing can underfit or become too rigid unless the saved bytes are reinvested wisely
What would count as a decisive advance:
- a recurrent or shared-depth design that beats a similarly budgeted dense design after the real artifact path, not just before export
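The core trade in this lane can be sketched in a few lines, with a toy block standing in for a real transformer layer (widths and depths here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64       # model width (hypothetical)
depth = 12   # effective depth after unrolling

# Unique-depth baseline: one weight matrix stored per layer.
unique_layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

# Shared-depth variant: one block stored once, reused `depth` times.
shared_block = rng.standard_normal((d, d)) / np.sqrt(d)

def forward(x, layers):
    for W in layers:
        x = np.tanh(x @ W)  # toy nonlinearity standing in for a full block
    return x

x = rng.standard_normal(d)
y_unique = forward(x, unique_layers)
y_shared = forward(x, [shared_block] * depth)

# Same effective depth; 12x fewer stored parameters in the shared variant.
stored_unique = depth * d * d
stored_shared = d * d
```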
Lane B: Quantization, outliers, and artifact-aware robustness
Core question: how do we train models whose quality survives the exact export and decompression path the challenge cares about?
Why this lane is central:
- post-export quality is the true bottleneck for many otherwise-good models
- small activation-scale or outlier problems can dominate roundtrip degradation
- the lane often yields improvements without changing the broad architecture at all
Important pages:
- Quantization and outliers
- RMSNorm stabilized scaling
- Sparse outlier preservation
- Extra RMSNorm
- QuEST
- pQuant
- ClusComp
- MicroScopiQ
Main open risk:
- some fixes improve robustness locally but consume too many bytes, too much runtime, or too much implementation complexity to remain challenge-optimal
What would count as a decisive advance:
- a method that materially shrinks the pre→post score gap while keeping byte overhead negligible or clearly worthwhile
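A minimal sketch of the outlier-preservation idea, assuming plain symmetric round-to-nearest quantization rather than any specific paper’s method; the bit width and outlier fraction are illustrative:

```python
import numpy as np

def quantize_with_outliers(w, bits=4, outlier_frac=0.01):
    """Symmetric round-to-nearest quantization that keeps the
    largest-magnitude fraction of weights in full precision.

    The point: a few extreme values otherwise force a coarse
    quantization step on everything else.
    """
    w = np.asarray(w, dtype=np.float64)
    k = int(outlier_frac * w.size)
    mask = np.zeros(w.size, dtype=bool)
    if k:
        mask[np.argsort(np.abs(w.ravel()))[-k:]] = True

    body = w.ravel().copy()
    body[mask] = 0.0                       # exclude outliers from the scale
    scale = np.abs(body).max() / (2 ** (bits - 1) - 1) or 1.0
    q = np.round(body / scale) * scale     # round-to-nearest on the body
    q[mask] = w.ravel()[mask]              # restore outliers exactly
    return q.reshape(w.shape)

# One extreme weight among 999 small ones.
w = np.concatenate([np.random.default_rng(1).normal(0, 0.02, 999), [3.0]])
err_plain = np.abs(quantize_with_outliers(w, outlier_frac=0.0) - w).mean()
err_kept = np.abs(quantize_with_outliers(w, outlier_frac=0.001) - w).mean()
```

Keeping even one outlier lets the quantization grid track the small-weight body, which is exactly the roundtrip-degradation mechanism the lane worries about.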
Lane C: Tokenizer, vocabulary, and output-side efficiency
Core question: can we reduce the cost of representing text without paying too much in artifact bytes or sequence-length pain?
Why this lane is central:
- the challenge metric is bits per byte
- output layers and vocabulary assets are expensive in small models
- tokenization choices reshape both compute and artifact cost
Important pages:
- Tokenizer and vocabulary efficiency
- Tokenizer efficiency
- ReTok
- Vocabulary Compression
- Tokenizer Evaluation Across Scales
Main open risk:
- a tokenizer idea can look elegant but lose once artifact size, sequence length, and engineering overhead are all counted honestly
What would count as a decisive advance:
- a tokenizer or output-head change that improves final score per byte instead of merely reducing token count in isolation
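The risk above is easy to make concrete with arithmetic. In this hypothetical sketch, a tokenizer that halves token count while doubling per-token loss changes nothing in bits per byte:

```python
import math

def total_bits(n_tokens, nats_per_token):
    # Total cross-entropy in bits for a run of tokens.
    return n_tokens * nats_per_token / math.log(2)

text_bytes = 1000  # hypothetical evaluation text length in UTF-8 bytes

# Tokenizer A: many short tokens, each cheap to predict.
bpb_a = total_bits(n_tokens=250, nats_per_token=2.0) / text_bytes
# Tokenizer B: half as many, longer tokens, each twice as hard to predict.
bpb_b = total_bits(n_tokens=125, nats_per_token=4.0) / text_bytes

# Identical bits per byte: halving token count buys nothing unless
# per-token loss grows by less than a factor of two.
```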
Lane D: Evaluation-time compute
Core question: what capability is cheaper to recompute during evaluation than to store permanently in the artifact?
Why this lane is central:
- the challenge allows nontrivial evaluation behavior within a time cap
- recurrent models become much more interesting if extra unrolling is useful
- tiny models may benefit more from carefully budgeted extra passes than large models do
Main open risk:
- evaluation-time tricks can win on paper but fail the runtime budget or become too brittle to trust
What would count as a decisive advance:
- a compact model whose score improves reliably as evaluation compute increases, with a clear best point still inside the budget
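A toy illustration of the compute-for-bytes trade, using a linear fixed-point iteration as a stand-in for a recurrent model refining its latent state; the dimensions and contraction factor are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32

# A shared block whose repeated application converges to a fixed point.
# Scaling the operator norm below 1 makes the map a contraction.
A = rng.standard_normal((d, d))
A *= 0.5 / np.linalg.norm(A, 2)
b = rng.standard_normal(d)
x_star = np.linalg.solve(np.eye(d) - A, b)  # exact fixed point of x = Ax + b

def unroll(steps):
    # Error after `steps` applications of the shared block from a cold start.
    x = np.zeros(d)
    for _ in range(steps):
        x = A @ x + b
    return np.linalg.norm(x - x_star)

errors = [unroll(s) for s in (2, 8, 32)]
# More evaluation-time steps -> smaller error, until the time budget
# (not the artifact budget) says stop.
```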
Lane E: Training economics and small-model bottlenecks
Core question: which ideas are actually compatible with the challenge’s limited optimization budget?
Why this lane matters:
- not every theoretically good compression method can be trained fast enough
- small models have their own optimization pathologies
- research time should prioritize ideas that converge under the real constraints
Important pages:
- Training economics and bottlenecks
- Computational Bottlenecks of Training SLMs
- Need a Small Specialized LM? Plan Early!
Main open risk:
- spending months on methods whose best-case version is still too slow, too unstable, or too code-heavy for the challenge
What would count as a decisive advance:
- a method that is not only byte-efficient but also easy to train into the relevant quality regime within the contest budget
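A back-of-envelope feasibility check, using the common ~6·N·D FLOPs-per-token approximation for dense transformers; every budget number here is hypothetical:

```python
# Rough training-cost estimate: ~6 FLOPs per parameter per token
# for a dense transformer. All numbers below are hypothetical.
params = 30_000_000          # ~15 MB at 4 bits/param, inside a 16 MB cap
tokens = 10_000_000_000      # training tokens
flops = 6 * params * tokens  # the 6·N·D approximation

gpu_flops_per_s = 1.0e14     # sustained throughput of one hypothetical GPU
hours = flops / gpu_flops_per_s / 3600
# If `hours` exceeds the contest's optimization budget, the idea is
# strategically dead regardless of its byte efficiency.
```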
5. The highest-value interactions between lanes
The challenge is unlikely to be won by a single isolated trick. The most important interactions are:
Recursive sharing × evaluation-time compute
Shared blocks become much more valuable if extra unrolling at evaluation time can recover some of the capacity lost by removing unique depth.
Quantization robustness × selective precision
Outlier control and better normalization set up the model for cheaper storage; selective precision then spends expensive bytes only on the truly fragile parts.
Tokenizer efficiency × output-head efficiency
Smarter text representation can reduce both sequence costs and output-side storage pressure, which can free bytes for the core model.
Training economics × every other lane
A beautiful idea that cannot be optimized quickly enough is strategically weaker than a slightly less elegant idea that reliably lands in the good region under the real budget.
6. The most important unresolved questions
These are the questions that seem to matter most at the current frontier:
- How far can structure sharing go before the loss in specialization outweighs the byte savings?
- Is the best compression path still “dense low-bit plus safeguards,” or does the challenge want a more radical artifact format?
- Are tokenizer changes underexploited, or are they mostly a distraction once full artifact costs are counted?
- Can evaluation-time latent refinement deliver real score gains without blowing the time cap?
- Which small training stabilizers actually survive the final artifact path and not just the floating-point checkpoint?
7. Reading routes through the garden
Route 1: Understand the challenge first
- Challenge overview
- Constraints and scoring
- History and public runs
- How to read the leaderboard and public records
Route 2: Understand the byte-saving architecture case
- Recursive and shared-parameter architectures
- Recursive width scaling
- Recurrent wide architecture
- Relaxed Recursive Transformers
- MoEUT
Route 3: Understand the artifact-robustness case
- Quantization and outliers
- RMSNorm stabilized scaling
- Sparse outlier preservation
- Extra RMSNorm
- pQuant
- QuEST
Route 4: Understand the underexplored frontier
- Public research directions
- Research frontiers
- Research ideas
- Moonshots
- Tokenizer and vocabulary efficiency
- Evaluation-time compute
- Vocabulary Compression
- Inference Scaling Laws
Route 5: Understand what this repository has already tried locally
- Local experiment history
- Local research lineage
- Kept runs and turning points
- Dead ends and failed directions
8. Bottom line
The Parameter Golf research landscape is best understood as a constrained trade study between:
- stored structure
- recoverable structure
- evaluation compute
- artifact robustness
- training efficiency
The central strategic question is always the same:
Which combination of structure sharing, compression robustness, vocabulary design, and evaluation behavior buys the most real score per byte under the actual challenge rules?