This page describes how the challenge appears to be evolving conceptually, based on the challenge rules and the earliest public runs.
Stage 1: “Fit a small language model under 16 MB”
The most naive reading of Parameter Golf is:
train the best small transformer you can, then compress it until it fits.
The public baseline partly lives in this stage, which is useful. A challenge needs an understandable starting point.
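The arithmetic behind that naive reading is worth making concrete. A minimal sketch, assuming the cap means 16 MiB and treating the reserved overhead (code, tokenizer assets, container headers) as a hypothetical fixed number rather than anything from the actual rules:

```python
# Rough parameter-count arithmetic under a 16 MiB byte cap.
# Whether the cap is MB or MiB, and the 512 KiB overhead figure,
# are assumptions for illustration only.
CAP_BYTES = 16 * 1024 * 1024

def max_params(bits_per_weight: float, overhead_bytes: int = 0) -> int:
    """Upper bound on stored parameters at a given bit width,
    after reserving overhead for code and tokenizer assets."""
    usable = CAP_BYTES - overhead_bytes
    return int(usable * 8 // bits_per_weight)

for bits in (16, 8, 4, 2):
    print(bits, max_params(bits, overhead_bytes=512 * 1024))
```

Even this crude table shows why bit width dominates the naive framing: halving the bits roughly doubles the parameter budget, before any cleverness about what those parameters are.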
But the challenge rules immediately make this framing incomplete.
Stage 2: “The artifact is the model”
Once the byte cap, code budget, and post-roundtrip scoring are taken seriously, the real object being optimized is not the floating-point checkpoint.
It is the full evaluated artifact:
- code
- tokenizer-related assets
- compressed model representation
- evaluation behavior under the legal budget
This is the key shift emphasized by Constraints and scoring.
The public runs already support this interpretation: both sit close to the byte cap, and both show nontrivial gaps between their pre-quantization and post-roundtrip scores.
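The challenge's actual roundtrip procedure is not reproduced here; as a generic illustration of where a pre-quantization vs post-roundtrip gap comes from, a symmetric per-tensor int8 quantize-dequantize pass is sketched below on a stand-in weight tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight tensor

def roundtrip_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 quantize -> dequantize.
    A generic scheme for illustration, not the challenge's pipeline."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

w_rt = roundtrip_int8(w)
print(f"max roundtrip error: {np.abs(w - w_rt).max():.4f}")
```

The per-tensor scale is set by the largest-magnitude weight, which is why outliers are a lane of their own: one extreme value inflates the scale and coarsens every other weight in the tensor.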
Stage 3: “Storage, compute, and tokenization become one problem”
Once the artifact is treated as the scored object, several design questions collapse into one coupled optimization problem:
- how many unique weights are stored?
- how expensive is the vocabulary and output head?
- how much can evaluation-time compute recover?
- which tensors deserve protection?
- how much additional training is worth paying for?
This is why the challenge naturally opens into the lane structure already tracked elsewhere in the garden:
- recursive sharing
- quantization and outlier handling
- tokenizer and vocabulary efficiency
- evaluation-time compute
- training economics
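One way to see why these lanes are coupled is to split the byte budget between the vocabulary side and the transformer body. The dimensions below are hypothetical, not taken from any public run; the body estimate assumes a GPT-style block (4·d² attention plus 8·d² for a 4x-width MLP) and ignores norms, biases, and container overhead:

```python
def artifact_bytes(vocab: int, d_model: int, n_layers: int,
                   bits: int = 8, tied_embeddings: bool = True) -> dict:
    """Rough byte split between vocabulary-side and body-side parameters
    for a GPT-style model. Hypothetical accounting for illustration."""
    embed = vocab * d_model * (1 if tied_embeddings else 2)
    body = n_layers * 12 * d_model * d_model  # attn (4*d^2) + 4x MLP (8*d^2)
    to_bytes = lambda n: n * bits // 8
    return {"vocab_side": to_bytes(embed), "body": to_bytes(body)}

print(artifact_bytes(vocab=16_000, d_model=384, n_layers=8))
```

At 8 bits these made-up dimensions already overshoot a 16 MiB cap, which is exactly the pressure that forces vocabulary size, bit width, depth, and sharing to be traded off jointly rather than tuned independently.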
Stage 4: “Baseline scaling is informative but not sufficient”
The unlimited-compute, non-record run adds an important early lesson:
- more training on the baseline family helps
- but it does not obviously remove the need for better artifact design
That is conceptually important because it suggests the challenge may not be won by ordinary scaling instincts alone. A team can push the same baseline farther, but the more distinctive gains may come from methods that better match the artifact bottleneck itself.
Stage 5: “The likely frontier is co-design”
The most plausible frontier, based on challenge framing plus the current literature, is not one isolated trick. It is some form of co-design across:
- shared or recurrent structure
- compression-aware robustness
- output-side / tokenizer discipline
- bounded extra compute
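A toy calculation connects the first and last items: looping the layers of a GPT-style body over a smaller number of shared weight sets shrinks stored bytes while leaving evaluation-time compute unchanged. This counting is a crude stand-in for recursive sharing in general, not any specific paper's method, and the dimensions are hypothetical:

```python
def unique_body_params(n_layers: int, d_model: int, n_weight_sets: int) -> int:
    """Unique stored body parameters when n_layers loop over n_weight_sets
    shared weight sets (attn 4*d^2 + 4x MLP 8*d^2 per set).
    Compute still scales with n_layers; only storage shrinks."""
    assert n_layers % n_weight_sets == 0  # each set reused equally often
    return n_weight_sets * 12 * d_model * d_model

full = unique_body_params(n_layers=8, d_model=384, n_weight_sets=8)
shared = unique_body_params(n_layers=8, d_model=384, n_weight_sets=2)
print(shared / full)  # prints 0.25
```

The trade is bytes for compute: the shared model stores a quarter of the body weights but still executes eight layer applications per token, which is why sharing and bounded extra compute are naturally co-designed.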
Relevant paper trail:
- Relaxed Recursive Transformers
- Fine-grained Parameter Sharing
- Extra RMSNorm
- pQuant
- ReTok
- Inference Scaling Laws
What remains uncertain
The public record is still too small to answer:
- which family wins on the actual leaderboard
- whether tokenizer innovation or recurrent structure matters more
- whether evaluation-time compute becomes central or remains niche
- how much of the future frontier is training-recipe improvement versus artifact redesign
So the safest historical claim today is not “we know the winning recipe.”
It is:
the challenge has already evolved from a small-model contest into a byte-aware joint optimization problem, but the public run archive is still too young to reveal the final dominant style.