This page describes how the challenge appears to be evolving conceptually, based on the challenge rules and the earliest public runs.
Stage 1: “Fit a small language model under 16 MB”
The most naive reading of Parameter Golf is:
train the best small transformer you can, then compress it until it fits.
The public baseline partly lives in this stage, which is useful. A challenge needs an understandable starting point.
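The arithmetic behind that naive reading is worth making concrete. A minimal sketch, assuming the cap means 16 MiB and treating the reserved overhead (code, tokenizer assets, container headers) as a hypothetical fixed number rather than anything from the actual rules:

```python
# Rough parameter-count arithmetic under a 16 MiB byte cap.
# Whether the cap is MB or MiB, and the 512 KiB overhead figure,
# are assumptions for illustration only.
CAP_BYTES = 16 * 1024 * 1024

def max_params(bits_per_weight: float, overhead_bytes: int = 0) -> int:
    """Upper bound on stored parameters at a given bit width,
    after reserving overhead for code and tokenizer assets."""
    usable = CAP_BYTES - overhead_bytes
    return int(usable * 8 // bits_per_weight)

for bits in (16, 8, 4, 2):
    print(bits, max_params(bits, overhead_bytes=512 * 1024))
```

Even this crude table shows why bit width dominates the naive framing: halving the bits roughly doubles the parameter budget, before any cleverness about what those parameters are.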
But the challenge rules immediately make this framing incomplete.
Stage 2: “The artifact is the model”
Once the byte cap, code budget, and post-roundtrip scoring are taken seriously, the real object being optimized is not the floating-point checkpoint.
It is the full evaluated artifact:
- code
- tokenizer-related assets
- compressed model representation
- evaluation behavior under the legal budget
This is the key shift emphasized by Constraints and scoring.
The public runs already support this interpretation: both sit close to the byte cap, and both show nontrivial gaps between their pre-quantization and post-roundtrip scores.
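The challenge's actual roundtrip procedure is not reproduced here; as a generic illustration of where a pre-quantization vs post-roundtrip gap comes from, a symmetric per-tensor int8 quantize-dequantize pass is sketched below on a stand-in weight tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight tensor

def roundtrip_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 quantize -> dequantize.
    A generic scheme for illustration, not the challenge's pipeline."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

w_rt = roundtrip_int8(w)
print(f"max roundtrip error: {np.abs(w - w_rt).max():.4f}")
```

The per-tensor scale is set by the largest-magnitude weight, which is why outliers are a lane of their own: one extreme value inflates the scale and coarsens every other weight in the tensor.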
Stage 3: “Storage, compute, and tokenization become one problem”
Once the artifact is treated as the scored object, several design questions collapse into one coupled optimization problem:
- how many unique weights are stored?
- how expensive is the vocabulary and output head?
- how much can evaluation-time compute recover?
- which tensors deserve protection?
- how much additional training is worth paying for?
This is why the challenge naturally opens into the lane structure already tracked elsewhere in the garden:
- recursive sharing
- quantization and outlier handling
- tokenizer and vocabulary efficiency
- evaluation-time compute
- training economics
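One way to see why these lanes are coupled is to split the byte budget between the vocabulary side and the transformer body. The dimensions below are hypothetical, not taken from any public run; the body estimate assumes a GPT-style block (4·d² attention plus 8·d² for a 4x-width MLP) and ignores norms, biases, and container overhead:

```python
def artifact_bytes(vocab: int, d_model: int, n_layers: int,
                   bits: int = 8, tied_embeddings: bool = True) -> dict:
    """Rough byte split between vocabulary-side and body-side parameters
    for a GPT-style model. Hypothetical accounting for illustration."""
    embed = vocab * d_model * (1 if tied_embeddings else 2)
    body = n_layers * 12 * d_model * d_model  # attn (4*d^2) + 4x MLP (8*d^2)
    to_bytes = lambda n: n * bits // 8
    return {"vocab_side": to_bytes(embed), "body": to_bytes(body)}

print(artifact_bytes(vocab=16_000, d_model=384, n_layers=8))
```

At 8 bits these made-up dimensions already overshoot a 16 MiB cap, which is exactly the pressure that forces vocabulary size, bit width, depth, and sharing to be traded off jointly rather than tuned independently.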
Stage 4: “Baseline scaling is informative but not sufficient”
The unlimited-compute, non-record run adds an important early lesson:
- more training on the baseline family helps
- but it does not obviously remove the need for better artifact design
That is conceptually important because it suggests the challenge may not be won by ordinary scaling instincts alone. A team can push the same baseline farther, but the more distinctive gains may come from methods that better match the artifact bottleneck itself.
Stage 5: “The likely frontier is co-design”
The most plausible frontier, based on challenge framing plus the current literature, is not one isolated trick. It is some form of co-design across:
- shared or recurrent structure
- compression-aware robustness
- output-side / tokenizer discipline
- bounded extra compute
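A toy calculation connects the first and last items: looping the layers of a GPT-style body over a smaller number of shared weight sets shrinks stored bytes while leaving evaluation-time compute unchanged. This counting is a crude stand-in for recursive sharing in general, not any specific paper's method, and the dimensions are hypothetical:

```python
def unique_body_params(n_layers: int, d_model: int, n_weight_sets: int) -> int:
    """Unique stored body parameters when n_layers loop over n_weight_sets
    shared weight sets (attn 4*d^2 + 4x MLP 8*d^2 per set).
    Compute still scales with n_layers; only storage shrinks."""
    assert n_layers % n_weight_sets == 0  # each set reused equally often
    return n_weight_sets * 12 * d_model * d_model

full = unique_body_params(n_layers=8, d_model=384, n_weight_sets=8)
shared = unique_body_params(n_layers=8, d_model=384, n_weight_sets=2)
print(shared / full)  # prints 0.25
```

The trade is bytes for compute: the shared model stores a quarter of the body weights but still executes eight layer applications per token, which is why sharing and bounded extra compute are naturally co-designed.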
Relevant paper trail:
- Relaxed Recursive Transformers
- Fine-grained Parameter Sharing
- Extra RMSNorm
- pQuant
- ReTok
- Inference Scaling Laws
What remains uncertain
The public record is still too small to answer:
- which family wins on the actual leaderboard
- whether tokenizer innovation or recurrent structure matters more
- whether evaluation-time compute becomes central or remains niche
- how much of the future frontier is training-recipe improvement versus artifact redesign
So the safest historical claim today is not “we know the winning recipe.”
It is:
the challenge has already evolved from a small-model contest into a byte-aware joint optimization problem, but the public run archive is still too young to reveal the final dominant style.