The Parameter Golf challenge is not a generic “small model” contest. It asks for a language model whose entire self-contained submission artifact fits under 16,000,000 bytes, trains within a fixed budget, and is judged by tokenizer-agnostic bits per byte on a fixed evaluation set.
That framing changes the problem from “how many parameters can we afford?” to “what buys the most language-model quality per stored byte?”
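To make the metric concrete, here is a minimal sketch of tokenizer-agnostic bits per byte. The function name and the example numbers are illustrative, not the official scoring code; the key idea is that the model's total negative log-likelihood (however it tokenizes) is normalized by the byte length of the evaluation text, so a clever tokenizer cannot inflate the score.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Tokenizer-agnostic score: total negative log-likelihood of the
    evaluation text (in nats, summed over whatever tokens the model
    uses), divided by the text's UTF-8 byte count, converted from
    nats to bits. Lower is better."""
    return total_nll_nats / (total_bytes * math.log(2))

# A model that assigns 8 * ln(2) nats of total NLL to a 1-byte text
# scores exactly 8.0 bits per byte -- no better than storing raw bytes.
print(bits_per_byte(8 * math.log(2), 1))  # prints 8.0
```

Because the denominator is bytes rather than tokens, shrinking the token count by using a larger vocabulary does not by itself improve the score; only genuinely better prediction does.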
What the challenge is really optimizing
A strong submission is not just a checkpoint. It is a bundle of choices that have to work together:
- model architecture
- tokenizer or byte-handling strategy
- training recipe
- export and compression path
- evaluation behavior under the allowed time budget
In other words, Parameter Golf is closer to a joint systems + modeling + compression problem than a pure architecture leaderboard.
Why the design space bends so hard
Because the hard cap is on the final artifact, the challenge strongly rewards ideas that:
- reduce the number of unique weights rather than only the number of nominal parameters
- make stored weights easier to quantize and compress
- buy capability with compute at inference time rather than with stored bytes
- control vocabulary and output-head overhead
- preserve quality through the exact serialization and decompression path used for submission
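The bullet about making weights "easier to quantize and compress" can be seen in miniature below. This is a toy sketch, not the challenge's serialization path: the matrix size, the symmetric int8 scheme, and the use of zlib are all assumptions chosen for illustration. The point is that the scored object is the compressed byte stream, and quantized weights occupy far fewer bytes than their float32 originals.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # toy weight matrix

# Bytes needed to store the raw float32 weights, compressed.
# Random float32 data is nearly incompressible.
raw_bytes = len(zlib.compress(w.tobytes(), level=9))

# Symmetric int8 quantization: map weights to 255 levels.
# Fewer unique byte values -> smaller raw size and better compression.
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)
quant_bytes = len(zlib.compress(q.tobytes(), level=9))

print(f"float32 compressed: {raw_bytes} B, int8 compressed: {quant_bytes} B")
```

A real entry would also need to verify that the dequantized weights still score well, which is exactly why "quality after the serialization path" is listed as a first-class design constraint rather than an afterthought.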
That is why the most relevant research lanes are:
- recursive and shared-parameter architectures
- quantization, outliers, and compression-aware training
- tokenizer and vocabulary efficiency
- evaluation-time compute
- training economics and small-model bottlenecks
The core mental shift
Many habits from normal LM work become misleading here:
- lower pre-quantization loss is not automatically a win
- larger raw parameter count is not automatically a win
- better-looking uncompressed checkpoints are not automatically a win
- “just quantize it later” is often too weak a strategy
The score only cares about the model that actually survives the full artifact path and still performs well within the challenge constraints.
Challenge-facing questions this garden tracks
This section of the garden is meant to answer questions like:
- What exactly do the byte cap and evaluation rules reward?
- How should public runs and leaderboard claims be interpreted?
- Which directions already look crowded, and which still look open?
- Where do local results translate cleanly to the public challenge, and where do they not?
Pages in this section
- Constraints and scoring
- History and public runs
- How to read the leaderboard and public records
- Local benchmark vs official evaluation
- Public research directions
- Challenge history
- Local experiment history
One-sentence summary
A good Parameter Golf entry is a model whose stored bytes, training recipe, compression behavior, and evaluation strategy are all co-designed so that the final 16 MB artifact is unusually strong on the actual challenge metric.