The Parameter Golf challenge is not a generic “small model” contest. It asks for a language model whose entire self-contained submission artifact fits under 16,000,000 bytes, trains within a fixed budget, and is judged by tokenizer-agnostic bits per byte on a fixed evaluation set.
That framing changes the problem from “how many parameters can we afford?” to “what buys the most language-model quality per stored byte?”
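To make the metric concrete, here is a minimal sketch of tokenizer-agnostic bits per byte. The function name and the example numbers are illustrative, not the official scoring code; the key idea is that the model's total negative log-likelihood (however it tokenizes) is normalized by the byte length of the evaluation text, so a clever tokenizer cannot inflate the score.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Tokenizer-agnostic score: total negative log-likelihood of the
    evaluation text (in nats, summed over whatever tokens the model
    uses), divided by the text's UTF-8 byte count, converted from
    nats to bits. Lower is better."""
    return total_nll_nats / (total_bytes * math.log(2))

# A model that assigns 8 * ln(2) nats of total NLL to a 1-byte text
# scores exactly 8.0 bits per byte -- no better than storing raw bytes.
print(bits_per_byte(8 * math.log(2), 1))  # prints 8.0
```

Because the denominator is bytes rather than tokens, shrinking the token count by using a larger vocabulary does not by itself improve the score; only genuinely better prediction does.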
What the challenge is really optimizing
A strong submission is not just a checkpoint. It is a bundle of choices that have to work together:
- model architecture
- tokenizer or byte-handling strategy
- training recipe
- export and compression path
- evaluation behavior under the allowed time budget
In other words, Parameter Golf is closer to a joint systems + modeling + compression problem than a pure architecture leaderboard.
Why the design space bends so hard
Because the hard cap is on the final artifact, the challenge strongly rewards ideas that:
- reduce the number of unique weights rather than only the number of nominal parameters
- make stored weights easier to quantize and compress
- buy capability with compute at inference time rather than with stored bytes
- control vocabulary and output-head overhead
- preserve quality through the exact serialization and decompression path used for submission
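The bullet about making weights "easier to quantize and compress" can be seen in miniature below. This is a toy sketch, not the challenge's serialization path: the matrix size, the symmetric int8 scheme, and the use of zlib are all assumptions chosen for illustration. The point is that the scored object is the compressed byte stream, and quantized weights occupy far fewer bytes than their float32 originals.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # toy weight matrix

# Bytes needed to store the raw float32 weights, compressed.
# Random float32 data is nearly incompressible.
raw_bytes = len(zlib.compress(w.tobytes(), level=9))

# Symmetric int8 quantization: map weights to 255 levels.
# Fewer unique byte values -> smaller raw size and better compression.
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)
quant_bytes = len(zlib.compress(q.tobytes(), level=9))

print(f"float32 compressed: {raw_bytes} B, int8 compressed: {quant_bytes} B")
```

A real entry would also need to verify that the dequantized weights still score well, which is exactly why "quality after the serialization path" is listed as a first-class design constraint rather than an afterthought.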
That is why the most relevant research lanes are:
- recursive and shared-parameter architectures
- quantization, outliers, and compression-aware training
- tokenizer and vocabulary efficiency
- evaluation-time compute
- training economics and small-model bottlenecks
The core mental shift
Many habits from normal LM work become misleading here:
- lower pre-quantization loss is not automatically a win
- larger raw parameter count is not automatically a win
- better-looking uncompressed checkpoints are not automatically a win
- “just quantize it later” is often too weak a strategy
The score only cares about the model that actually survives the full artifact path and still performs well within the challenge constraints.
Challenge-facing questions this garden tracks
This section of the garden is meant to answer questions like:
- What exactly do the byte cap and evaluation rules reward?
- How should public runs and leaderboard claims be interpreted?
- Which directions already look crowded, and which still look open?
- Where do local results translate cleanly to the public challenge, and where do they not?
Pages in this section
- Constraints and scoring
- History and public runs
- How to read the leaderboard and public records
- Local benchmark vs official evaluation
- Public research directions
- Challenge history
- Local experiment history
One-sentence summary
A good Parameter Golf entry is a model whose stored bytes, training recipe, compression behavior, and evaluation strategy are all co-designed so that the final 16 MB artifact is unusually strong on the actual challenge metric.