The Parameter Golf challenge is not a generic “small model” contest. It asks for a language model whose entire self-contained submission artifact fits under 16,000,000 bytes, trains within a fixed budget, and is judged by tokenizer-agnostic bits per byte on a fixed evaluation set.
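The scoring metric is worth making concrete. A minimal sketch of a tokenizer-agnostic bits-per-byte computation (the function name `bits_per_byte` and the toy uniform-byte model are illustrative, not part of the official harness):

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a model's total NLL (in nats) over `text` into
    tokenizer-agnostic bits per byte: divide by ln(2) to get bits,
    then by the UTF-8 byte length of the evaluated text."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / math.log(2) / n_bytes

# Sanity check: a model that assigns uniform probability 1/256 to
# every byte has NLL of ln(256) nats per byte, i.e. ~8.0 bits/byte.
text = "hello, parameter golf"
nll = len(text.encode("utf-8")) * math.log(256)
print(bits_per_byte(nll, text))  # ≈ 8.0
```

Because the denominator is bytes of evaluation text rather than tokens, a submission cannot lower its score by choosing a coarser tokenizer; that is what makes the metric tokenizer-agnostic.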

That framing changes the problem from “how many parameters can we afford?” to “what buys the most language-model quality per stored byte?”

What the challenge is really optimizing

A strong submission is not just a checkpoint. It is a bundle of choices that have to work together:

  • model architecture
  • tokenizer or byte-handling strategy
  • training recipe
  • export and compression path
  • evaluation behavior under the allowed time budget

In other words, Parameter Golf is closer to a joint systems + modeling + compression problem than a pure architecture leaderboard.
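A back-of-envelope byte budget shows how tightly those choices couple. The vocabulary sizes, width, and int8 storage below are hypothetical, chosen only to illustrate how fast the embedding table and output head can consume the 16,000,000-byte cap:

```python
CAP = 16_000_000  # artifact cap in bytes

def embedding_bytes(vocab_size: int, d_model: int,
                    bytes_per_weight: float, tied: bool) -> int:
    """Bytes for the input embedding, plus the output head when untied."""
    tables = 1 if tied else 2
    return int(vocab_size * d_model * bytes_per_weight * tables)

# Hypothetical configs: a GPT-2-sized vocab vs a small custom vocab,
# stored at 1 byte per weight (int8), untied vs tied output head.
for vocab, tied in [(50_257, False), (50_257, True), (4_096, True)]:
    b = embedding_bytes(vocab, d_model=512, bytes_per_weight=1.0, tied=tied)
    print(f"vocab={vocab:>6} tied={tied!s:>5}: {b:>11,} bytes "
          f"({100 * b / CAP:.1f}% of cap)")
```

Under these toy numbers, an untied 50k-entry vocabulary at width 512 exceeds the entire cap on its own, while a tied 4k vocabulary uses roughly an eighth of it; this is why tokenizer strategy and architecture cannot be chosen independently.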

Why the design space bends so hard

Because the hard cap is on the final artifact, the challenge strongly rewards ideas that:

  • reduce the number of unique weights rather than only the number of nominal parameters
  • make stored weights easier to quantize and compress
  • buy capability with compute rather than with stored bytes
  • control vocabulary and output-head overhead
  • preserve quality after the exact serialization and decompression path
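The "easier to quantize and compress" and "fewer unique weights" levers can be measured directly: a general-purpose compressor recovers far more from low-precision or weight-shared storage than from raw floats. A minimal sketch with synthetic weights, using numpy and zlib as stand-ins for a real export path:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=250_000).astype(np.float32)  # ~1 MB raw fp32

def stored_size(arr: np.ndarray) -> int:
    """Bytes after zlib, a stand-in for the real compression step."""
    return len(zlib.compress(arr.tobytes(), level=9))

# Symmetric per-tensor int8: fewer distinct byte values, so the
# entropy coder bites much harder than it does on raw fp32 bits.
scale = np.abs(w).max() / 127.0
q8 = np.round(w / scale).astype(np.int8)

# Toy weight sharing: snap each weight to one of 16 shared values and
# store only the index stream (uint8 here for simplicity) plus the table.
centers = np.quantile(w, np.linspace(0, 1, 16)).astype(np.float32)
idx = np.abs(w[:, None] - centers[None, :]).argmin(axis=1).astype(np.uint8)

print("fp32 + zlib   :", stored_size(w), "bytes")
print("int8 + zlib   :", stored_size(q8), "bytes")
print("16-way + zlib :", stored_size(idx) + centers.nbytes, "bytes")
```

The point of the sketch is the ordering, not the exact numbers: cutting the number of unique stored values shrinks the post-compression artifact even when the nominal parameter count is unchanged.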

That is why the most relevant research lanes are the ones that act on those levers directly: weight sharing and tying, quantization- and compression-aware training, architectures that trade stored bytes for compute, vocabulary and output-head design, and robust serialization paths.

The core mental shift

Many habits from normal LM work become misleading here:

  • lower pre-quantization loss is not automatically a win
  • larger raw parameter count is not automatically a win
  • better-looking uncompressed checkpoints are not automatically a win
  • “just quantize it later” is often too weak a strategy

The score only cares about the model that actually survives the full artifact path and still performs well within the challenge constraints.
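The first bullet can be made concrete with a toy experiment: a weight tensor that looks fine in float32 can export badly if a few outliers inflate the per-tensor quantization scale. All numbers here are synthetic; this sketches the failure mode, not any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

def int8_roundtrip_error(w: np.ndarray) -> float:
    """Mean abs error after symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return float(np.abs(q * scale - w).mean())

smooth = rng.normal(0, 0.02, 10_000)
spiky = smooth.copy()
spiky[:5] = 2.0  # a handful of outlier weights blow up the scale

print(int8_roundtrip_error(smooth))
print(int8_roundtrip_error(spiky))
```

The spiky tensor round-trips with error an order of magnitude worse than the smooth one, even though the two differ in only five weights; a checkpoint with slightly higher float loss but outlier-free weights can therefore score better after export.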

Challenge-facing questions this garden tracks

This section of the garden is meant to answer questions like:

  • What exactly do the byte cap and evaluation rules reward?
  • How should public runs and leaderboard claims be interpreted?
  • Which directions already look crowded, and which still look open?
  • Where do local results translate cleanly to the public challenge, and where do they not?

One-sentence summary

A good Parameter Golf entry is a model whose stored bytes, training recipe, compression behavior, and evaluation strategy are all co-designed so that the final 16 MB artifact is unusually strong on the actual challenge metric.