Parameter Golf is shaped less by abstract model quality than by a few hard constraints. Those constraints are the reason the research frontier looks unusual.
Hard constraints that define the game
At a high level, the public challenge requires that:
- the full submission artifact stays under 16,000,000 bytes
- the artifact is self-contained
- evaluation does not rely on external downloads or network access
- serious submissions respect the official training and evaluation time budgets
- the final score is computed on the actual artifact that is evaluated, not on an earlier floating-point checkpoint
This is what makes the challenge fundamentally different from “train the best tiny transformer you can.”
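The byte-cap constraint, at least, can be checked mechanically before submitting. A minimal sketch, assuming the artifact is a directory on disk (the cap value comes from the list above; the helper names are mine):

```python
import os

BYTE_CAP = 16_000_000  # the public challenge's artifact limit

def artifact_bytes(path: str) -> int:
    """Total size of every file in the submission directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def fits_cap(path: str) -> bool:
    """True when the whole artifact, not just the weights, is under the cap."""
    return artifact_bytes(path) <= BYTE_CAP
```

The point of summing over every file is exactly the point of the constraint: the cap applies to the full artifact, so a check that only looks at the weight file is not a check at all.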
What the score actually rewards
The score is reported in bits per byte on a fixed validation set. That matters because it directs attention toward what is actually deployed and evaluated:
- the compressed artifact that fits under the cap
- the model recovered from that artifact
- the behavior of that recovered model during scored evaluation
So the wrong targets are:
- nominal parameter count by itself
- raw checkpoint size by itself
- pre-export validation loss by itself
- “quality before quantization” by itself
The right target is the quality of the actual scored artifact.
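The conversion behind the metric is standard: summed cross-entropy in nats divided by ln(2) times the number of bytes scored. A minimal sketch (the function name is mine):

```python
import math

def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a validation set
    into bits per byte of the underlying text."""
    return total_loss_nats / (math.log(2) * n_bytes)

# Sanity check: a uniform model over 256 byte values pays ln(256) nats
# per byte, which is 8 bits per byte.
bits_per_byte(math.log(256) * 1000, 1000)  # ≈ 8.0
```

Note what the arguments imply: the loss must come from the recovered, post-export model, and the byte count is the validation set's, not the artifact's.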
What counts against the byte budget
The cap changes design incentives because it is not only “model weights.” In practice, the budget pressure falls on:
- compressed model bytes
- code needed to run the submission
- tokenizer or vocabulary-related assets when they are part of the artifact
- any extra machinery used to make evaluation legal and self-contained
That means a clever method can fail the challenge if it needs too much code, too much metadata, or too many stored exceptions.
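Because every component listed above draws from the same budget, it helps to account for bytes per category rather than as one total. A sketch, with a hypothetical extension-to-category mapping (real submissions will differ):

```python
import os
from collections import defaultdict

# Hypothetical mapping for illustration only; adapt to the real layout.
CATEGORIES = {
    ".bin": "weights", ".safetensors": "weights",
    ".py": "code",
    ".json": "metadata", ".txt": "metadata",
    ".model": "tokenizer", ".vocab": "tokenizer",
}

def budget_breakdown(path: str) -> dict:
    """Bytes per component category: everything counts against the cap."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(path):
        for name in files:
            ext = os.path.splitext(name)[1]
            totals[CATEGORIES.get(ext, "other")] += os.path.getsize(
                os.path.join(root, name))
    return dict(totals)
```

A breakdown like this makes the failure mode above visible early: a method whose "code" or "metadata" rows grow faster than its "weights" row shrinks is losing the trade.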
Why Parameter Golf is not the same as parameter counting
A model can have many effective layers and still be byte-cheap if it reuses structure well. Conversely, a model can have a modest parameter count and still be wasteful if it stores too many unique tensors or uses a brittle export path.
The challenge therefore favors:
- Parameter reuse
  - recurrent depth
  - cross-layer sharing
  - shared bases with small layer-specific corrections
- Selective precision
  - most weights cheap
  - fragile weights protected
  - outliers handled deliberately instead of uniformly
- Tokenizer and head discipline
  - fewer evaluation-time symbols without paying too much artifact cost
  - smaller or smarter output-side representations
- Compute-for-bytes trades
  - recovering capability by spending extra evaluation compute instead of storing more unique weights
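The "shared bases with small layer-specific corrections" idea can be made concrete with a toy sketch: materialize each layer's weight as a shared base matrix plus a low-rank per-layer correction, and count how many unique floats must be stored versus the dense alternative. All names and dimensions here are illustrative, not from the challenge:

```python
import numpy as np

def build_layers_shared(base, corrections):
    """Materialize per-layer weights W_l = base + U_l @ V_l from one shared
    base matrix and small low-rank per-layer corrections."""
    return [base + u @ v for u, v in corrections]

rng = np.random.default_rng(0)
d, rank, n_layers = 512, 8, 12
base = rng.standard_normal((d, d)).astype(np.float32)
corrections = [
    (rng.standard_normal((d, rank)).astype(np.float32),
     rng.standard_normal((rank, d)).astype(np.float32))
    for _ in range(n_layers)
]

# Stored floats: one base plus 12 rank-8 corrections vs 12 dense matrices.
unique_floats_shared = base.size + sum(u.size + v.size for u, v in corrections)
unique_floats_dense = n_layers * d * d
```

This is why the text distinguishes byte cost from parameter count: the shared model still has twelve effective layers, but it stores roughly a tenth of the unique floats of the dense version.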
What credible progress looks like
A result is challenge-relevant when it improves the scored artifact without breaking the hard constraints. In practice, that means the most believable improvements are the ones that show:
- better post-roundtrip quality
- acceptable artifact bytes with clear headroom or better byte usage
- acceptable runtime under the challenge budget
- enough evidence that the gain is real rather than an artifact of proxy noise
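"Post-roundtrip quality" means measuring on the tensor that survives export and recovery, not the floating-point original. A minimal sketch using symmetric per-tensor int8 quantization as a stand-in for whatever export path a submission actually uses:

```python
import numpy as np

def roundtrip_int8(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 quantize-then-dequantize, so quality can
    be measured on what would actually be shipped."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w_hat = roundtrip_int8(w)
max_err = float(np.max(np.abs(w - w_hat)))
# Any quality number reported against the challenge should be computed
# with w_hat in place, because w_hat is what the scored artifact contains.
```

Evaluating with the roundtripped weights is the difference between the first common mistake below and a credible result.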
That is why leaderboard-style claims should be read alongside bytes, runtime, and reporting quality rather than as a single scalar.
Common mistakes in reading results
- treating pre-quant quality as if it were the real score
- ignoring code or metadata overhead because the weight file looks small
- assuming dense uniqueness is better than structured reuse
- dismissing evaluation-time compute even though the challenge permits it within a cap
- confusing a local proxy win with a public challenge win