Parameter Golf is easy to misread if you only look at a headline score. The challenge is constrained enough that a public run should be read as a bundle of tradeoffs, not just a single scalar result.
First question: what number is actually being reported?
The most important distinction is whether a claim is about:
- a floating-point model before export
- a proxy benchmark
- the final scored artifact after the real roundtrip path
Only the last one deserves to be read as directly challenge-facing.
Second question: what bytes are being counted?
A result is much more believable when it makes the byte story explicit:
- total artifact bytes
- model bytes
- code bytes
- any extra metadata or side assets
A run that reports only parameter count or only raw checkpoint size is missing the central constraint of the challenge.
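The byte story above is just arithmetic, so it can be made mechanical. As a minimal sketch, assuming a hypothetical 64 KiB budget and illustrative field names (not a real challenge schema):

```python
# Hypothetical sketch: explicit artifact-byte accounting for a run report.
# The 64 KiB budget and all field names are assumptions for illustration.

BUDGET_BYTES = 64 * 1024  # assumed total artifact budget

def account_bytes(model_bytes: int, code_bytes: int, metadata_bytes: int) -> dict:
    """Return a full byte breakdown and whether the artifact fits the budget."""
    total = model_bytes + code_bytes + metadata_bytes
    return {
        "model_bytes": model_bytes,
        "code_bytes": code_bytes,
        "metadata_bytes": metadata_bytes,
        "total_bytes": total,
        "fits_budget": total <= BUDGET_BYTES,
        "headroom_bytes": BUDGET_BYTES - total,
    }
```

A run that only reports `model_bytes` leaves `total_bytes`, the number the challenge actually constrains, unknowable.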
Third question: where is the quality coming from?
When reading a public record, ask whether the gain is mostly coming from:
- better pre-export training
- a smaller pre→post export drop
- more efficient use of the output side or tokenizer
- more evaluation-time compute
- better byte allocation through sharing or selective precision
Those mechanisms imply very different levels of robustness and reproducibility.
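The first two mechanisms above can be separated with a small identity: any post-roundtrip gain over a baseline splits exactly into a pre-export training gain plus a shrinkage of the export drop. A hedged sketch, with all scores being illustrative assumptions:

```python
# Hypothetical sketch: splitting a post-roundtrip gain into "better
# pre-export training" vs "a smaller pre->post export drop".
# All scores here are assumed numbers, not real results.

def attribute_gain(baseline_pre: float, baseline_post: float,
                   run_pre: float, run_post: float) -> dict:
    """Decompose the post-roundtrip gain into training and export components."""
    training_gain = run_pre - baseline_pre        # pre-export improvement
    baseline_drop = baseline_pre - baseline_post  # baseline export drop
    run_drop = run_pre - run_post                 # new run's export drop
    export_gain = baseline_drop - run_drop        # positive if the drop shrank
    return {
        "total_gain": run_post - baseline_post,
        "training_gain": training_gain,
        "export_gain": export_gain,  # training_gain + export_gain == total_gain
    }
```

The decomposition is exact, so a writeup that reports both pre- and post-export scores for the run and its baseline has already disclosed which mechanism did the work.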
Fourth question: how much runtime headroom is left?
Parameter Golf is not just a storage challenge. Evaluation, and any serious training claim, must also fit within a time budget.
So a run is stronger when it shows:
- the artifact fits the byte budget comfortably, or at least legally
- the score survives the real runtime budget
- any evaluation-time trick is cheap enough to remain deployable within the rules
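A runtime story can be stated just as mechanically as a byte story. As a sketch under assumed numbers (the 600-second budget and field names are hypothetical):

```python
# Hypothetical sketch: checking that the scored path, including any
# evaluation-time trick, still fits the time budget.
# The 600 s budget and names are assumptions for illustration.

TIME_BUDGET_S = 600.0  # assumed evaluation wall-clock budget

def runtime_story(base_eval_s: float, trick_overhead_s: float) -> dict:
    """Report total runtime, budget legality, and the trick's cost share."""
    total = base_eval_s + trick_overhead_s
    return {
        "total_runtime_s": total,
        "within_budget": total <= TIME_BUDGET_S,
        "trick_share": trick_overhead_s / total if total else 0.0,
    }
```

A trick whose `trick_share` dominates the budget may score well once and still be undeployable within the rules.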
Fifth question: is the improvement actually claim-sized?
In a noisy research environment, the interesting question is not “did the number move?” but:
- did it move enough to matter?
- was the gain repeated?
- is the comparison being made on the same evaluation path?
- is the score improvement buying anything after the full artifact roundtrip?
This is especially important in Parameter Golf because tiny cosmetic changes can improve proxy metrics while doing nothing for the real challenge objective.
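One way to make "claim-sized" concrete is to require the mean gain over repeated runs to clear both a materiality threshold and the run-to-run noise. The thresholds and the two-sigma rule below are assumptions, not a challenge standard:

```python
# Hypothetical sketch: is an improvement claim-sized, or just visible?
# The 0.01 materiality floor and 2-sigma noise rule are assumed conventions.
from statistics import mean, stdev

def claim_sized(baseline_runs: list[float], new_runs: list[float],
                min_gain: float = 0.01) -> bool:
    """True if the mean gain both matters and clears repeated-run noise."""
    gain = mean(new_runs) - mean(baseline_runs)
    noise = max(stdev(baseline_runs), stdev(new_runs))
    return gain >= min_gain and gain >= 2 * noise
```

Crucially, both run lists must come from the same evaluation path; comparing a proxy score against a post-roundtrip score makes the test meaningless.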
A practical checklist for reading any public run
Before treating a result as important, check for:
- post-roundtrip challenge-facing metric
- explicit artifact-byte accounting
- runtime story
- comparison against a sensible baseline
- explanation of the mechanism, not just the number
- enough evidence that the result is not a one-off proxy artifact
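The checklist above can be applied mechanically to any run record. A sketch, where the record fields are illustrative assumptions rather than a real leaderboard schema:

```python
# Hypothetical sketch: mechanically applying the reading checklist to a
# public run record. Field names are assumptions, not a real schema.

CHECKLIST = [
    "post_roundtrip_metric",
    "byte_accounting",
    "runtime_story",
    "baseline_comparison",
    "mechanism_explanation",
    "repeated_evidence",
]

def missing_evidence(run_record: dict) -> list[str]:
    """Return the checklist items a run record fails to document."""
    return [item for item in CHECKLIST if not run_record.get(item)]
```

An empty return value does not prove the result is important, but a non-empty one tells you exactly which question the writeup left unanswered.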
Typical ways public records get over-read
- treating a proxy result as a leaderboard result
- ignoring code and metadata overhead
- praising a model for raw size reduction without asking what happened to quality
- assuming recurrence or evaluation-time compute is “cheating” even when the rules allow it
- assuming a more complex export is better without asking what its byte and runtime overhead cost
What a strong leaderboard-style writeup should communicate
The best challenge-facing reports usually make five things legible at once:
- score — what improved on the real objective
- bytes — how the artifact budget was spent
- runtime — whether the idea fits the time budget
- mechanism — why the gain should generalize
- evidence — why the result should be believed
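Those five parts can be thought of as one record, legible only when every field is filled in. A minimal sketch with assumed field names:

```python
# Hypothetical sketch: the five things a strong writeup makes legible,
# captured as one record. Field names are illustrative assumptions.
from dataclasses import dataclass, asdict

@dataclass
class RunReport:
    score: str       # what improved on the real objective
    bytes_story: str # how the artifact budget was spent
    runtime: str     # whether the idea fits the time budget
    mechanism: str   # why the gain should generalize
    evidence: str    # why the result should be believed

    def is_legible(self) -> bool:
        """Legible only when all five parts are non-empty."""
        return all(asdict(self).values())
```

A report with any field blank is the over-read case from earlier: a number without the tradeoffs that produced it.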