This page is about the public challenge record and the visible research storyline around it. It is not a private experiment log; for the repository’s own denser chronology, use local experiment history.
Public snapshot: March 20, 2026
As of March 20, 2026, the public surface is no longer sparse.
The visible record now includes:
- 14 main-track record folders on the upstream `main` branch
- 1 official non-record run on the upstream `main` branch
- 133 open PR-backed public submissions
- 36 recently closed PR-backed public submissions
- 7 ahead-of-upstream fork refs visible through public branch comparison
That matters because the challenge now has two different public tempos:
- the slower, more conservative official accepted-record surface
- the much faster PR frontier, where ideas become public before they are accepted or normalized into the README spotlight list
This is also why the public board needs to be read with more care than it did on day one: some of the strongest PR-side claims are val-only or otherwise not directly comparable to the accepted main-track record path.
Why a history page matters
Parameter Golf moves fast, and the obvious ideas can look novel if they are rediscovered in isolation. A good history page helps separate:
- what the baseline already taught everyone
- what kinds of official runs or PRs have already made a direction legible
- what still looks underexplored in the open record
Phase 0: the baseline clarified the real objective
The earliest public record made several things obvious:
- the challenge is about the final artifact, not an uncompressed checkpoint
- post-export degradation is large enough to matter
- the budget is tight enough that bytes must be treated as a first-class design resource
- even code bytes are worth tracking because the margin is not infinite
The baseline also established the challenge’s basic mental model: a respectable small transformer can be competitive, but only if its export path survives the roundtrip cleanly.
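The roundtrip concern can be made concrete with a toy export path. This is a hedged sketch, not the challenge's actual exporter: `export_roundtrip` is a hypothetical symmetric per-tensor quantizer, and the degradation it measures is in weight space, not in the scored metric.

```python
import numpy as np

def export_roundtrip(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor quantize-then-dequantize, standing in for an export path."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
w_rt = export_roundtrip(w, bits=4)
# Post-export degradation: the average weight perturbation the export introduces.
degradation = float(np.mean(np.abs(w - w_rt)))
```

The point of the sketch is only that the perturbation is nonzero and grows as the bit budget shrinks, which is why "survives the roundtrip cleanly" is a design constraint rather than an afterthought.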
Phase 1: the field shifted from “small dense model” to “bytes-aware model”
Once the rules were taken seriously, the center of gravity moved away from plain dense scaling and toward techniques like:
- tied embeddings and output-side discipline
- shared depth or recurrent blocks
- compression-aware regularization
- quantization-aware stabilization
- export-side heuristics for preserving the most fragile weights
That shift is important historically because it marks the point where Parameter Golf stopped looking like standard small-model training and started looking like a specialized artifact-optimization problem.
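One way to read "export-side heuristics for preserving the most fragile weights" is outlier-aware mixed precision: quantize the bulk of a tensor at low bit-width while keeping a small fraction of large-magnitude weights exact. The sketch below is illustrative; `outlier_aware_export` and its parameters are assumptions, not the challenge's actual code.

```python
import numpy as np

def outlier_aware_export(w: np.ndarray, bits: int = 4,
                         outlier_frac: float = 0.01) -> np.ndarray:
    """Quantize most weights to a low-bit grid, but keep the largest-magnitude
    (most fragile) fraction in full precision, as if stored sparsely on the side."""
    flat = w.ravel().astype(np.float64)
    k = max(1, int(outlier_frac * flat.size))
    outlier_idx = np.argpartition(np.abs(flat), -k)[-k:]
    body = flat.copy()
    body[outlier_idx] = 0.0  # exclude outliers before choosing the scale
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(body).max() / qmax, 1e-12)
    deq = np.clip(np.round(body / scale), -qmax, qmax) * scale
    deq[outlier_idx] = flat[outlier_idx]  # restore fragile weights exactly
    return deq.reshape(w.shape)
```

Because the scale is chosen after removing outliers, the low-bit grid is much finer, so the bulk of the tensor survives export with far less error than a naive single-scale quantizer; the cost is the extra bytes needed to store the outliers on the side, which is exactly the kind of bookkeeping a bytes-aware entry has to justify.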
Phase 2: evaluation protocol and adaptation became public levers
The next visible shift was that the score path itself became a research object. In parallel, it became clear that the export path is not just bookkeeping: it is part of the model.
That encouraged work on:
- roundtrip-aware training objectives
- normalization and scaling changes that reduce export fragility
- outlier-aware or sensitivity-aware weight treatment
- direct attempts to shrink the gap between floating-point quality and scored-artifact quality
- evaluation protocol changes such as sliding-window evaluation
- test-time adaptation such as LoRA TTT and stronger test-time training variants
This is the moment where quantization and outliers stopped being a deployment concern and became a core modeling concern.
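The quantization-aware items above share one mechanism: train against the exported weights rather than the full-precision ones. A minimal straight-through-estimator sketch on a toy least-squares problem shows the shape of it; every name and number here is illustrative, not taken from the challenge codebase.

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 5) -> np.ndarray:
    """Symmetric quantize-dequantize, mimicking what the exported artifact sees."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Toy QAT loop: fit y = x @ w by least squares, but the forward pass always
# uses the quantized weights; gradients update the full-precision "shadow"
# weights via the straight-through estimator (treat d fake_quant / d w as 1).
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 8))
w_true = rng.normal(size=8)
y = x @ w_true
w = np.zeros(8)  # full-precision shadow weights
for _ in range(200):
    w_q = fake_quant(w, bits=5)        # forward pass uses exported weights
    err = x @ w_q - y
    grad = x.T @ err / len(x)          # gradient w.r.t. w_q, applied to w
    w -= 0.1 * grad
loss_exported = float(np.mean((x @ fake_quant(w, bits=5) - y) ** 2))
```

The loss that matters at the end is the one computed through `fake_quant`, i.e. the scored-artifact quality, which is the gap Phase 2 work tries to close.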
The official README spotlight already captured this shift with:
- Naive Baseline
- 4-Hour Baseline
- Sliding Window Eval
- LoRA TTT
- long-context `2048` and `4k` sequence runs
Phase 3: March 20 turned the official board into a quantization stack race
The most visible official acceleration happened on March 20, 2026.
By that point, the accepted main-track public runs on `main` were no longer just baseline-style references. The top accepted band had become dominated by stacks that combined:
- low-bit quantization (`int5`, `int6`, QAT, mixed precision)
- schedule and optimizer tuning (Muon, weight decay, SWA)
- widening or reshaping the feedforward path (`MLP3x`, `MLP2.6x`)
- output-side or token-side tricks such as `BigramHash`
- targeted initialization or stabilization choices (`OrthoInit`, `SmearGate`, tuned scaling)
The best accepted main-track score visible on `main` as of March 20, 2026 was:
`1.1428` val_bpb, from `10L Int5-MLP + BigramHash(10240)` by `thwu1`
That matters historically because it changed what “serious public competition” now looks like. The official record is no longer baseline-plus-one-trick; it is already a stacked compression-and-training package.
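The scores quoted on this page are in val_bpb. Assuming the standard definition of bits-per-byte (cross-entropy in nats, converted to bits, normalized by raw text bytes rather than by tokens), the metric is:

```python
import math

def val_bpb(total_nll_nats: float, total_bytes: int) -> float:
    """Bits-per-byte: total negative log-likelihood in nats, converted to
    bits (divide by ln 2), normalized by the number of raw text bytes scored."""
    return total_nll_nats / math.log(2) / total_bytes

# Illustrative numbers only: 1,000,000 bytes of validation text scored at a
# total NLL of 792,000 nats works out to about 1.14 bits per byte.
score = val_bpb(792_000.0, 1_000_000)
```

One reason byte normalization matters is that it keeps scores comparable across different tokenizers, which is also part of why tokenizer redesign shows up below as an open direction.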
Phase 4: the PR frontier moved ahead of the accepted board
The open PR frontier is now a research surface in its own right.
Publicly visible examples on March 20, 2026 included:
- val-only runs below `1.0` (`0.9588` closed PR, `0.9695` open PR variant)
- paid-prefix attempts around `1.02` to `1.05`
- TTT-heavy variants around `1.13`
- partial-attention / XSA-style variants around `1.13`
- a wide band of `1.14`–`1.16` records built from different quantization + schedule + head/token stacks
The important historical point is not that every such PR will survive review. It is that the public conversation has already widened beyond the accepted board.
What the public record has covered already
The open challenge conversation has made at least these directions clearly legible:
- dense baseline models with careful export
- low-bit quantization as the dominant public lane
- optimizer / schedule tuning as a central supporting lever
- sliding-window evaluation as a mainstream public trick rather than a niche idea
- token/output-side engineering such as BigramHash and related vocabulary-head discipline
- test-time training as a real public family rather than a hypothetical edge case
Anyone claiming novelty should assume those areas are already part of the public conversation unless the contribution is genuinely more specific.
What still appears relatively open in the public record
The following directions still look less saturated or less settled:
- officially accepted shared-depth / recurrent records; recurrence is visible in PRs and forks, but not yet established on the accepted board
- output-head or prefix-memory ideas that survive the standard evaluation path rather than only appearing as frontier PR claims
- tokenizer redesign aimed directly at the challenge metric, not only token-side hash tricks layered on standard vocabularies
- deeper evaluation-time memory or latent refinement that remains comparable to the standard scoring path
- artifact formats and selective precision schemes that justify their bookkeeping cleanly enough to become accepted, not just interesting
That does not guarantee they are unexplored in private, only that they still feel comparatively open in the visible record.
What public runs have already taught, even without a final winner
A few durable lessons seem clear:
- Artifact-aware quality is the real battleground. A model that looks great before export can still lose the challenge.
- The board is now stack-shaped. Publicly serious runs combine quantization, schedule tuning, token/output-side tricks, and eval protocol choices rather than one isolated novelty.
- Comparability is already a research problem. Open PR claims can be stronger numerically than accepted runs while still being harder to compare because of val-only settings or altered evaluation paths.
- Recurrence is public but not yet officially dominant. Shared-depth and recurrent ideas are clearly in the conversation, but the accepted board is still more quantization-stack-heavy than recurrence-heavy.
- The frontier is compositional. Winning ideas will likely combine quantization robustness, token/output-side efficiency, and budget-aware evaluation rather than rely on a single gimmick.
How to use this page
Use this page to keep the challenge narrative straight:
- start here if you want context for what has already been publicly legible
- use the leaderboard guide to interpret claims correctly
- use public research directions to see where the pressure is pushing next
- use the atlas for the full research map