This page is about the public challenge record and the visible research storyline around it. It is not a private experiment log; for the repository’s own denser chronology, use local experiment history.
Public snapshot: March 20, 2026
As of March 20, 2026, the public surface is no longer sparse.
The visible record now includes:
- 14 main-track record folders on the upstream `main` branch
- 1 official non-record run on the upstream `main` branch
- 133 open PR-backed public submissions
- 36 recently closed PR-backed public submissions
- 7 ahead-of-upstream fork refs visible through public branch comparison
That matters because the challenge now has two different public tempos:
- the slower, more conservative official accepted-record surface
- the much faster PR frontier, where ideas become public before they are accepted or normalized into the README spotlight list
This is also why the public board needs to be read with more care than it did on day one: some of the strongest PR-side claims are val-only or otherwise not directly comparable to the accepted main-track record path.
Why a history page matters
Parameter Golf moves fast, and the obvious ideas can look novel if they are rediscovered in isolation. A good history page helps separate:
- what the baseline already taught everyone
- what kinds of official runs or PRs have already made a direction legible
- what still looks underexplored in the open record
Phase 0: the baseline clarified the real objective
The earliest public record made several things obvious:
- the challenge is about the final artifact, not an uncompressed checkpoint
- post-export degradation is large enough to matter
- the budget is tight enough that bytes must be treated as a first-class design resource
- even code bytes are worth tracking because the margin is not infinite
The baseline also established the challenge’s basic mental model: a respectable small transformer can be competitive, but only if its export path survives the roundtrip cleanly.
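The roundtrip concern can be made concrete with a toy export path. This is a hedged sketch, not the challenge's actual exporter: `export_roundtrip` is a hypothetical symmetric per-tensor quantizer, and the degradation it measures is in weight space, not in the scored metric.

```python
import numpy as np

def export_roundtrip(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor quantize-then-dequantize, standing in for an export path."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
w_rt = export_roundtrip(w, bits=4)
# Post-export degradation: the average weight perturbation the export introduces.
degradation = float(np.mean(np.abs(w - w_rt)))
```

The point of the sketch is only that the perturbation is nonzero and grows as the bit budget shrinks, which is why "survives the roundtrip cleanly" is a design constraint rather than an afterthought.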
Phase 1: the field shifted from “small dense model” to “bytes-aware model”
Once the rules were taken seriously, the center of gravity moved away from plain dense scaling and toward techniques like:
- tied embeddings and output-side discipline
- shared depth or recurrent blocks
- compression-aware regularization
- quantization-aware stabilization
- export-side heuristics for preserving the most fragile weights
That shift is important historically because it marks the point where Parameter Golf stopped looking like standard small-model training and started looking like a specialized artifact-optimization problem.
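One way to read "export-side heuristics for preserving the most fragile weights" is outlier-aware mixed precision: quantize the bulk of a tensor at low bit-width while keeping a small fraction of large-magnitude weights exact. The sketch below is illustrative; `outlier_aware_export` and its parameters are assumptions, not the challenge's actual code.

```python
import numpy as np

def outlier_aware_export(w: np.ndarray, bits: int = 4,
                         outlier_frac: float = 0.01) -> np.ndarray:
    """Quantize most weights to a low-bit grid, but keep the largest-magnitude
    (most fragile) fraction in full precision, as if stored sparsely on the side."""
    flat = w.ravel().astype(np.float64)
    k = max(1, int(outlier_frac * flat.size))
    outlier_idx = np.argpartition(np.abs(flat), -k)[-k:]
    body = flat.copy()
    body[outlier_idx] = 0.0  # exclude outliers before choosing the scale
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(body).max() / qmax, 1e-12)
    deq = np.clip(np.round(body / scale), -qmax, qmax) * scale
    deq[outlier_idx] = flat[outlier_idx]  # restore fragile weights exactly
    return deq.reshape(w.shape)
```

Because the scale is chosen after removing outliers, the low-bit grid is much finer, so the bulk of the tensor survives export with far less error than a naive single-scale quantizer; the cost is the extra bytes needed to store the outliers on the side, which is exactly the kind of bookkeeping a bytes-aware entry has to justify.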
Phase 2: evaluation protocol and adaptation became public levers
The next visible shift was that the score path itself became a research object. In parallel, it became clear that the export path is not just bookkeeping: it is part of the model.
That encouraged work on:
- roundtrip-aware training objectives
- normalization and scaling changes that reduce export fragility
- outlier-aware or sensitivity-aware weight treatment
- direct attempts to shrink the gap between floating-point quality and scored-artifact quality
- evaluation protocol changes such as sliding-window evaluation
- test-time adaptation such as LoRA TTT and stronger test-time training variants
This is the moment where quantization and outliers stopped being a deployment concern and became a core modeling concern.
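The quantization-aware items above share one mechanism: train against the exported weights rather than the full-precision ones. A minimal straight-through-estimator sketch on a toy least-squares problem shows the shape of it; every name and number here is illustrative, not taken from the challenge codebase.

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 5) -> np.ndarray:
    """Symmetric quantize-dequantize, mimicking what the exported artifact sees."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Toy QAT loop: fit y = x @ w by least squares, but the forward pass always
# uses the quantized weights; gradients update the full-precision "shadow"
# weights via the straight-through estimator (treat d fake_quant / d w as 1).
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 8))
w_true = rng.normal(size=8)
y = x @ w_true
w = np.zeros(8)  # full-precision shadow weights
for _ in range(200):
    w_q = fake_quant(w, bits=5)        # forward pass uses exported weights
    err = x @ w_q - y
    grad = x.T @ err / len(x)          # gradient w.r.t. w_q, applied to w
    w -= 0.1 * grad
loss_exported = float(np.mean((x @ fake_quant(w, bits=5) - y) ** 2))
```

The loss that matters at the end is the one computed through `fake_quant`, i.e. the scored-artifact quality, which is the gap Phase 2 work tries to close.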
The official README spotlight already captured this shift with:
- Naive Baseline
- 4-Hour Baseline
- Sliding Window Eval
- LoRA TTT
- long-context `2048` and `4k` sequence runs
Phase 3: March 20 turned the official board into a quantization stack race
The most visible official acceleration happened on March 20, 2026.
By that point, the accepted main-track public runs on `main` were no longer just baseline-style references. The top accepted band had become dominated by stacks that combined:
- low-bit quantization (`int5`, `int6`, QAT, mixed precision)
- schedule and optimizer tuning (Muon, weight decay, SWA)
- widening or reshaping the feedforward path (`MLP3x`, `MLP2.6x`)
- output-side or token-side tricks such as `BigramHash`
- targeted initialization or stabilization choices (`OrthoInit`, `SmearGate`, tuned scaling)
The best accepted main-track score visible on `main` as of March 20, 2026 was:
`1.1428` val_bpb, from `10L Int5-MLP + BigramHash(10240)` by `thwu1`
That matters historically because it changed what “serious public competition” now looks like. The official record is no longer baseline-plus-one-trick; it is already a stacked compression-and-training package.
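The scores quoted on this page are in val_bpb. Assuming the standard definition of bits-per-byte (cross-entropy in nats, converted to bits, normalized by raw text bytes rather than by tokens), the metric is:

```python
import math

def val_bpb(total_nll_nats: float, total_bytes: int) -> float:
    """Bits-per-byte: total negative log-likelihood in nats, converted to
    bits (divide by ln 2), normalized by the number of raw text bytes scored."""
    return total_nll_nats / math.log(2) / total_bytes

# Illustrative numbers only: 1,000,000 bytes of validation text scored at a
# total NLL of 792,000 nats works out to about 1.14 bits per byte.
score = val_bpb(792_000.0, 1_000_000)
```

One reason byte normalization matters is that it keeps scores comparable across different tokenizers, which is also part of why tokenizer redesign shows up below as an open direction.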
Phase 4: the PR frontier moved ahead of the accepted board
The open PR frontier is now a research surface in its own right.
Publicly visible examples on March 20, 2026 included:
- val-only runs below `1.0` (`0.9588` closed PR, `0.9695` open PR variant)
- paid-prefix attempts around `1.02` to `1.05`
- TTT-heavy variants around `1.13`
- partial-attention / XSA-style variants around `1.13`
- a wide band of `1.14`–`1.16` records built from different quantization + schedule + head/token stacks
The important historical point is not that every such PR will survive review. It is that the public conversation has already widened beyond the accepted board.
What the public record has covered already
The open challenge conversation has made at least these directions clearly legible:
- dense baseline models with careful export
- low-bit quantization as the dominant public lane
- optimizer / schedule tuning as a central supporting lever
- sliding-window evaluation as a mainstream public trick rather than a niche idea
- token/output-side engineering such as BigramHash and related vocabulary-head discipline
- test-time training as a real public family rather than a hypothetical edge case
Anyone claiming novelty should assume those areas are already part of the public conversation unless the contribution is genuinely more specific.
What still appears relatively open in the public record
The following directions still look less saturated or less settled:
- officially accepted shared-depth / recurrent records; recurrence is visible in PRs and forks, but not yet established on the accepted board
- output-head or prefix-memory ideas that survive the standard evaluation path rather than only appearing as frontier PR claims
- tokenizer redesign aimed directly at the challenge metric, not only token-side hash tricks layered on standard vocabularies
- deeper evaluation-time memory or latent refinement that remains comparable to the standard scoring path
- artifact formats and selective precision schemes that justify their bookkeeping cleanly enough to become accepted, not just interesting
That does not guarantee they are unexplored in private, only that they still feel comparatively open in the visible record.
What public runs have already taught, even without a final winner
A few durable lessons seem clear:
- Artifact-aware quality is the real battleground. A model that looks great before export can still lose the challenge.
- The board is now stack-shaped. Publicly serious runs combine quantization, schedule tuning, token/output-side tricks, and eval protocol choices rather than one isolated novelty.
- Comparability is already a research problem. Open PR claims can be stronger numerically than accepted runs while still being harder to compare because of val-only settings or altered evaluation paths.
- Recurrence is public but not yet officially dominant. Shared-depth and recurrent ideas are clearly in the conversation, but the accepted board is still more quantization-stack-heavy than recurrence-heavy.
- The frontier is compositional. Winning ideas will likely combine quantization robustness, token/output-side efficiency, and budget-aware evaluation rather than rely on a single gimmick.
How to use this page
Use this page to keep the challenge narrative straight:
- start here if you want context for what has already been publicly legible
- use the leaderboard guide to interpret claims correctly
- use public research directions to see where the pressure is pushing next
- use the atlas for the full research map