This page summarizes the public 4-Hour Quasi-10B SP1024 non-record submission.
Hard public facts
From the upstream public record:
- track: non-record-unlimited-compute-16mb
- date: 2026-03-18
- reported score: val_bpb = 1.20737944
- total submission size: 15,810,161 bytes
- code size: 47,642 bytes
- compressed model size (int8 + zlib): 15,762,519 bytes
- layout summary: still the same broad baseline family: 9 layers, model dim 512, vocab 1024, 4 KV heads
- embeddings: tied input/output embeddings
- wallclock budget: 14,400 seconds (4 hours), explicitly outside the main 10-minute cutoff
- stopping point: 329,430 steps before the wallclock cap ended the run
- best pre-quant metric at stop: val_bpb = 1.1749
- post-roundtrip metric: val_bpb = 1.2074
The record README explicitly says this run was not intended to satisfy the 10-minute cutoff for the main leaderboard.
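The int8 + zlib artifact path listed above can be sketched in a few lines. This is an illustrative sketch only: the symmetric per-tensor scale scheme, the zlib level, and the toy tensor shape are assumptions, not details from the public submission.

```python
import zlib
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative scheme)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def compressed_size(weights: np.ndarray) -> int:
    """Bytes occupied by the int8 tensor after zlib compression."""
    q, _ = quantize_int8(weights)
    return len(zlib.compress(q.tobytes(), level=9))

# Toy example: a random matrix roughly the size of one 512x512 weight block.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
print(compressed_size(w), "bytes after int8 + zlib")
```

The point of the sketch is just that the scored artifact is the compressed int8 payload, so the 15,762,519-byte figure reflects both the quantization and how compressible the resulting bytes are.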
Why this run matters
This is the first publicly visible example of an important distinction inside Parameter Golf:
the best artifact family and the best training budget are not the same question.
The run keeps the artifact family almost unchanged while relaxing the training-time constraint. That makes it useful as an early public probe of how much headroom still exists inside the simple baseline design.
Conceptual read
1. More compute does buy improvement
Relative to the public main-track baseline, val_bpb improves from 1.2244 to 1.2074, a gain of about 0.017 bpb.
That is real progress, but it is not a conceptual overthrow. Publicly, this still looks like “baseline family, trained longer” rather than a new architecture class.
2. Compression robustness remains central
The pre-quant / post-roundtrip gap is substantial here:
- pre-quant: val_bpb = 1.1749
- post-roundtrip: val_bpb = 1.2074
- roundtrip degradation: about 0.0325 bpb
That is a strong public reminder that longer training alone does not remove the importance of compression-aware robustness.
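The roundtrip gap above can be made concrete with a minimal sketch. The `roundtrip_int8` helper and the symmetric quantization scheme are assumptions for illustration; only the two bpb figures come from the public record.

```python
import numpy as np

def roundtrip_int8(w: np.ndarray) -> np.ndarray:
    """Quantize to int8 and dequantize, simulating the stored artifact."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

# The reported gap, computed directly from the two published metrics:
pre_quant_bpb = 1.1749
post_roundtrip_bpb = 1.2074
degradation = post_roundtrip_bpb - pre_quant_bpb
print(f"roundtrip degradation: {degradation:.4f} bpb")
```

Evaluating the model once with float weights and once with the dequantized weights is what separates the 1.1749 and 1.2074 numbers; a robustness-aware training recipe would aim to shrink that gap rather than just the pre-quant score.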
3. Training-economics questions become visible very quickly
This run makes training economics concrete. If a much longer run on the same broad architecture family yields a modest but not revolutionary gain, then future public progress may depend less on “just train longer” and more on changing what the artifact is.
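A rough back-of-the-envelope using only the published numbers makes the economics visible. The per-doubling framing is an illustrative assumption, not a claimed scaling law for this task.

```python
import math

# Published figures for the two public runs.
baseline_budget_s = 600       # main-track 10-minute cutoff
long_run_budget_s = 14_400    # this 4-hour run
baseline_bpb = 1.2244
long_run_bpb = 1.2074

compute_multiple = long_run_budget_s / baseline_budget_s   # 24x wallclock
gain_bpb = baseline_bpb - long_run_bpb                     # 0.0170 bpb
doublings = math.log2(compute_multiple)                    # ~4.58 doublings
print(f"{compute_multiple:.0f}x compute -> {gain_bpb:.4f} bpb, "
      f"~{gain_bpb / doublings:.4f} bpb per doubling")
```

Roughly 24x the wallclock buying under 0.02 bpb is exactly the pattern the section describes: real gains, but steeply diminishing ones inside the same artifact family.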
What this run still does not settle
Even though it is better than the main baseline, it does not publicly settle:
- whether recurrence/shared depth beats dense unique depth under the same artifact cap
- whether selective precision beats uniform post-hoc compression
- whether a more radical tokenizer or head redesign changes the frontier
- whether evaluation-time compute becomes a decisive lever
In other words, this run expands the baseline story; it does not yet close the main strategic questions.
Most natural lane connections
This run connects most directly to:
- training economics and small-model bottlenecks
- quantization and outlier handling
- tokenizer and vocabulary efficiency
Helpful paper context:
Again, those links are conceptual background, not proof that the public run implemented those ideas.