This page summarizes the public 4-Hour Quasi-10B SP1024 non-record submission.

Hard public facts

From the upstream public record:

  • track: non-record-unlimited-compute-16mb
  • date: 2026-03-18
  • reported score: val_bpb = 1.20737944
  • total submission size: 15,810,161 bytes
  • code size: 47,642 bytes
  • compressed model size (int8 + zlib): 15,762,519 bytes
  • layout summary: the same broad baseline family as before (9 layers, model dim 512, vocab size 1024, 4 KV heads)
  • embeddings: tied input/output embeddings
  • wallclock budget: 14,400 seconds (4 hours), explicitly outside the main 10-minute cutoff
  • stopping point: 329,430 steps, at which the wallclock cap ended the run
  • best pre-quant metric at stop: val_bpb = 1.1749
  • post-roundtrip metric: val_bpb = 1.2074
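
The public record states only the compression format (int8 + zlib), not the exact scheme, so the symmetric per-tensor scaling below is an assumption. A minimal sketch of what such a roundtrip looks like:

```python
import zlib

import numpy as np


def int8_zlib_roundtrip(weights):
    # Symmetric per-tensor int8 quantization: one scale per tensor,
    # values rounded into [-127, 127]. This exact scheme is an
    # assumption; the record only names "int8 + zlib".
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    # zlib on the raw int8 bytes, matching the stated compression step.
    blob = zlib.compress(q.tobytes(), level=9)
    dequantized = q.astype(np.float32) * scale
    return blob, dequantized, scale


# Toy tensor standing in for one weight matrix; the shape is illustrative only.
w = np.random.default_rng(0).normal(size=(512, 512)).astype(np.float32)
blob, w_hat, scale = int8_zlib_roundtrip(w)
print(len(blob), w.nbytes, float(np.abs(w - w_hat).max()))
```

The roundtrip error per weight is bounded by half the quantization step, which is where the pre-quant vs post-roundtrip gap discussed below comes from.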

The README in the public record explicitly states that this run was not intended to satisfy the 10-minute cutoff for the main leaderboard.

Why this run matters

This is the first publicly visible example of an important distinction inside Parameter Golf:

the best artifact family and the best training budget are not the same question.

The run keeps the artifact family almost unchanged while relaxing the training-time constraint. That makes it useful as an early public probe of how much headroom still exists inside the simple baseline design.

Conceptual read

1. More compute does buy improvement

Relative to the public main-track baseline, val_bpb improves from 1.2244 to 1.2074 (lower is better).
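
In numeric terms, the gain above is small in both absolute and relative terms; a quick check using the two public scores:

```python
baseline_bpb = 1.2244  # public main-track baseline val_bpb
this_run_bpb = 1.2074  # this run, post-roundtrip val_bpb

gain = baseline_bpb - this_run_bpb
print(f"absolute: {gain:.4f} bpb, relative: {100 * gain / baseline_bpb:.2f}%")
```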

That is real progress, but it is not a conceptual overthrow. Publicly, this still looks like “baseline family, trained longer” rather than a new architecture class.

2. Compression robustness remains central

The pre-quant / post-roundtrip gap is substantial here:

  • pre-quant: 1.1749
  • post-roundtrip: 1.2074
  • roundtrip degradation: about 0.0325 bpb

That is a strong public reminder that longer training alone does not remove the importance of compression-aware robustness.
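
A quick comparison, using only numbers already on this page, shows why the gap matters: the roundtrip cost is roughly twice the gain this run achieved over the main-track baseline.

```python
pre_quant = 1.1749       # best pre-quant val_bpb at the stopping point
post_roundtrip = 1.2074  # val_bpb after the int8 + zlib roundtrip
baseline = 1.2244        # public main-track baseline val_bpb

roundtrip_cost = post_roundtrip - pre_quant
training_gain = baseline - post_roundtrip
print(f"roundtrip cost: {roundtrip_cost:.4f} bpb")
print(f"gain over baseline: {training_gain:.4f} bpb")
print(f"cost / gain ratio: {roundtrip_cost / training_gain:.2f}")
```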

3. Training-economics questions quickly become visible

This run makes training economics concrete. If a much longer run on the same broad architecture family yields a modest but not revolutionary gain, then future public progress may depend less on “just train longer” and more on changing what the artifact is.

What this run still does not settle

Even though it is better than the main baseline, it does not publicly settle:

  • whether recurrence/shared depth beats dense unique depth under the same artifact cap
  • whether selective precision beats uniform post-hoc compression
  • whether a more radical tokenizer or head redesign changes the frontier
  • whether evaluation-time compute becomes a decisive lever

In other words, this run expands the baseline story; it does not yet close the main strategic questions.

Most natural lane connections

This run connects most directly to:

Helpful paper context:

Again, those links are conceptual background, not proof that the public run implemented those ideas.