This page summarizes the public 4-Hour Quasi-10B SP1024 non-record submission.
Hard public facts
From the upstream public record:
- track: non-record-unlimited-compute-16mb
- date: 2026-03-18
- reported score: val_bpb = 1.20737944
- total submission size: 15,810,161 bytes
- code size: 47,642 bytes
- compressed model size (int8 + zlib): 15,762,519 bytes
- layout summary: still the same broad baseline family: 9 layers, model dim 512, vocab 1024, 4 KV heads
- embeddings: tied input/output embeddings
- wallclock budget: 14,400 seconds (4 hours), explicitly outside the main 10-minute cutoff
- stopping point: 329,430 steps before the wallclock cap ended the run
- best pre-quant metric at stop: val_bpb = 1.1749
- post-roundtrip metric: val_bpb = 1.2074
The record README explicitly says this run was not intended to satisfy the 10-minute cutoff for the main leaderboard.
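The int8 + zlib artifact path listed above can be sketched in a few lines. This is an illustrative sketch only: the symmetric per-tensor scale scheme, the zlib level, and the toy tensor shape are assumptions, not details from the public submission.

```python
import zlib
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative scheme)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def compressed_size(weights: np.ndarray) -> int:
    """Bytes occupied by the int8 tensor after zlib compression."""
    q, _ = quantize_int8(weights)
    return len(zlib.compress(q.tobytes(), level=9))

# Toy example: a random matrix roughly the size of one 512x512 weight block.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
print(compressed_size(w), "bytes after int8 + zlib")
```

The point of the sketch is just that the scored artifact is the compressed int8 payload, so the 15,762,519-byte figure reflects both the quantization and how compressible the resulting bytes are.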
Why this run matters
This is the first publicly visible example of an important distinction inside Parameter Golf:
the best artifact family and the best training budget are not the same question.
The run keeps the artifact family almost unchanged while relaxing the training-time constraint. That makes it useful as an early public probe of how much headroom still exists inside the simple baseline design.
Conceptual read
1. More compute does buy improvement
Relative to the public main-track baseline, val_bpb improves from 1.2244 to 1.2074, a gain of about 0.017 bpb.
That is real progress, but it is not a conceptual overthrow. Publicly, this still looks like “baseline family, trained longer” rather than a new architecture class.
2. Compression robustness remains central
The pre-quant / post-roundtrip gap is substantial here:
- pre-quant: val_bpb = 1.1749
- post-roundtrip: val_bpb = 1.2074
- roundtrip degradation: about 0.0325 bpb
That is a strong public reminder that longer training alone does not remove the importance of compression-aware robustness.
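The roundtrip gap above can be made concrete with a minimal sketch. The `roundtrip_int8` helper and the symmetric quantization scheme are assumptions for illustration; only the two bpb figures come from the public record.

```python
import numpy as np

def roundtrip_int8(w: np.ndarray) -> np.ndarray:
    """Quantize to int8 and dequantize, simulating the stored artifact."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

# The reported gap, computed directly from the two published metrics:
pre_quant_bpb = 1.1749
post_roundtrip_bpb = 1.2074
degradation = post_roundtrip_bpb - pre_quant_bpb
print(f"roundtrip degradation: {degradation:.4f} bpb")
```

Evaluating the model once with float weights and once with the dequantized weights is what separates the 1.1749 and 1.2074 numbers; a robustness-aware training recipe would aim to shrink that gap rather than just the pre-quant score.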
3. Training-economics questions become visible very quickly
This run makes training economics concrete. If a much longer run on the same broad architecture family yields a modest but not revolutionary gain, then future public progress may depend less on “just train longer” and more on changing what the artifact is.
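A rough back-of-the-envelope using only the published numbers makes the economics visible. The per-doubling framing is an illustrative assumption, not a claimed scaling law for this task.

```python
import math

# Published figures for the two public runs.
baseline_budget_s = 600       # main-track 10-minute cutoff
long_run_budget_s = 14_400    # this 4-hour run
baseline_bpb = 1.2244
long_run_bpb = 1.2074

compute_multiple = long_run_budget_s / baseline_budget_s   # 24x wallclock
gain_bpb = baseline_bpb - long_run_bpb                     # 0.0170 bpb
doublings = math.log2(compute_multiple)                    # ~4.58 doublings
print(f"{compute_multiple:.0f}x compute -> {gain_bpb:.4f} bpb, "
      f"~{gain_bpb / doublings:.4f} bpb per doubling")
```

Roughly 24x the wallclock buying under 0.02 bpb is exactly the pattern the section describes: real gains, but steeply diminishing ones inside the same artifact family.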
What this run still does not settle
Even though it is better than the main baseline, it does not publicly settle:
- whether recurrence/shared depth beats dense unique depth under the same artifact cap
- whether selective precision beats uniform post-hoc compression
- whether a more radical tokenizer or head redesign changes the frontier
- whether evaluation-time compute becomes a decisive lever
In other words, this run expands the baseline story; it does not yet close the main strategic questions.
Most natural lane connections
This run connects most directly to:
- training economics and small-model bottlenecks
- quantization and outlier handling
- tokenizer and vocabulary efficiency
Helpful paper context:
Again, those links are conceptual background, not proof that the public run implemented those ideas.