This page summarizes the public Naive Baseline record.
Hard public facts
From the upstream public record:
- track: main 10-minute / 16 MB leaderboard
- date: 2026-03-18
- reported score: val_bpb = 1.22436570
- total submission size: 15,863,489 bytes
- code size: 47,642 bytes
- compressed model size (int8 + zlib): 15,815,847 bytes
- layout summary: 9 layers, model dim 512, vocab 1024, 4 KV heads
- embeddings: tied input/output embeddings
- training budget: wallclock-capped at 600 seconds on 8xH100
- stopping point: 13,780 steps before the wallclock cap ended the run
- pre-quant metric at stop: val_bpb = 1.2172
- post-roundtrip metric: val_bpb = 1.2244
The record README describes it as a simple baseline using the published fineweb10B_sp1024 dataset/tokenizer export and the current train_gpt.py snapshot.
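As a rough sanity check, the reported sizes are consistent with a back-of-envelope parameter count for the stated layout. The sketch below assumes a GQA split of 8 query heads over the 4 reported KV heads with head_dim 64, a 4x MLP, and ignores norms and biases; none of these inner dimensions appear in the public record, so they are illustrative guesses only.

```python
# Hypothetical parameter count for the reported layout.
# N_Q_HEADS, HEAD_DIM, and MLP_MULT are assumptions, NOT public facts.
D, LAYERS, VOCAB = 512, 9, 1024
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 8, 4, 64   # assumed GQA split
MLP_MULT = 4                                  # assumed expansion factor

embed = VOCAB * D                             # paid once: embeddings are tied
attn = D * (N_Q_HEADS * HEAD_DIM)             # Q projection
attn += 2 * D * (N_KV_HEADS * HEAD_DIM)       # K and V projections
attn += (N_Q_HEADS * HEAD_DIM) * D            # output projection
mlp = 2 * D * (MLP_MULT * D)                  # up + down projections
total = embed + LAYERS * (attn + mlp)

print(f"~{total:,} params -> ~{total:,} bytes at int8, before zlib")
```

Under these guesses the model is roughly 26M parameters, i.e. ~26 MB at int8; squeezing that to the reported 15,815,847 bytes would imply a zlib ratio near 0.6, which is plausible for quantized weights but not something the record confirms.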
Why this run matters
This is the first thing a history page needs: a credible anchor point.
The run's value is not that it proves an optimal design. It is that it gives a clean, reproducible answer to one question:
What does a legal, straightforward, leaderboard-visible Parameter Golf submission look like?
The answer, at least initially, is: a compact tied-embedding transformer with a small vocabulary, modest width, and a byte budget spent almost entirely on the compressed model itself.
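To make the "byte budget spent on the model" framing concrete, here is a small sketch of what embedding tying alone saves at this layout, assuming 1 byte per parameter at int8 (the 1-byte figure follows from the int8 storage in the record; everything else here is arithmetic on the published dimensions):

```python
# Byte saving from tying input/output embeddings at vocab 1024, dim 512,
# assuming 1 byte per parameter (int8 storage, pre-zlib).
VOCAB, D = 1024, 512

untied = 2 * VOCAB * D   # separate input embedding and LM head matrices
tied = VOCAB * D         # one shared matrix

print(f"untied: {untied:,} B, tied: {tied:,} B, saved: {untied - tied:,} B")
```

At this small vocabulary the absolute saving is about half a megabyte; with a GPT-2-scale vocabulary the same trick would save tens of megabytes, which is part of why vocabulary size and tying interact so strongly under a hard cap.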
Conceptual read
The baseline highlights several challenge realities at once:
1. Vocabulary size is a first-order design choice
A 1024-token vocabulary is not just an implementation detail. Under a hard artifact cap, it also helps control the cost of the embedding / LM-head machinery. See tokenizer and vocabulary efficiency and the output-head budget.
2. Tied embeddings are an obvious early byte-saving primitive
Tying input and output embeddings is one of the cleanest ways to avoid paying twice for a large output-side matrix. That does not prove it is always optimal, but it is exactly the kind of move the challenge rewards.
3. The final score is about the recovered artifact, not the floating-point model
The gap between 1.2172 pre-quant and 1.2244 post-roundtrip is a small but important reminder that the scored object is the compressed-and-recovered artifact.
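The int8 + zlib round trip that produces this gap can be sketched in a few lines. The per-tensor absmax scaling below is an assumption for illustration; the public record does not specify the submission's actual quantization scheme.

```python
# Minimal int8 + zlib round-trip sketch. Absmax scaling is an assumed
# scheme, not the documented one from the public record.
import zlib

weights = [0.031, -0.118, 0.274, -0.005, 0.199, -0.251]  # toy tensor

scale = max(abs(w) for w in weights) / 127.0
q = bytes((round(w / scale) & 0xFF) for w in weights)    # int8, two's complement
blob = zlib.compress(q)                                  # what counts toward the cap

# Recovery: decompress, reinterpret bytes as signed int8, rescale.
dq = [((b ^ 0x80) - 0x80) * scale for b in zlib.decompress(blob)]
max_err = max(abs(a - b) for a, b in zip(weights, dq))
print(f"max roundtrip error: {max_err:.5f} (at most half a quant step, {scale / 2:.5f})")
```

The recovered weights differ from the originals by at most half a quantization step per entry; the 1.2172 → 1.2244 val_bpb shift is the aggregate effect of exactly this kind of per-weight error on the scored, recovered artifact.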
What this baseline does not prove
It does not tell us, by itself, whether the eventual winning strategy will center on:
- recurrent/shared-depth design
- selective outlier protection
- BitNet-like low-bit parameterization
- more aggressive tokenizer redesign
- bounded evaluation-time compute
Those remain open lanes rather than publicly settled conclusions.
Most natural lane connections
This run connects most directly to:
- tokenizer and vocabulary efficiency
- training economics and small-model bottlenecks
- quantization and outlier handling
The paper links here are interpretive rather than evidentiary: they help explain why the baseline looks plausible, not that the public record explicitly used those methods.