This page summarizes the public Naive Baseline record.

Hard public facts

From the upstream public record:

  • track: main 10-minute / 16 MB leaderboard
  • date: 2026-03-18
  • reported score: val_bpb = 1.22436570
  • total submission size: 15,863,489 bytes
  • code size: 47,642 bytes
  • compressed model size (int8 + zlib): 15,815,847 bytes
  • layout summary: 9 layers, model dim 512, vocab 1024, 4 KV heads
  • embeddings: tied input/output embeddings
  • training budget: wallclock-capped at 600 seconds on 8xH100
  • stopping point: stopped at step 13,780 when the wallclock cap ended the run
  • pre-quant metric at stop: val_bpb = 1.2172
  • post-roundtrip metric: val_bpb = 1.2244
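The byte accounting above can be checked directly. A minimal sketch, using the reported numbers; the cap interpretation (16 × 1024 × 1024 bytes) is an assumption about how the track defines "16 MB":

```python
# Sanity-check the published byte accounting against a 16 MB cap.
# code_bytes and model_bytes are the reported figures; CAP_BYTES is an
# assumed binary-megabyte interpretation of the track limit.
CAP_BYTES = 16 * 1024 * 1024          # 16,777,216 bytes

code_bytes = 47_642                   # reported code size
model_bytes = 15_815_847              # reported compressed model size
total_bytes = code_bytes + model_bytes

assert total_bytes == 15_863_489      # matches the reported total
assert total_bytes <= CAP_BYTES       # fits, with roughly 0.9 MB of headroom
print(f"headroom: {CAP_BYTES - total_bytes:,} bytes")
```

Under this reading, the submission spends about 99.7% of its bytes on the compressed model, which matches the "byte budget spent almost entirely on the compressed model" description below.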

The record README describes it as a simple baseline using the published fineweb10B_sp1024 dataset/tokenizer export and the current train_gpt.py snapshot.

Why this run matters

This is the first thing a history page needs: a credible anchor point.

The run's value is not that it proves an optimal design; it is that it gives a clean, reproducible answer to the question:

What does a legal, straightforward, leaderboard-visible Parameter Golf submission look like?

The answer, at least initially, is: a compact tied-embedding transformer with a small vocabulary, modest width, and a byte budget spent almost entirely on the compressed model itself.

Conceptual read

The baseline highlights several challenge realities at once:

1. Vocabulary size is a first-order design choice

A 1024-token vocabulary is not just an implementation detail. Under a hard artifact cap, it directly controls the cost of the embedding / LM-head machinery. See tokenizer and vocabulary efficiency and the output-head budget.
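A rough back-of-envelope, assuming one byte per parameter at int8 and ignoring zlib savings (so these are upper-bound estimates), shows why the vocabulary is first-order. The 50,257 figure is a hypothetical GPT-2-sized vocabulary for contrast, not anything from the record:

```python
# Upper-bound cost of the (tied) embedding table at int8 (1 byte/param),
# comparing the record's 1024-token vocab to a hypothetical GPT-2-sized one.
d_model = 512

def embedding_bytes(vocab_size: int) -> int:
    return vocab_size * d_model       # int8: one byte per parameter

small = embedding_bytes(1024)         # 524,288 bytes (~0.5 MB)
large = embedding_bytes(50_257)       # ~25.7 MB: alone exceeds a 16 MB cap
print(small, large)
```

At vocab 1024 the table is a rounding error in the budget; at a conventional vocab size it would not fit at all, even before any transformer layers are counted.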

2. Tied embeddings are an obvious early byte-saving primitive

Tying input and output embeddings is one of the cleanest ways to avoid paying twice for a large output-side matrix. That does not prove it is always optimal, but it is exactly the kind of move the challenge rewards.
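The byte saving from tying is easy to quantify under the record's layout. A minimal sketch, again assuming one byte per parameter at int8 before zlib:

```python
# Bytes avoided by tying: an untied model stores the vocab x d_model
# matrix twice (input embedding + output head), a tied model once.
vocab, d_model = 1024, 512

shared = vocab * d_model              # 524,288 params in the shared matrix
saved_int8 = shared                   # ~0.5 MB saved at 1 byte/param, pre-zlib
print(f"tying saves about {saved_int8 / 1e6:.1f} MB at int8")
```

At this small vocabulary the absolute saving is modest (~0.5 MB), but it is free quality-neutral headroom, which is why it reads as an obvious early move.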

3. The final score is about the recovered artifact, not the floating-point model

The gap between 1.2172 pre-quant and 1.2244 post-roundtrip is a small but important reminder that the scored object is the compressed-and-recovered artifact.
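The roundtrip itself can be sketched. This is an assumption about the scheme (symmetric per-tensor int8 quantization plus zlib over the raw int8 bytes), not the record's actual code; the point is that the scored weights are the dequantized ones, and the cap is measured on the compressed blob:

```python
import zlib
import numpy as np

# Hypothetical int8 + zlib roundtrip: quantize, compress (this is what the
# byte cap sees), then decompress and dequantize (this is what gets scored).
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)

scale = np.abs(w).max() / 127.0                   # symmetric per-tensor scale
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

blob = zlib.compress(q.tobytes(), level=9)        # counted against the cap
q_back = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
w_back = q_back.astype(np.float32) * scale        # the scored weights

max_err = float(np.abs(w - w_back).max())         # bounded by ~scale / 2
assert max_err <= scale / 2 + 1e-6
```

The per-weight error is bounded by about half the quantization step, and the 1.2172 → 1.2244 bpb gap is the aggregate effect of exactly that kind of perturbation.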

What this baseline does not prove

It does not tell us, by itself, whether the eventual winning strategy will center on:

  • recurrent/shared-depth design
  • selective outlier protection
  • BitNet-like low-bit parameterization
  • more aggressive tokenizer redesign
  • bounded evaluation-time compute

Those remain open lanes rather than publicly settled conclusions.

Most natural lane connections

This run connects most directly to:

The paper links here are interpretive rather than evidentiary:

These papers help explain why the baseline looks plausible; linking them does not mean the public record explicitly used those methods.