This page summarizes the public Naive Baseline record.

Hard public facts

From the upstream public record:

  • track: main 10-minute / 16 MB leaderboard
  • date: 2026-03-18
  • reported score: val_bpb = 1.22436570
  • total submission size: 15,863,489 bytes
  • code size: 47,642 bytes
  • compressed model size (int8 + zlib): 15,815,847 bytes
  • layout summary: 9 layers, model dim 512, vocab 1024, 4 KV heads
  • embeddings: tied input/output embeddings
  • training budget: wallclock-capped at 600 seconds on 8xH100
  • stopping point: stopped at step 13,780 when the wallclock cap ended the run
  • pre-quant metric at stop: val_bpb = 1.2172
  • post-roundtrip metric: val_bpb = 1.2244
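The byte accounting above can be checked directly. A minimal sketch, using the reported numbers; the cap interpretation (16 × 1024 × 1024 bytes) is an assumption about how the track defines "16 MB":

```python
# Sanity-check the published byte accounting against a 16 MB cap.
# code_bytes and model_bytes are the reported figures; CAP_BYTES is an
# assumed binary-megabyte interpretation of the track limit.
CAP_BYTES = 16 * 1024 * 1024          # 16,777,216 bytes

code_bytes = 47_642                   # reported code size
model_bytes = 15_815_847              # reported compressed model size
total_bytes = code_bytes + model_bytes

assert total_bytes == 15_863_489      # matches the reported total
assert total_bytes <= CAP_BYTES       # fits, with roughly 0.9 MB of headroom
print(f"headroom: {CAP_BYTES - total_bytes:,} bytes")
```

Under this reading, the submission spends about 99.7% of its bytes on the compressed model, which matches the "byte budget spent almost entirely on the compressed model" description below.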

The record README describes it as a simple baseline using the published fineweb10B_sp1024 dataset/tokenizer export and the current train_gpt.py snapshot.

Why this run matters

This is the first thing a history page needs: a credible anchor point.

The run's value is not that it proves an optimal design; it is that it gives a clean, reproducible answer to the question:

What does a legal, straightforward, leaderboard-visible Parameter Golf submission look like?

The answer, at least initially, is: a compact tied-embedding transformer with a small vocabulary, modest width, and a byte budget spent almost entirely on the compressed model itself.

Conceptual read

The baseline highlights several challenge realities at once:

1. Vocabulary size is a first-order design choice

A 1024-token vocabulary is not just an implementation detail. Under a hard artifact cap, it directly controls the cost of the embedding / LM-head machinery. See tokenizer and vocabulary efficiency and the output-head budget.
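A rough back-of-envelope, assuming one byte per parameter at int8 and ignoring zlib savings (so these are upper-bound estimates), shows why the vocabulary is first-order. The 50,257 figure is a hypothetical GPT-2-sized vocabulary for contrast, not anything from the record:

```python
# Upper-bound cost of the (tied) embedding table at int8 (1 byte/param),
# comparing the record's 1024-token vocab to a hypothetical GPT-2-sized one.
d_model = 512

def embedding_bytes(vocab_size: int) -> int:
    return vocab_size * d_model       # int8: one byte per parameter

small = embedding_bytes(1024)         # 524,288 bytes (~0.5 MB)
large = embedding_bytes(50_257)       # ~25.7 MB: alone exceeds a 16 MB cap
print(small, large)
```

At vocab 1024 the table is a rounding error in the budget; at a conventional vocab size it would not fit at all, even before any transformer layers are counted.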

2. Tied embeddings are an obvious early byte-saving primitive

Tying input and output embeddings is one of the cleanest ways to avoid paying twice for a large output-side matrix. That does not prove it is always optimal, but it is exactly the kind of move the challenge rewards.
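The byte saving from tying is easy to quantify under the record's layout. A minimal sketch, again assuming one byte per parameter at int8 before zlib:

```python
# Bytes avoided by tying: an untied model stores the vocab x d_model
# matrix twice (input embedding + output head), a tied model once.
vocab, d_model = 1024, 512

shared = vocab * d_model              # 524,288 params in the shared matrix
saved_int8 = shared                   # ~0.5 MB saved at 1 byte/param, pre-zlib
print(f"tying saves about {saved_int8 / 1e6:.1f} MB at int8")
```

At this small vocabulary the absolute saving is modest (~0.5 MB), but it is free quality-neutral headroom, which is why it reads as an obvious early move.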

3. The final score is about the recovered artifact, not the floating-point model

The gap between 1.2172 pre-quant and 1.2244 post-roundtrip is a small but important reminder that the scored object is the compressed-and-recovered artifact.
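The roundtrip itself can be sketched. This is an assumption about the scheme (symmetric per-tensor int8 quantization plus zlib over the raw int8 bytes), not the record's actual code; the point is that the scored weights are the dequantized ones, and the cap is measured on the compressed blob:

```python
import zlib
import numpy as np

# Hypothetical int8 + zlib roundtrip: quantize, compress (this is what the
# byte cap sees), then decompress and dequantize (this is what gets scored).
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)

scale = np.abs(w).max() / 127.0                   # symmetric per-tensor scale
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

blob = zlib.compress(q.tobytes(), level=9)        # counted against the cap
q_back = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
w_back = q_back.astype(np.float32) * scale        # the scored weights

max_err = float(np.abs(w - w_back).max())         # bounded by ~scale / 2
assert max_err <= scale / 2 + 1e-6
```

The per-weight error is bounded by about half the quantization step, and the 1.2172 → 1.2244 bpb gap is the aggregate effect of exactly that kind of perturbation.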

What this baseline does not prove

It does not tell us, by itself, whether the eventual winning strategy will center on:

  • recurrent/shared-depth design
  • selective outlier protection
  • BitNet-like low-bit parameterization
  • more aggressive tokenizer redesign
  • bounded evaluation-time compute

Those remain open lanes rather than publicly settled conclusions.

Most natural lane connections

This run connects most directly to:

The paper links here are interpretive rather than evidentiary:

These papers help explain why the baseline looks plausible; linking them does not mean the public record explicitly used those methods.