This page summarizes the public run record currently visible in the upstream challenge repository.
Important scope note
The public record is still sparse. At the moment, it mainly shows:
- a reference baseline for the main 10-minute / 16 MB track
- an unlimited-compute non-record extension of roughly the same baseline family
So this page should be read as an early history page, not as a mature leaderboard chronicle.
Public runs visible in the repository snapshot
| Run | Track | Score (val_bpb) | Total bytes | What it establishes |
|---|---|---|---|---|
| Naive Baseline | Main leaderboard | 1.2244 | 15,863,489 | A conventional small tied-embedding transformer can fit the artifact cap and produce a credible baseline score. |
| 4-Hour Quasi-10B SP1024 | Non-record, unlimited compute | 1.2074 | 15,810,161 | Longer training on essentially the same artifact family improves score, but does not by itself rewrite the problem. |
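To make the "just under the cap" observation concrete, the table's byte totals can be checked directly. This is a minimal sketch over the two published numbers, assuming the cap is exactly 16,000,000 bytes as stated below:

```python
# Headroom of the two public runs against the assumed 16,000,000-byte cap.
CAP = 16_000_000
runs = {
    "Naive Baseline": 15_863_489,
    "4-Hour Quasi-10B SP1024": 15_810_161,
}
for name, total_bytes in runs.items():
    headroom = CAP - total_bytes
    # Both runs leave only about 1% of the cap unused.
    print(f"{name}: {headroom:,} bytes under the cap ({headroom / CAP:.2%})")
```

Both artifacts leave under 200,000 bytes of slack, which is the quantitative basis for calling the challenge artifact-centric.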
What the public record already suggests
Even this tiny sample supports a few cautious conclusions:
1. The challenge is immediately artifact-centric
Both visible runs sit just under the 16,000,000 byte cap, which reinforces the point made in Constraints and scoring: the challenge is not about nominal parameter count alone.
2. The first public reference point is intentionally simple
The public main-track record is a baseline-style run, not an exotic architecture manifesto. That matters because it gives future submissions a clean anchor.
3. Extra training helps, but the public evidence is still narrow
The unlimited-compute run is better than the main-track baseline, but it is still the same broad model family. Publicly, we do not yet have a recurrence-heavy, tokenizer-heavy, or evaluation-time-compute-heavy submission to compare against it.
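The size of that improvement can be stated precisely from the table. A quick calculation over the two published val_bpb scores (lower is better):

```python
# Score delta between the two public runs (val_bpb: lower is better).
baseline_bpb = 1.2244   # Naive Baseline, main leaderboard
extended_bpb = 1.2074   # 4-Hour Quasi-10B SP1024, unlimited compute

delta = baseline_bpb - extended_bpb  # absolute improvement in bits per byte
relative = delta / baseline_bpb      # improvement as a fraction of the baseline score
print(f"delta = {delta:.4f} bpb ({relative:.2%} relative improvement)")
```

The gap is about 0.017 bpb, roughly a 1.4% relative improvement: real, but modest, which is consistent with reading it as "longer training on the same family" rather than a structural advance.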
What is still missing from the public record
There is not yet enough disclosed evidence in this snapshot to rank with confidence the public viability of:
- recursive and shared-parameter architectures
- quantization and outlier-aware methods
- tokenizer and vocabulary redesign
- evaluation-time compute
Those lanes are strongly suggested by the challenge framing and literature, but they are not yet represented by clearly public, leaderboard-facing run writeups in this snapshot.
Best way to read this page
Use the individual run pages for the detailed distinction between:
- hard public facts
- reasonable interpretation
- unknowns the record does not settle