The challenge rules already imply a research agenda. Public literature and visible challenge work mostly sharpen that agenda rather than replace it.
Current public snapshot: March 20, 2026
As of March 20, 2026, the public tracker and upstream repository expose a much denser signal surface than the early README alone suggests:
- 14 accepted main-track runs on main
- 1 accepted non-record run
- 133 open PR-backed public submissions
- 36 recently closed PR-backed public submissions
- 7 ahead-of-upstream fork refs
The pressure is not evenly distributed. The public surface is now most crowded around:
- quantization
- optimizer / schedule tuning
- sliding-window evaluation
- token/output-side tricks such as BigramHash-like vocabulary-head interventions
Smaller but still real public clusters exist around:
- test-time training
- long-context variants
- output-head restructuring
- recurrence / shared-depth ideas
That snapshot should change how this page is read. The question is no longer “what seems plausible from the literature?” but “which literature-backed ideas have already become public competition surfaces, and which still remain comparatively open?”
The five big directions the challenge naturally creates
1. Shared depth and recurrent structure
The byte cap strongly favors models that reuse a small amount of structure many times.
Why it matters:
- buys depth without paying for fully unique blocks
- turns compute into effective capacity
- pairs naturally with evaluation-time unrolling
Best supporting pages:
- Recursive and shared-parameter architectures
- Relaxed Recursive Transformers
- MoEUT
- Fine-grained Parameter Sharing
- Transformers are SSMs
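The byte arithmetic behind this direction is easy to sketch. The parameter-count formula below is the standard rough count for a pre-norm transformer block; the dimensions, unroll factors, and bytes-per-parameter figure are illustrative assumptions, not numbers from the challenge or any accepted run:

```python
# Toy byte accounting: shared-depth vs fully unique transformer stacks.
# All dimensions and counts here are illustrative, not challenge rules.

def block_params(d_model: int, d_ff: int) -> int:
    """Rough parameter count for one transformer block:
    4 attention projections (Q, K, V, O) plus a 2-matrix MLP.
    Biases and norm scales are ignored for simplicity."""
    attn = 4 * d_model * d_model
    mlp = 2 * d_model * d_ff
    return attn + mlp

def artifact_bytes(n_unique_blocks: int, d_model: int, d_ff: int,
                   bytes_per_param: float) -> int:
    """Stored bytes depend only on *unique* blocks; the unroll count is free."""
    return int(n_unique_blocks * block_params(d_model, d_ff) * bytes_per_param)

# 12 effective layers either way: the unique stack stores 12 blocks,
# the shared stack stores 2 unique blocks and unrolls each 6 times.
unique = artifact_bytes(12, d_model=512, d_ff=2048, bytes_per_param=1.0)
shared = artifact_bytes(2, d_model=512, d_ff=2048, bytes_per_param=1.0)

print(unique, shared, unique // shared)
```

Under these toy numbers the shared stack stores 6x fewer bytes for the same effective depth, which is the whole appeal: the cap charges for unique structure, not for reuse.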
2. Quantization-aware robustness and outlier control
If the scored path runs through a compressed artifact, training must produce weights that survive that path.
Why it matters:
- post-roundtrip quality is the real target
- outliers and brittle scales can waste bytes and damage quality
- small normalization or scaling changes can matter more than larger architectural changes
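The outlier point above can be made concrete with a minimal symmetric per-tensor roundtrip. This is a generic max-abs quantization scheme written for illustration; it is not the challenge's actual export path, and the tensors are made up:

```python
# Minimal symmetric per-tensor int8 roundtrip, showing how a single
# outlier inflates the scale and degrades precision for every other weight.
# Pure-Python sketch under assumed conventions, not the scored exporter.

def quantize_roundtrip(weights, n_bits=8):
    """Quantize to signed n_bits with a max-abs scale, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1                   # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q]

def max_error(weights):
    """Worst-case absolute error after the quantize-dequantize roundtrip."""
    deq = quantize_roundtrip(weights)
    return max(abs(w - d) for w, d in zip(weights, deq))

tame = [0.01 * i for i in range(-50, 51)]   # well-behaved tensor
spiky = tame[:-1] + [50.0]                  # same tensor plus one outlier

print(max_error(tame), max_error(spiky))    # outlier blows up the roundtrip
```

One value 100x larger than the rest stretches the scale by the same factor, so every other weight loses that much resolution. This is why outlier control and normalization tweaks can matter more than architecture when the scored path runs through a compressed artifact.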
Best supporting pages:
3. Selective precision instead of uniform precision
A fixed artifact budget rarely wants perfectly uniform treatment of every tensor.
Why it matters:
- a small protected subset may buy more than globally raising precision
- mixed artifacts can target the truly sensitive parts of the model
- this reframes compression as allocation rather than only shrinkage
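The allocation framing reduces to simple byte arithmetic. The tensor names, sizes, and sensitivity choices below are hypothetical, picked only to show the shape of the trade-off:

```python
# Byte-budget arithmetic for mixed precision: protect a small sensitive
# subset at high precision, store everything else low-bit.
# Tensor names and sensitivity labels are hypothetical.

def total_bytes(tensors, protected, hi_bits=16, lo_bits=4):
    """tensors: {name: n_params}; protected: names stored at hi_bits."""
    bits = 0
    for name, n in tensors.items():
        bits += n * (hi_bits if name in protected else lo_bits)
    return bits // 8

model = {"embed": 4_000_000, "trunk": 20_000_000,
         "lm_head": 4_000_000, "norms": 20_000}

uniform_8bit = sum(model.values())  # 8 bits/param == 1 byte/param
mixed = total_bytes(model, protected={"norms", "embed"})

print(uniform_8bit, mixed)
```

In this toy budget the mixed artifact is smaller than uniform 8-bit even though the protected tensors sit at 16 bits, because the bulk of the parameters drop to 4. Whether the protected subset actually deserves the extra bits is an empirical question, which is the point of treating compression as allocation.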
Best supporting pages:
4. Tokenizer and output-head efficiency
The challenge metric is bits per byte, not bits per token. That makes the tokenizer frontend and the output side more central than many model builders expect.
Why it matters:
- sequence length interacts with evaluation cost
- vocabulary choices affect both tokenizer assets and output-layer size
- the output side can dominate bytes in small models
Best supporting pages:
- Tokenizer and vocabulary efficiency
- ReTok
- Vocabulary Compression
- VQ-Logits
- The LM Head is a Gradient Bottleneck
- Tokenizer Evaluation Across Scales
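The conversion from a token-level loss to the byte-level metric is standard arithmetic, and it shows why tokenizer choices move the score even at identical per-token loss. The formula is the usual nats-to-bits conversion; the token and byte counts below are made up:

```python
# Converting a token-level cross-entropy into a byte-level metric.
# The formula is standard; the specific counts below are illustrative.
import math

def bits_per_byte(total_nats: float, n_bytes: int) -> float:
    """Total cross-entropy in nats over the eval text -> bits per raw byte."""
    return total_nats / (n_bytes * math.log(2))

# Two tokenizers with the SAME per-token loss (2.5 nats) over the same
# 1 MB of text, differing only in how many tokens they emit.
n_bytes = 1_000_000
coarse = bits_per_byte(2.5 * 200_000, n_bytes)  # ~5 bytes per token
fine = bits_per_byte(2.5 * 400_000, n_bytes)    # ~2.5 bytes per token

print(round(coarse, 3), round(fine, 3))
```

The fine tokenizer pays the same per-token loss twice as often per byte, so its bits-per-byte is twice as bad. Per-token loss alone says nothing under this metric; compression of the frontend is part of the model.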
5. Evaluation-time compute
Parameter Golf leaves room for methods that store less but compute more at evaluation time, as long as the run stays within the cap.
Why it matters:
- capability may be cheaper to recompute than to store
- recurrent cores get much more attractive when extra unrolling is legal
- tiny models may benefit disproportionately from carefully budgeted extra passes
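The store-less-compute-more trade can be illustrated with any fixed-point core: one stored rule, reused as many times as the evaluation budget allows. The Newton iteration below is an arbitrary stand-in for a recurrent core; only the trade-off it exhibits is the point:

```python
# Toy illustration of evaluation-time compute: a fixed-point core whose
# stored description never grows, while extra eval-time iterations buy
# accuracy. The map itself is arbitrary; the trade-off is the point.

def unrolled(core, x0, steps):
    """Apply the same stored 'core' repeatedly; steps cost compute, not bytes."""
    x = x0
    for _ in range(steps):
        x = core(x)
    return x

# Newton's iteration for sqrt(2): a single stored rule, reused.
core = lambda x: 0.5 * (x + 2.0 / x)

shallow = abs(unrolled(core, 1.0, 2) ** 2 - 2.0)
deep = abs(unrolled(core, 1.0, 6) ** 2 - 2.0)

print(shallow, deep)  # more unrolls, same bytes, smaller error
```

The stored artifact is identical in both runs; only the evaluation-time step count changes. For a capped-bytes challenge, any capability that can be recomputed this way is capability that does not need to be stored.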
Best supporting pages:
What looks crowded already
These directions already have strong public motivation and visible competition as of March 20, 2026:
- low-bit robustness and export-aware training
- optimizer / schedule tuning attached to those low-bit stacks
- sliding-window evaluation and related protocol-aware scoring improvements
- token/output-side tricks that stay close to the existing transformer backbone
- initialization and stabilization tweaks that improve quantization friendliness
That does not make them exhausted. It means incremental work there has to be more specific and more empirically disciplined.
What still looks relatively open
Some areas still appear underexplored or at least less settled in the public record:
- officially accepted shared-depth / recurrent records; recurrence is visible in open PRs and fork deltas, but it is not yet the center of the accepted board
- output-head or paid-prefix style memory schemes that survive the standard accepted evaluation path
- tokenizer redesign aimed directly at the challenge metric rather than only token-side hashing layered on the current setup
- evaluation-time memory or latent refinement that remains comparable under the ordinary main-track path, rather than only under val-only or altered evaluation settings
- artifact formats that go beyond dense low-bit storage while keeping the bookkeeping simple enough to be acceptable
The most important interaction effects
The best challenge ideas are usually not isolated tricks. They compound across lanes:
- recurrence + eval-time compute can buy capability without storing more unique blocks
- normalization + outlier control can make low-bit export much more forgiving
- tokenizer efficiency + smaller output head can free bytes for the core model
- shared structure + selective precision can spend expensive bytes only where sharing hurts most
- test-time training + quantization stacks can create public-score jumps, but often at the cost of harder comparability
Newly public directions that deserve more weight
Three directions now deserve more emphasis than the older version of this page gave them:
1. Test-time adaptation is no longer hypothetical
The public surface now includes both an accepted LoRA TTT run and multiple stronger open-PR TTT variants. That means evaluation-time compute is not just about search or reranking anymore; it clearly includes adaptation during evaluation.
2. Paid-prefix or prefix-memory ideas are now visibly in play
Public PRs around paid-prefix schemes show that some competitors are explicitly trying to move capacity out of conventional trunk storage and into more structured externalized memory-like artifacts. Even if these do not become accepted records, they are now part of the public research map.
3. Recurrence is public, but still not officially proven
Shared-depth and recurrent ideas are no longer speculative from the literature alone. They are showing up in PRs and fork deltas. But because they are not yet dominant on the accepted board, they still read more like a live frontier than a settled lane.
How to use this page
If you want a challenge-level orientation:
- read Constraints and scoring
- read History and public runs
- use the atlas to see how the lanes fit together
- then drop into the relevant lane pages and paper notes