This layer is for frontier synthesis, not generic paper summaries. Each note below tries to connect multiple recent papers into a falsifiable claim about where the next non-obvious Parameter Golf gains may come from.
The goal is to answer one question: what research seam looks underexploited now, and what concrete prediction would let us kill or promote it quickly?
How to use this layer
- Start here when paper notes feel too local and lane pages feel too broad.
- Treat each frontier as a cross-paper mechanism thesis.
- Prefer notes that say what would disconfirm the idea, not just why it sounds exciting.
Highest-leverage seams right now
1. Byte allocation beats average bit-width
Why it matters: the newest low-bit papers increasingly win by deciding which parameters deserve protection, not by globally lowering quantization error. This is the strongest bridge from pQuant, PTQ1.61, MicroScopiQ, and ClusComp.
Best fit with existing graph: quantization and outliers, sparse outlier preservation, Decoupled precision.
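A minimal sketch of the allocation claim, with all numbers made up: a toy weight vector with a small-variance bulk plus a few outliers, where protecting the outliers at full precision (paying index metadata) and spending only 3 bits on the bulk beats uniform 4-bit on both byte count and reconstruction error. The ratios, the fp16/2-byte-index costs, and the quantizer are illustrative assumptions, not any paper's scheme.

```python
import random

random.seed(0)
N = 1000

# Toy weight vector: a small-variance bulk plus a few large outliers.
weights = [random.gauss(0.0, 0.02) for _ in range(N)]
for i in random.sample(range(N), 10):
    weights[i] = random.gauss(0.0, 1.0)

def quantize(ws, bits, lo, hi):
    """Uniform quantization into 2**bits levels over [lo, hi]."""
    levels = 2 ** bits - 1
    out = []
    for w in ws:
        t = min(max((w - lo) / (hi - lo), 0.0), 1.0)
        out.append(lo + round(t * levels) / levels * (hi - lo))
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Scheme A: uniform 4-bit over the full, outlier-stretched range.
lo, hi = min(weights), max(weights)
uniform = quantize(weights, 4, lo, hi)
bytes_uniform = N * 4 / 8                        # 500 bytes

# Scheme B: 3-bit for the bulk over a tight range; the 10 largest-
# magnitude weights kept at fp16 plus 2 bytes of index metadata each.
protected = set(sorted(range(N), key=lambda i: abs(weights[i]))[-10:])
bulk = [w for i, w in enumerate(weights) if i not in protected]
mixed = quantize(weights, 3, min(bulk), max(bulk))
for i in protected:
    mixed[i] = weights[i]                        # stored at full precision
bytes_mixed = (N - 10) * 3 / 8 + 10 * (2 + 2)    # ~411 bytes

print(bytes_mixed < bytes_uniform)                    # fewer bytes...
print(mse(weights, mixed) < mse(weights, uniform))    # ...and lower error
```

The point of the toy: uniform quantization pays for the outlier-stretched range on every parameter, while targeted protection pays metadata only where it buys error back.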
2. Compression interfaces for shared depth
Why it matters: recent recursion papers and low-bit-stability papers point at the same hidden problem: repeated blocks become much more viable if their inputs and role shifts are explicitly normalized and lightly conditioned. This is where Extra RMSNorm, QuEST, Relaxed Recursive Transformers, MoEUT, and Fine-grained Parameter Sharing start to rhyme.
Best fit with existing graph: recursive sharing, recursive width scaling, RMSNorm stabilized scaling.
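The interface claim can be shown with a deliberately tiny toy: a single tied 2x2 block with gain above 1, reused eight times. The matrix, pass count, and input are all hypothetical. Without a norm at the reuse boundary the activation scale compounds; an RMSNorm at the interface resets the scale each pass, which is the stability property repeated blocks need.

```python
import math

# Hypothetical 2x2 tied block with gain above 1; W, the pass count,
# and the input vector are all invented for illustration.
W = [[1.4, 0.3], [0.2, 1.5]]

def rmsnorm(x, eps=1e-6):
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def block(x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

raw = normed = [1.0, -1.0]
for _ in range(8):                    # the same weights reused at every depth
    raw = block(raw)                  # activation scale compounds each reuse
    normed = block(rmsnorm(normed))   # an interface norm resets the scale

print(max(map(abs, raw)) > 10)       # unnormalized reuse blows up
print(max(map(abs, normed)) < 3)     # normalized interface stays bounded
```

The same mechanism is why conditioning at the reuse boundary, not inside the block, is the cheap place to spend a few extra parameters.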
3. Tokenizer-head co-design under a hard cap
Why it matters: the recent tokenizer papers do not say “smaller token count wins.” They say tokenization must be judged jointly with LM-head cost, logits cost, and domain fit. That makes ReTok, Vocabulary Compression, Beyond Text Compression, and Plan Early much more relevant than a standard tokenizer discussion would suggest.
Best fit with existing graph: tokenizer and vocabulary efficiency, training economics, Tokenizer efficiency.
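The joint-accounting point reduces to simple arithmetic. Everything below is a hypothetical placeholder: a 512-dim model with an untied fp16 embedding table and LM head, and invented tokens-per-character rates for three vocabulary sizes.

```python
# All numbers are hypothetical placeholders, not measured tokenizer stats.
def embed_and_head_bytes(vocab_size, d_model=512, bytes_per_param=2):
    # Untied fp16 embedding table plus LM head.
    return 2 * vocab_size * d_model * bytes_per_param

def joint_cost(vocab_size, tokens_per_char, doc_chars=1_000_000):
    # A tokenizer must be judged on both axes at once: parameter bytes
    # spent on the vocabulary, and tokens processed for a fixed document.
    return embed_and_head_bytes(vocab_size), tokens_per_char * doc_chars

for vocab, tpc in [(8_000, 0.45), (32_000, 0.30), (128_000, 0.26)]:
    param_bytes, tokens = joint_cost(vocab, tpc)
    print(f"vocab={vocab:>6}: {param_bytes/1e6:6.1f} MB head+embed, "
          f"{tokens/1e6:.2f}M tokens")
```

Under a hard artifact cap, the last vocabulary quadrupling buys a small token-count improvement at a large byte cost, which is exactly why "smaller token count wins" is the wrong single-axis metric.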
4. Entropy-friendly model structure
Why it matters: some compression ideas look good in nominal bit-width and still lose once metadata and final coding are counted. The more interesting seam is whether we can design model structure that a downstream codec naturally likes: repeated bases, clustered values, shared blocks, and low-entropy exception patterns.
Best fit with existing graph: Additive Quantization, ClusComp, Fine-grained Parameter Sharing, quantization and outliers, recursive sharing.
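The nominal-versus-coded gap can be demonstrated with a general-purpose codec: two weight streams with identical nominal size (one byte per value), where one is spread over 256 levels and the other is clustered onto a small hypothetical codebook, as cluster-style compression or parameter sharing would produce.

```python
import random
import zlib

random.seed(0)
n = 4096

# Two streams with the same nominal size: one byte per value.
uniform_vals = bytes(random.randrange(256) for _ in range(n))   # 256 levels
codebook = [3, 97, 140, 201]                                    # hypothetical
clustered_vals = bytes(random.choice(codebook) for _ in range(n))

coded_uniform = len(zlib.compress(uniform_vals, 9))
coded_clustered = len(zlib.compress(clustered_vals, 9))

print(coded_clustered < coded_uniform)   # the codec prefers clustered values
print(coded_clustered < n // 2)          # well under the nominal 4 KiB
```

Only the final coded size counts against a byte cap, so model structure that a downstream entropy coder likes is worth real bytes even at the same nominal bit-width.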
5. Refinement loops as decompression
Why it matters: Inference Scaling Laws and Plan Early suggest that once storage is the bottleneck, extra evaluation-time compute can act like recovered capacity. For Parameter Golf, the frontier question is whether a compact recurrent model can use bounded extra passes to reconstruct some of the behavior that would otherwise have required more stored weights.
Best fit with existing graph: evaluation-time compute, recurrent wide architecture, recursive sharing.
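The compute-for-storage trade has a classical miniature: a small contractive block applied repeatedly converges to the output of a larger operator that was never stored. The 2x2 matrix, bias, and pass count below are invented stand-ins for a compact weight-tied model.

```python
# Hypothetical numbers: a 2x2 contractive block A and bias b stand in for
# a compact weight-tied model; the fixed point of x <- Ax + b equals the
# output of the larger, never-stored operator (I - A)^-1 b.
A = [[0.2, 0.1], [0.05, 0.3]]
b = [1.0, 2.0]

def refine(x):
    """One bounded eval-time pass through the shared block."""
    return [sum(A[i][j] * x[j] for j in range(2)) + b[i] for i in range(2)]

x = [0.0, 0.0]
for _ in range(30):            # extra passes instead of extra stored weights
    x = refine(x)

# At the fixed point another pass changes nothing: x solves (I - A) x = b.
residual = max(abs(x[i] - refine(x)[i]) for i in range(2))
print(residual < 1e-9)
```

The frontier question is whether this trade survives at model scale, where the "larger operator" is behavior rather than a matrix inverse and the pass budget is bounded by wall-clock constraints.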
6. Rate-distortion for artifact caps
Why it matters: Radio, OWQ, and ReALLM all imply that the strongest next wins may come from explicit byte-return accounting rather than nominal low-bit branding. This is the cleanest current bridge between information-theoretic allocation and concrete protected-structure design.
Best fit with existing graph: quantization and outliers, byte allocation beats average bit-width, entropy-friendly model structure.
7. Learned weight codecs and compressible training
Why it matters: the newest papers are starting to move beyond post-hoc low-bit tricks toward optimizer choices, training objectives, and learned weight representations that directly target future compressibility. This is the strongest current bridge from “better quantizer” to “better artifact ontology.”
Best fit with existing graph: moonshots, training economics, rate-distortion for artifact caps.
Ranking by evidence quality
Strongest evidence density
- Byte allocation beats average bit-width
- Learned weight codecs and compressible training
- Compression interfaces for shared depth
Most upside if true
- Learned weight codecs and compressible training
- Compression interfaces for shared depth
- Refinement loops as decompression
Highest risk of overfitting or complexity blow-up
- Learned weight codecs and compressible training
- Refinement loops as decompression
- Entropy-friendly model structure
A useful reading order
- Byte allocation beats average bit-width
- Rate-distortion for artifact caps
- Learned weight codecs and compressible training
- Compression interfaces for shared depth
- Entropy-friendly model structure
- Tokenizer-head co-design under a hard cap
- Refinement loops as decompression
That order moves from explicit byte-allocation logic to learned artifact formats, then to architectural and compute-for-storage ideas.
What would count as real progress
A frontier note is doing its job if it helps us say one of the following:
- “this cross-paper thesis predicts a measurable win in our local benchmark”
- “this only sounds good until metadata or wall-clock constraints from the challenge are counted”
- “this lane should be split because two mechanisms that looked related are actually in tension”