These lanes are the top-level buckets for background reading, hypothesis formation, and synthesis across the compact-LLM design space.
Active lanes
Recursive and shared-parameter architectures
How to trade stored depth for reused computation, wider blocks, or cheap per-step specialization.
Key pages:
- Recursive width scaling
- Recurrent wide architecture
- Phase-conditioned sharing
- Recursive layer sharing
- Compute-for-storage exchange
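To make the storage-for-compute trade concrete, here is a minimal numpy sketch (all sizes hypothetical, and names like `recursive_forward` and `phase_scales` are illustrative, not from any specific paper): one MLP block is stored once and applied several times, with a cheap per-step scale vector standing in for phase-conditioned specialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 64, 4  # hidden width and recursion depth (hypothetical sizes)

# One shared block: parameters are stored once but applied `steps` times.
W1 = rng.standard_normal((d, 4 * d)) * 0.02
W2 = rng.standard_normal((4 * d, d)) * 0.02

# Phase conditioning: a cheap per-step scale vector specializes each pass
# without storing a full extra block per step.
phase_scales = np.ones((steps, d))

def recursive_forward(x):
    for t in range(steps):
        h = np.maximum(x @ W1, 0.0) @ W2   # shared MLP with ReLU
        x = x + phase_scales[t] * h        # per-step modulation on the residual
    return x

shared_params = W1.size + W2.size + phase_scales.size
unrolled_params = steps * (W1.size + W2.size)
print(shared_params, unrolled_params)  # shared storage is roughly 1/steps of unrolled
```

The point of the sketch: effective depth is `steps`, but stored bytes stay near one block, and the phase vectors cost only `steps * d` extra parameters.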
Quantization, outliers, and compression-aware training
How to make the final compressed artifact behave like the trained model instead of collapsing under uniform low-bit treatment.
Key pages:
- RMSNorm stabilized scaling
- Sparse outlier preservation
- Normalization before projections
- Outlier-aware compression
- Decoupled precision
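A minimal sketch of the selectivity idea, not tied to any particular method (the `quantize_with_outliers` helper and all sizes are hypothetical): pull a small fraction of large-magnitude weights into an exact side table so the low-bit scale is set by the bulk of the distribution rather than the tail.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
w[rng.choice(4096, 8, replace=False)] *= 40.0  # inject a few large outliers

def quantize_with_outliers(w, bits=4, outlier_frac=0.005):
    # Pull the largest-magnitude entries into a sparse full-precision table...
    k = max(1, int(len(w) * outlier_frac))
    idx = np.argsort(np.abs(w))[-k:]
    outliers = w[idx].copy()
    dense = w.copy()
    dense[idx] = 0.0
    # ...so the quantization scale is set by the bulk, not the tail.
    scale = np.abs(dense).max() / (2 ** (bits - 1) - 1)
    q = np.round(dense / scale).astype(np.int8)
    return q, scale, idx, outliers

def dequantize(q, scale, idx, outliers):
    w_hat = q.astype(np.float64) * scale
    w_hat[idx] = outliers
    return w_hat

q, scale, idx, outliers = quantize_with_outliers(w)
err_outlier_aware = np.abs(dequantize(q, scale, idx, outliers) - w).max()

# Uniform baseline: one 4-bit scale for everything, outliers included.
scale_u = np.abs(w).max() / 7
err_uniform = np.abs(np.round(w / scale_u) * scale_u - w).max()
print(err_outlier_aware < err_uniform)  # the uniform scale is wrecked by the tail
```

This is exactly the uniformity-vs-selectivity tension: the sparse table costs a few extra bytes per outlier but keeps the dense grid fine-grained.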
Tokenizer and vocabulary efficiency
How tokenization, vocabulary size, and the LM head reshape both compute and stored bytes in compact language models.
Key pages:
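Back-of-envelope arithmetic for the tradeoff this lane studies (every number below is hypothetical, including the tokens-per-word figures): a bigger vocabulary shortens sequences, but the tied embedding / LM head grows linearly in vocabulary size, which bites hard under a 16 MB cap.

```python
# Hypothetical compact model: hidden width d, 4-bit weight storage,
# comparing two tokenizer choices. tokens_per_word values are made up
# for illustration, not measured.
d = 256
bytes_per_param = 0.5  # 4-bit storage

for vocab, tokens_per_word in [(8_000, 1.6), (32_000, 1.2)]:
    head_bytes = vocab * d * bytes_per_param  # tied embedding + output head
    print(vocab, head_bytes / 2**20, tokens_per_word)
```

At these sizes the 32k vocabulary costs about 3.9 MiB of head storage versus about 1.0 MiB for the 8k one, so the per-sequence compute saved by shorter token streams has to be weighed against a quarter of the whole byte budget.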
Training economics and small-model bottlenecks
How compact-model regimes change what matters: logit computation, sequence length, parameter reuse, and width allocation can dominate cost sooner than standard scaling intuitions suggest.
Key pages:
- Output-head compression
- Recursive width scaling
- The LM head is part of the compression problem
- Compute-for-storage exchange
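The "logits can dominate" point can be sanity-checked with the standard parameter-count estimate (roughly 12·L·d² for the transformer blocks, V·d for the output head); the sizes below are hypothetical, and the estimate ignores norms and biases.

```python
# Rough split of per-token parameters (and hence FLOPs, via the 2*params
# estimate) between transformer blocks and the LM head. Sizes are hypothetical.
def head_fraction(d, n_layers, vocab):
    block_params = 12 * n_layers * d * d  # attention + MLP, ignoring norms/biases
    head_params = vocab * d               # output projection
    return head_params / (head_params + block_params)

# A compact model: the head can be the majority of the work.
small = head_fraction(d=256, n_layers=8, vocab=32_000)
# A large model: the head is a rounding error.
big = head_fraction(d=4096, n_layers=32, vocab=32_000)
print(round(small, 2), round(big, 3))
```

Because the head term scales as V·d while the blocks scale as d², shrinking d flips which one dominates, which is why the LM head shows up as a first-class compression problem in this regime.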
Evaluation-time compute and inference scaling
How a compact model can use bounded extra reasoning or refinement steps to outperform a larger static artifact.
Key pages:
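A sketch of the bounded-refinement pattern this lane covers; `propose` and `score` are hypothetical stand-ins for a draft-and-revise model, and the toy instance just refines a square-root estimate to keep the sketch self-contained.

```python
# Bounded eval-time refinement: spend up to `budget` extra passes,
# keep the best candidate, and stop early once improvement stalls.
def refine(x0, propose, score, budget=4, min_gain=1e-3):
    best, best_score = x0, score(x0)
    for _ in range(budget):
        cand = propose(best)
        s = score(cand)
        if s < best_score + min_gain:  # no meaningful gain: stop spending compute
            break
        best, best_score = cand, s
    return best

# Toy instance: "refining" a guess for sqrt(2) by averaging with 2/x.
ans = refine(1.0,
             propose=lambda x: 0.5 * (x + 2 / x),
             score=lambda x: -abs(x * x - 2))
print(abs(ans * ans - 2) < 1e-3)
```

The lane question is whether a compact model plus this kind of capped loop beats a larger static artifact at equal wall-clock or byte budget; the loop itself costs zero stored parameters.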
Why lane pages matter
Paper notes are too granular, and experiment logs too specific, for lane-level synthesis. Lane pages are where we answer:
- what lever this family is trying to pull
- why it matters under the 16 MB cap
- which mechanisms sit inside the lane
- where the lane naturally composes with other lanes
- which concrete hypotheses deserve follow-up next
Cross-lane tensions worth tracking
- storage vs compute: shared depth and evaluation-time refinement both spend time to save bytes
- uniformity vs selectivity: low-bit methods win when they stop treating every tensor as equally fragile
- sequence length vs vocab size: a tokenizer can save tokens while making the output layer harder to store
- width vs specialization: recurrent blocks want more width, but they often need cheap phase-specific behavior to avoid collapse