These lanes are the top-level buckets for background reading, hypothesis formation, and synthesis across the compact-LLM design space.

Active lanes

Recursive and shared-parameter architectures

How to trade stored depth for reused computation, wider blocks, or cheap per-step specialization.
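The depth-for-compute trade can be made concrete with a minimal sketch (all names and sizes here are hypothetical, not from any lane page): one stored block reused T times, with only a tiny per-step scale and bias providing the cheap per-step specialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # hidden width (hypothetical)
T = 4           # recursion depth: one block reused T times

# One stored weight matrix stands in for T distinct layers.
W = rng.normal(0.0, 0.1, (d, d))        # shared block weights
step_scale = np.ones((T, d))            # cheap per-step specialization:
step_bias = np.zeros((T, d))            # a scale + bias per step (2*T*d params)

def recurse(x):
    # Reuse the same W at every step; only the small per-step
    # scale/bias differs, trading stored depth for repeated compute.
    for t in range(T):
        x = np.tanh(x @ W) * step_scale[t] + step_bias[t]
    return x

y = recurse(rng.normal(size=d))

# Parameter accounting: shared depth vs. T independent layers.
shared_params = W.size + step_scale.size + step_bias.size
unshared_params = T * W.size
print(shared_params, unshared_params)
```

The accounting line is the point: the shared version stores one d×d matrix plus 2·T·d specialization parameters, versus T full matrices for unshared depth, at the cost of running the same block T times.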

Key pages:

Quantization, outliers, and compression-aware training

How to make the final compressed artifact behave like the trained model instead of collapsing under uniform low-bit treatment.
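A toy illustration of why uniform low-bit treatment collapses (the setup and function names are illustrative, not a specific paper's method): a single heavy outlier channel inflates a per-tensor quantization scale, while keeping that channel in full precision lets the rest quantize with a much smaller scale.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, (8, 64))
W[:, 3] *= 40.0                      # inject one heavy outlier channel

def quantize_uniform(w, bits=4):
    # symmetric per-tensor uniform quantization:
    # one scale for everything, so the outlier sets it
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def quantize_outlier_aware(w, bits=4, k=1):
    # keep the k largest-magnitude columns in full precision and
    # quantize the rest with a scale the outlier no longer inflates
    mask = np.zeros(w.shape[1], dtype=bool)
    mask[np.argsort(np.abs(w).max(axis=0))[-k:]] = True
    out = w.copy()
    out[:, ~mask] = quantize_uniform(w[:, ~mask], bits)
    return out

err_uniform = np.mean((W - quantize_uniform(W)) ** 2)
err_aware = np.mean((W - quantize_outlier_aware(W)) ** 2)
print(err_uniform, err_aware)
```

Under the uniform scheme most normal-magnitude weights round to zero; the selective scheme spends a few full-precision columns to recover them, which is the selectivity-over-uniformity lever this lane studies.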

Key pages:

Tokenizer and vocabulary efficiency

How tokenization, vocabulary size, and the LM head reshape both compute and stored bytes in compact language models.
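The stored-bytes side of this trade is simple arithmetic; a sketch (the 512-wide model and 8-bit parameters are assumed for illustration) shows how quickly the embedding/LM-head matrix eats a 16 MB budget as vocabulary grows:

```python
def head_bytes(vocab_size, d_model, bytes_per_param=1, tied=True):
    # bytes stored for the token embedding and LM head;
    # weight tying shares a single vocab x d_model matrix between them
    n_matrices = 1 if tied else 2
    return n_matrices * vocab_size * d_model * bytes_per_param

CAP = 16 * 2**20  # 16 MiB budget
for v in (8_000, 32_000, 100_000):
    b = head_bytes(v, d_model=512)
    print(f"vocab={v}: {b:,} bytes, {b / CAP:.1%} of the cap")
```

At d_model = 512 and one byte per parameter, a 32k vocabulary alone is ~98% of the cap even with tying, which is why a tokenizer that saves tokens can still lose on stored bytes.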

Key pages:

Training economics and small-model bottlenecks

How compact-model regimes change what matters: logits, sequence length, reuse, and width allocation can dominate sooner than standard scaling intuitions suggest.
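The "logits dominate sooner" claim can be checked with back-of-envelope FLOP counts (the layer-cost approximation and the example shape are assumptions, using the standard 2-FLOPs-per-multiply-add convention):

```python
def body_flops(n_layers, d):
    # rough forward FLOPs per token for the transformer body:
    # attention projections (4*d^2) + MLP at 4x width (8*d^2) per layer,
    # times 2 FLOPs per multiply-add
    return n_layers * 12 * d * d * 2

def logit_flops(d, vocab):
    # output projection onto the vocabulary
    return 2 * d * vocab

# a compact example shape: the logit projection rivals the whole body
d, n_layers, vocab = 256, 8, 32_000
share = logit_flops(d, vocab) / (body_flops(n_layers, d) + logit_flops(d, vocab))
print(f"logits are {share:.0%} of forward FLOPs")
```

For this shape the logit projection is over half of forward compute, a regime large-model scaling intuitions rarely consider because there the body term dwarfs the vocabulary term.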

Key pages:

Evaluation-time compute and inference scaling

How a compact model can use bounded extra reasoning or refinement steps to outperform a larger static artifact.
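One bounded-extra-compute pattern is best-of-n sampling with a verifier; this toy sketch (the proposer and scorer are hypothetical stand-ins, not a real model or reward function) shows the mechanism of spending samples rather than bytes:

```python
import random

def propose(rng):
    # hypothetical stand-in for one sampled answer from a compact model
    return rng.gauss(0.0, 1.0)

def score(answer, target=0.0):
    # hypothetical verifier: closer to the target is better
    return -abs(answer - target)

def best_of_n(n, seed=0):
    # draw n candidates and keep the one the verifier prefers;
    # extra inference-time compute, zero extra stored parameters
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)

print(abs(best_of_n(1)), abs(best_of_n(16)))
```

With a fixed seed the n=16 pool contains the n=1 candidate, so the selected answer is never worse; the open question for the lane is when this bounded refinement beats simply storing a larger static artifact.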

Key pages:

Why lane pages matter

Paper notes are too granular and experiment logs are too specific. Lane pages are where we answer:

  • what lever this family is trying to pull
  • why it matters under the 16 MB cap
  • which mechanisms sit inside the lane
  • where the lane naturally composes with other lanes
  • which concrete hypotheses deserve follow-up next

Cross-lane tensions worth tracking

  • storage vs compute: shared depth and evaluation-time refinement both spend time to save bytes
  • uniformity vs selectivity: low-bit methods win when they stop treating every tensor as equally fragile
  • sequence length vs vocab size: a tokenizer can save tokens while making the output layer harder to store
  • width vs specialization: recurrent blocks want more width, but they often need cheap phase-specific behavior to avoid collapse
