This page tracks the working hypotheses that deserve repeated attention in the compact-LLM conceptual graph.
Current lane balance
The graph now has three especially important clusters:
- compression-aware robustness for closing the train-to-artifact gap
- shared-depth architectures for trading stored parameters for reusable computation
- tokenizer, vocabulary, and LM-head efficiency for treating the output path as part of the budget
Evaluation-time compute remains a strategic overlay on top of compact recurrent designs rather than a fully separate world.
Current hypotheses
RMSNorm stabilized scaling
- Status: active / plausible
- Lane: Quantization and outliers
- Thesis: extra normalization before fragile projections reduces activation-scale volatility and improves post-roundtrip quality.
- Best companions: Recursive width scaling, Sparse outlier preservation
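A minimal numpy sketch of the idea: rescale activations to unit RMS immediately before a fragile projection, so the projection's input scale stays bounded no matter how volatile the upstream signal is. All names, shapes, and scales here are illustrative assumptions, not a fixed design.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Rescale each row to unit root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

rng = np.random.default_rng(0)
d = 8
gain = np.ones(d)
w_proj = rng.normal(size=(d, d)) / np.sqrt(d)  # hypothetical fragile projection

x = rng.normal(size=(4, d)) * 50.0             # deliberately volatile input scale
y = rms_norm(x, gain) @ w_proj                 # projection now sees unit-RMS input
```

The point is that `w_proj` never sees the 50x blow-up, which is exactly the volatility a low-bit roundtrip struggles with.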
Sparse outlier preservation
- Status: active / unresolved but central
- Lane: Quantization and outliers
- Thesis: a tiny protected subset of sensitive parameters can buy much more quality than spending the same bytes uniformly.
- Best companions: RMSNorm stabilized scaling, Output-head compression
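A toy numpy illustration of the thesis: round weights to a coarse uniform grid, but keep the largest-magnitude entries exact as a tiny sparse overlay. The quantizer here is a deliberately crude stand-in (a fake low-bit roundtrip), and the weight matrix and outlier value are fabricated for the demo.

```python
import numpy as np

def quantize_with_outliers(w, keep_frac, levels=16):
    # Round weights to a coarse uniform grid, but splice the largest-
    # magnitude entries back in exactly. Byte cost: dense low-bit codes
    # plus a tiny sparse high-precision overlay.
    flat = w.ravel().copy()
    k = int(keep_frac * flat.size)
    idx = np.argsort(np.abs(flat))[-k:] if k > 0 else np.array([], dtype=int)
    kept = flat[idx]
    flat[idx] = 0.0                               # outliers stored separately
    scale = np.abs(flat).max() / (levels // 2 - 1)
    q = np.round(flat / scale) * scale            # fake low-bit roundtrip
    q[idx] = kept                                 # exact values back in place
    return q.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w[0, 0] = 40.0                                    # one huge outlier wrecks the grid
err_uniform = np.abs(w - quantize_with_outliers(w, keep_frac=0.0)).mean()
err_protected = np.abs(w - quantize_with_outliers(w, keep_frac=0.01)).mean()
```

Protecting 1% of entries removes the outlier from the grid's dynamic range, so the remaining 99% get a much finer step size and the mean error drops sharply.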
Recursive width scaling
- Status: active / high-upside architecture bet
- Lane: Recursive and shared-parameter architectures
- Thesis: storing fewer unique blocks and using the savings for width or lightweight adaptation may dominate many-unique-layer baselines.
- Concrete descendants: Recurrent wide architecture, Phase-conditioned sharing
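Back-of-envelope arithmetic for the storage trade, assuming a standard d -> 4d -> d MLP block (roughly 8*d*d parameters): under a fixed byte budget, one shared block can be several times wider than each of many unique blocks. The widths and block count are illustrative.

```python
def mlp_block_params(d):
    # A d -> 4d -> d MLP stores an up-projection (d x 4d) and a
    # down-projection (4d x d): about 8 * d * d parameters total.
    return 8 * d * d

unique_width, unique_blocks = 512, 12
budget = unique_blocks * mlp_block_params(unique_width)  # 12 stored blocks

# Spend the same bytes on ONE stored block: width grows by sqrt(12) ~ 3.46x.
shared_width = int((budget / 8) ** 0.5)
```

Because parameter count grows quadratically in width, collapsing 12 stored blocks into one buys a sqrt(12)x wider block, which is the "savings into width" half of the bet.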
Recurrent wide architecture
- Status: active / concrete architecture sketch
- Lane: Recursive and shared-parameter architectures
- Thesis: one or a few wide shared blocks may use the artifact budget better than many thinner unique layers.
- Best companions: RMSNorm stabilized scaling, Sparse outlier preservation
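A minimal numpy sketch of the forward pass: one wide shared block applied repeatedly in residual form, so stored bytes stay fixed while effective depth grows with the iteration count. Shapes, scales, and the tanh MLP are illustrative assumptions.

```python
import numpy as np

def recurrent_wide_forward(x, w_up, w_down, steps):
    # One wide shared block reused `steps` times: stored bytes stay
    # constant while effective depth grows with the iteration count.
    for _ in range(steps):
        x = x + np.tanh(x @ w_up) @ w_down   # residual form keeps reuse stable
    return x

rng = np.random.default_rng(0)
d, hidden, steps = 16, 64, 6
w_up = rng.normal(size=(d, hidden)) * 0.1
w_down = rng.normal(size=(hidden, d)) * 0.1

stored = w_up.size + w_down.size             # stored once, applied 6 times
unique_equivalent = steps * stored           # what 6 unique blocks would store
y = recurrent_wide_forward(rng.normal(size=(2, d)), w_up, w_down, steps)
```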
Phase-conditioned sharing
- Status: active / conceptual bridge hypothesis
- Lane: Recursive and shared-parameter architectures
- Thesis: tiny per-depth conditioning may recover much of the specialization lost by strict sharing at far lower byte cost than fully unique layers.
- Best companions: Recursive width scaling, Iterative refinement over stored depth
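One plausible minimal form of "tiny per-depth conditioning" is a FiLM-style scale-and-shift on the shared block's input, sketched below in numpy; the conditioning mechanism and all sizes are assumptions for illustration. The byte arithmetic is the interesting part: per-depth vectors cost 2*d parameters per depth versus ~8*d*d for a fully unique block.

```python
import numpy as np

def phase_conditioned_forward(x, w_up, w_down, phase_scales, phase_biases):
    # One shared block, specialized per depth by a tiny scale-and-shift:
    # far fewer extra bytes than storing a unique block per depth.
    for s, b in zip(phase_scales, phase_biases):
        x = x + np.tanh((x * s + b) @ w_up) @ w_down
    return x

rng = np.random.default_rng(0)
d, hidden, depth = 16, 64, 6
w_up = rng.normal(size=(d, hidden)) * 0.1
w_down = rng.normal(size=(hidden, d)) * 0.1
scales = np.ones((depth, d))                       # identity init: pure sharing
biases = np.zeros((depth, d))

conditioning_params = scales.size + biases.size    # 2 * depth * d
unique_layer_params = depth * (w_up.size + w_down.size)
y = phase_conditioned_forward(rng.normal(size=(2, d)), w_up, w_down, scales, biases)
```

At identity initialization this reduces exactly to strict sharing, so the conditioning can only be learned where specialization actually pays.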
Output-head compression
- Status: active / underexplored
- Lane: Tokenizer and vocabulary efficiency
- Thesis: the LM head and vocabulary path may dominate the compact-model budget enough that compressing or restructuring them beats modest backbone tweaks.
- Best companions: Sparse outlier preservation, Iterative refinement over stored depth
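A quick numpy illustration of one restructuring option, low-rank factorization of the head: replace the dense (d_model x vocab) projection with (d_model x r) and (r x vocab) factors that keep the same logits interface. The sizes below are illustrative, chosen only to show how heavily the vocab dimension dominates.

```python
import numpy as np

# Low-rank LM head: factor the dense d_model x vocab projection.
d_model, vocab, rank = 512, 32000, 64             # illustrative sizes

full_head = d_model * vocab                       # params in a dense head
low_rank_head = d_model * rank + rank * vocab     # params after factorization

rng = np.random.default_rng(0)
h = rng.normal(size=(1, d_model))                 # final hidden state
a = rng.normal(size=(d_model, rank)) * 0.05
b = rng.normal(size=(rank, vocab)) * 0.05
logits = (h @ a) @ b                              # same interface, fewer bytes
```

At these sizes the dense head alone stores ~16M parameters, which is why even a modest rank can free a large slice of a compact-model budget.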
Iterative refinement over stored depth
- Status: exploratory / strategically important
- Lane: Evaluation-time compute and inference scaling
- Thesis: a smaller recurrent model that spends bounded extra compute at evaluation may outperform a larger static artifact under the same storage cap.
- Best companions: Recurrent wide architecture, Phase-conditioned sharing
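A minimal numpy sketch of "bounded extra compute at evaluation": keep iterating the shared block until the state stops moving, under a hard step cap, so the eval-time budget is adaptive but never unbounded. The damped update rule and all shapes are illustrative assumptions.

```python
import numpy as np

def refine_at_eval(x, w_up, w_down, max_steps=20, tol=1e-3):
    # Spend bounded extra evaluation-time compute: iterate the shared
    # block until the state stops moving, up to a hard step cap.
    used = 0
    for used in range(1, max_steps + 1):
        nxt = 0.5 * x + 0.5 * (np.tanh(x @ w_up) @ w_down)  # damped update
        if np.abs(nxt - x).max() < tol:
            return nxt, used                                # converged early
        x = nxt
    return x, used                                          # cap reached

rng = np.random.default_rng(0)
d, hidden = 16, 64
w_up = rng.normal(size=(d, hidden)) * 0.1
w_down = rng.normal(size=(hidden, d)) * 0.1

refined, used = refine_at_eval(rng.normal(size=(2, d)), w_up, w_down)
```

Returning the step count makes the compute-versus-storage trade measurable: the same stored artifact can spend few or many steps per input.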
Near-term ranking by conceptual leverage
- RMSNorm stabilized scaling
- Sparse outlier preservation
- Recursive width scaling
- Output-head compression
- Phase-conditioned sharing
- Iterative refinement over stored depth
Cross-lane questions
- when does architecture beat compression-side robustness work?
- how often is the LM head the hidden bottleneck in a compact model?
- what is the cheapest form of specialization that keeps shared-depth models expressive?
- when does extra evaluation-time compute beat extra stored parameters?
- which improvements survive once the whole artifact, including vocab-dependent pieces, is compressed?