Small models do not simply behave like scaled-down frontier models. Different bottlenecks appear earlier, and some modeling choices mainly matter because of what they do to those bottlenecks.
Core question
Which parts of a compact LLM are actually expensive enough to reshape the best design choice?
In the compact-model regime, the likely answers are:
- logits and vocabulary handling
- sequence length
- recurrent reuse of the same block
- width allocation inside a shared model
- stabilization steps required by aggressive compression
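The first two bottlenecks can be made concrete with a back-of-envelope FLOP count. The sketch below compares the per-token cost of the transformer body against the LM head for an assumed small configuration (6 layers, d_model = 512, 50K vocabulary — illustrative numbers, not taken from any of the cited papers); even at modest sequence lengths, the output matmul claims a large share of total compute.

```python
# Rough per-token FLOP estimate for a decoder-only model, to show why
# the LM head (d_model x vocab matmul) dominates early at small scale.
# All sizes are illustrative assumptions, not measurements.

def body_flops_per_token(n_layers, d_model, seq_len):
    # per layer: Q/K/V/O projections ~ 8*d^2, attention scores ~ 4*seq*d,
    # MLP with 4x expansion ~ 16*d^2; a multiply-add counts as 2 FLOPs
    per_layer = 2 * (8 * d_model**2 + 4 * seq_len * d_model + 16 * d_model**2)
    return n_layers * per_layer

def head_flops_per_token(d_model, vocab_size):
    # one d_model x vocab matmul per output position
    return 2 * d_model * vocab_size

# a toy ~30M-parameter-scale config (assumed)
body = body_flops_per_token(n_layers=6, d_model=512, seq_len=1024)
head = head_flops_per_token(d_model=512, vocab_size=50_000)
print(f"body: {body/1e6:.1f} MFLOPs, head: {head/1e6:.1f} MFLOPs, "
      f"head share: {head / (body + head):.0%}")
```

With these numbers the head alone is roughly a third of per-token training compute, which is why vocabulary handling shows up as a first-order cost long before it would in a frontier-scale model.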
Central papers
- Computational Bottlenecks of Training SLMs (Ashkboos et al., 2024)
- Plan Early (Grangier et al., 2024)
- Vocabulary Compression for Low-Compute Environments (Vennam et al., 2024)
Main takeaways for compact LLM design
- attention efficiency matters earlier than many people expect (Ashkboos et al., 2024)
- logits and vocabulary handling can become a first-order systems and modeling problem (Vennam et al., 2024)
- a design that looks small on paper may still waste compute if it generates too many tokens or routes them through an expensive output path
- recurrent or shared-depth designs are appealing partly because they shift the budget from stored depth toward reusable computation
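The last point — trading stored depth for reusable computation — reduces to a simple parameter-versus-FLOP accounting. The sketch below uses assumed toy sizes to show that a single block applied k times matches the forward compute of k unique blocks while storing only one block's weights.

```python
# Stored depth vs. recurrent reuse: same forward FLOPs, very different
# parameter (byte) budgets. Toy sizes, assumed for illustration.

def block_params(d_model):
    # attention (4*d^2) + MLP with 4x expansion (8*d^2); biases ignored
    return 12 * d_model**2

d, k = 512, 6
unique = k * block_params(d)          # k distinct layers stored
shared = block_params(d)              # one layer reused k times
flops = 2 * k * block_params(d)       # forward matmul FLOPs, same either way
print(f"stored depth: {unique/1e6:.1f}M params, "
      f"recurrent reuse: {shared/1e6:.1f}M params, "
      f"forward FLOPs/token: {flops/1e6:.1f}M in both cases")
```

The 6x parameter saving here is the generic upper bound; in practice shared-depth models give back some of it to extra iterations or adapter weights, but the direction of the trade is the same.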
Why this lane matters conceptually
This lane is not just about speed. It helps explain why some ideas become more attractive under a compact-model budget:
- output-head compression because vocab costs appear early
- recursive width scaling because reused depth can be better than storing many unique thin layers
- iterative refinement over stored depth because extra compute may be cheaper than extra bytes
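To make the first bullet concrete, one generic form of output-head compression is a low-rank factorization: project the hidden state down to a small rank r before the vocabulary matmul. The sketch below counts parameters for assumed toy sizes; this is a hedged illustration of the general idea, not the specific scheme of Vennam et al. (2024).

```python
# Output-head compression via low-rank factorization: replace the
# d_model x vocab matrix with d_model x r and r x vocab factors.
# Toy sizes, assumed; one generic scheme among several.

def full_head_params(d_model, vocab):
    return d_model * vocab

def factored_head_params(d_model, vocab, rank):
    return d_model * rank + rank * vocab

d, V, r = 512, 50_000, 64
full = full_head_params(d, V)
small = factored_head_params(d, V, r)
print(f"full head: {full/1e6:.2f}M params, "
      f"rank-{r} head: {small/1e6:.2f}M params, "
      f"ratio: {small/full:.2f}")
```

The factorization also shrinks the per-token head FLOPs by roughly the same ratio, since both matmuls now pass through the rank-r bottleneck.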
Questions worth keeping connected
- when is sequence reduction better than parameter reduction?
- how often is the LM head the hidden bottleneck in a “small” model?
- when does recurrent reuse save bytes without creating too much extra compute?
- which stabilization methods are worth their overhead because they improve compression enough?
Related
- Tokenizer and vocabulary efficiency
- Recursive and shared-parameter architectures
- The LM head is part of the compression problem
- Compute-for-storage exchange
Ashkboos, S., Mirzadeh, I., Alizadeh, K., Sekhavat, M. H., Nabi, M., Farajtabar, M., & Faghri, F. (2024). Computational Bottlenecks of Training Small-scale Large Language Models. arXiv preprint arXiv:2410.19456. https://arxiv.org/abs/2410.19456
Grangier, D., Katharopoulos, A., Ablin, P., & Hannun, A. (2024). Need a Small Specialized Language Model? Plan Early! arXiv preprint arXiv:2402.01093. https://arxiv.org/abs/2402.01093
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371