Small models do not simply behave like scaled-down frontier models. Different bottlenecks appear earlier, and some modeling choices mainly matter because of what they do to those bottlenecks.

Core question

Which parts of a compact LLM are actually expensive enough to reshape the best design choice?

In this regime, the answer may be:

  • logits and vocabulary handling
  • sequence length
  • recurrent reuse of the same block
  • width allocation inside a shared model
  • stabilization steps required by aggressive compression
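To see why logits and vocabulary handling make this list, a back-of-envelope FLOP count is useful. The sketch below compares the per-token cost of the LM head against the transformer body for one assumed compact configuration (d_model=512, 8 layers, 50k vocabulary); the shapes and the ~2-FLOPs-per-weight approximation are illustrative assumptions, not figures from the cited papers.

```python
# Back-of-envelope per-token FLOPs for a compact decoder, showing how large
# the LM head can loom at small widths. All shapes are illustrative.

def block_flops(d_model: int) -> int:
    # One transformer block: ~4*d^2 attention weights + ~8*d^2 MLP weights,
    # at ~2 FLOPs per weight per token (ignoring the attention map itself).
    return 2 * (4 * d_model**2 + 8 * d_model**2)

def head_flops(d_model: int, vocab: int) -> int:
    # Output projection: a d_model x vocab matmul, ~2 FLOPs per weight.
    return 2 * d_model * vocab

d, vocab, layers = 512, 50_000, 8       # assumed compact-model shapes
body = layers * block_flops(d)
head = head_flops(d, vocab)
print(f"head share of per-token FLOPs: {head / (body + head):.1%}")  # → 50.4%
```

At these shapes the output projection alone costs about as much as the entire eight-layer body, which is why vocabulary compression can be a first-order modeling decision rather than a systems afterthought.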

Central papers

  • Ashkboos et al. (2024), Computational Bottlenecks of Training Small-scale Large Language Models
  • Grangier et al. (2024), Need a Small Specialized Language Model? Plan Early!
  • Vennam et al. (2024), LLM Vocabulary Compression for Low-Compute Environments

Main takeaways for compact LLM design

  • attention efficiency matters earlier than many people expect (Ashkboos et al., 2024)
  • logits and vocabulary handling can become a first-order systems and modeling problem (Vennam et al., 2024)
  • a design that looks small on paper can still waste compute if it produces too many tokens or routes them through an overly expensive output path
  • recurrent or shared-depth designs are appealing partly because they shift the budget from stored depth toward reusable computation

Why this lane matters conceptually

This lane is not just about speed. It helps explain why some ideas become more attractive under a compact-model budget: sequence reduction, vocabulary compression, and recurrent reuse each trade one scarce resource for another, and the trade only pays off once the corresponding bottleneck is actually binding.

Questions worth keeping connected

  • when is sequence reduction better than parameter reduction?
  • how often is the LM head the hidden bottleneck in a “small” model?
  • when does recurrent reuse save bytes without creating too much extra compute?
  • which stabilization methods are worth their overhead because they improve compression enough?
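The first question above has a quantitative flavor, so a hedged comparison may help: under the standard ~2-FLOPs-per-parameter-per-token approximation plus a quadratic attention term, halving sequence length removes both dense and attention cost, while halving parameters touches only the dense term. All constants here are illustrative assumptions, not measurements from the cited papers.

```python
# Hedged sketch: halving sequence length versus halving parameter count,
# using textbook scaling approximations with illustrative constants.

def forward_flops(n_tokens: int, n_params: int, n_layers: int, d_model: int) -> int:
    dense = 2 * n_params * n_tokens              # matmuls against weights
    attn = 4 * n_layers * n_tokens**2 * d_model  # QK^T and AV products
    return dense + attn

base = dict(n_params=100_000_000, n_layers=8, d_model=512)
half_seq = forward_flops(2048, **base)
half_par = forward_flops(4096, **{**base, "n_params": 50_000_000})
print(half_seq < half_par)  # → True
```

At these settings sequence reduction wins because it also shrinks the quadratic attention term; at short contexts, where the dense term dominates, the comparison can flip, which is why the question is worth keeping open rather than answering once.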

References

Ashkboos, S., Mirzadeh, I., Alizadeh, K., Sekhavat, M. H., Nabi, M., Farajtabar, M., & Faghri, F. (2024). Computational Bottlenecks of Training Small-scale Large Language Models. arXiv Preprint arXiv:2410.19456. https://arxiv.org/abs/2410.19456
Grangier, D., Katharopoulos, A., Ablin, P., & Hannun, A. (2024). Need a Small Specialized Language Model? Plan Early! arXiv Preprint arXiv:2402.01093. https://arxiv.org/abs/2402.01093
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv Preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371