Small models do not simply behave like scaled-down frontier models. Different bottlenecks appear earlier, and some modeling choices mainly matter because of what they do to those bottlenecks.
Core question
Which parts of a compact LLM are actually expensive enough to reshape the best design choice?
In the compact-model regime, the likely answers are:
- logits and vocabulary handling
- sequence length
- recurrent reuse of the same block
- width allocation inside a shared model
- stabilization steps required by aggressive compression
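The first two bottlenecks can be made concrete with a back-of-envelope FLOP count. The sketch below compares the per-token cost of the transformer body against the LM head for an assumed small configuration (6 layers, d_model = 512, 50K vocabulary — illustrative numbers, not taken from any of the cited papers); even at modest sequence lengths, the output matmul claims a large share of total compute.

```python
# Rough per-token FLOP estimate for a decoder-only model, to show why
# the LM head (d_model x vocab matmul) dominates early at small scale.
# All sizes are illustrative assumptions, not measurements.

def body_flops_per_token(n_layers, d_model, seq_len):
    # per layer: Q/K/V/O projections ~ 8*d^2, attention scores ~ 4*seq*d,
    # MLP with 4x expansion ~ 16*d^2; a multiply-add counts as 2 FLOPs
    per_layer = 2 * (8 * d_model**2 + 4 * seq_len * d_model + 16 * d_model**2)
    return n_layers * per_layer

def head_flops_per_token(d_model, vocab_size):
    # one d_model x vocab matmul per output position
    return 2 * d_model * vocab_size

# a toy ~30M-parameter-scale config (assumed)
body = body_flops_per_token(n_layers=6, d_model=512, seq_len=1024)
head = head_flops_per_token(d_model=512, vocab_size=50_000)
print(f"body: {body/1e6:.1f} MFLOPs, head: {head/1e6:.1f} MFLOPs, "
      f"head share: {head / (body + head):.0%}")
```

With these numbers the head alone is roughly a third of per-token training compute, which is why vocabulary handling shows up as a first-order cost long before it would in a frontier-scale model.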
Central papers
- Computational Bottlenecks of Training SLMs (Ashkboos et al., 2024)
- Plan Early (Grangier et al., 2024)
- Vocabulary Compression for Low-Compute Environments (Vennam et al., 2024)
Main takeaways for compact LLM design
- attention efficiency matters earlier than many people expect (Ashkboos et al., 2024)
- logits and vocabulary handling can become a first-order systems and modeling problem (Vennam et al., 2024)
- a design that looks small on paper may still waste compute if it generates too many tokens or routes them through an expensive output path
- recurrent or shared-depth designs are appealing partly because they shift the budget from stored depth toward reusable computation
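The last point — trading stored depth for reusable computation — reduces to a simple parameter-versus-FLOP accounting. The sketch below uses assumed toy sizes to show that a single block applied k times matches the forward compute of k unique blocks while storing only one block's weights.

```python
# Stored depth vs. recurrent reuse: same forward FLOPs, very different
# parameter (byte) budgets. Toy sizes, assumed for illustration.

def block_params(d_model):
    # attention (4*d^2) + MLP with 4x expansion (8*d^2); biases ignored
    return 12 * d_model**2

d, k = 512, 6
unique = k * block_params(d)          # k distinct layers stored
shared = block_params(d)              # one layer reused k times
flops = 2 * k * block_params(d)       # forward matmul FLOPs, same either way
print(f"stored depth: {unique/1e6:.1f}M params, "
      f"recurrent reuse: {shared/1e6:.1f}M params, "
      f"forward FLOPs/token: {flops/1e6:.1f}M in both cases")
```

The 6x parameter saving here is the generic upper bound; in practice shared-depth models give back some of it to extra iterations or adapter weights, but the direction of the trade is the same.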
Why this lane matters conceptually
This lane is not just about speed. It helps explain why some ideas become more attractive under a compact-model budget:
- output-head compression because vocab costs appear early
- recursive width scaling because reused depth can be better than storing many unique thin layers
- iterative refinement over stored depth because extra compute may be cheaper than extra bytes
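To make the first bullet concrete, one generic form of output-head compression is a low-rank factorization: project the hidden state down to a small rank r before the vocabulary matmul. The sketch below counts parameters for assumed toy sizes; this is a hedged illustration of the general idea, not the specific scheme of Vennam et al. (2024).

```python
# Output-head compression via low-rank factorization: replace the
# d_model x vocab matrix with d_model x r and r x vocab factors.
# Toy sizes, assumed; one generic scheme among several.

def full_head_params(d_model, vocab):
    return d_model * vocab

def factored_head_params(d_model, vocab, rank):
    return d_model * rank + rank * vocab

d, V, r = 512, 50_000, 64
full = full_head_params(d, V)
small = factored_head_params(d, V, r)
print(f"full head: {full/1e6:.2f}M params, "
      f"rank-{r} head: {small/1e6:.2f}M params, "
      f"ratio: {small/full:.2f}")
```

The factorization also shrinks the per-token head FLOPs by roughly the same ratio, since both matmuls now pass through the rank-r bottleneck.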
Questions worth keeping connected
- when is sequence reduction better than parameter reduction?
- how often is the LM head the hidden bottleneck in a “small” model?
- when does recurrent reuse save bytes without creating too much extra compute?
- which stabilization methods are worth their overhead because they improve compression enough?
Related
- Tokenizer and vocabulary efficiency
- Recursive and shared-parameter architectures
- The LM head is part of the compression problem
- Compute-for-storage exchange
Ashkboos, S., Mirzadeh, I., Alizadeh, K., Sekhavat, M. H., Nabi, M., Farajtabar, M., & Faghri, F. (2024). Computational Bottlenecks of Training Small-scale Large Language Models. arXiv preprint arXiv:2410.19456. https://arxiv.org/abs/2410.19456
Grangier, D., Katharopoulos, A., Ablin, P., & Hannun, A. (2024). Need a Small Specialized Language Model? Plan Early! arXiv preprint arXiv:2402.01093. https://arxiv.org/abs/2402.01093
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371