Computational Bottlenecks of Training Small-scale Large Language Models (Ashkboos et al., 2024)

Sources: arXiv:2410.19456 · alphaXiv overview

Core contribution

The paper argues that small language model training has its own computational bottlenecks and that standard “bigger-model” intuitions can mislead. FlashAttention, memory behavior, hardware choice, and distributed strategy all interact differently at smaller scales, so raw utilization is often a poor proxy for quality-per-cost.

Why this matters for Parameter Golf

This paper is essential for interpreting local experiments soberly. In a compact-model search loop, apparent modeling gains can actually be schedule or throughput artifacts. Understanding what truly dominates wall clock and efficiency helps separate real research signal from accidental systems wins.

What to import

  • Small models are not just scaled-down big models.
  • Attention efficiency can matter more than expected.
  • Cost-effectiveness and raw utilization are different objectives.
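The last point can be made concrete with a toy comparison. The sketch below (hypothetical numbers and helper names, not from the paper) contrasts raw utilization, measured as model FLOPs utilization (MFU), with quality-per-dollar: a run can win on one metric and lose on the other.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: fraction of the hardware's peak actually used."""
    return achieved_flops_per_s / peak_flops_per_s

def quality_per_dollar(loss_drop: float, gpu_hours: float, dollars_per_gpu_hour: float) -> float:
    """Cost-effectiveness: validation-loss improvement per dollar spent."""
    return loss_drop / (gpu_hours * dollars_per_gpu_hour)

# Two hypothetical runs on a GPU with a 9e14 FLOP/s peak.
# Run A: higher utilization, but less quality gained per dollar.
run_a = {"mfu": mfu(4.5e14, 9.0e14), "qpd": quality_per_dollar(0.10, 100, 2.0)}
# Run B: lower utilization, yet cheaper per unit of quality.
run_b = {"mfu": mfu(3.0e14, 9.0e14), "qpd": quality_per_dollar(0.12, 80, 2.0)}
```

Under these made-up numbers, run B has worse MFU (0.33 vs. 0.50) but better quality-per-dollar, which is exactly the divergence the paper warns about: optimizing utilization is not the same as optimizing cost-effectiveness.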

What not to over-import

The paper does not tell us which architecture to choose, and some hardware-specific conclusions may not transfer directly. Its role in the garden is diagnostic: it tells us which kinds of benchmark wins deserve skepticism and which system knobs may matter more than expected.

Parameter Golf translation

Use this paper as a filter when judging experiments:

  • did quality improve, or did throughput simply change the effective training budget?
  • did a runtime tweak masquerade as a modeling win?
  • are we over-investing in the wrong systems bottleneck for this scale regime?
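The first two filter questions can be operationalized with a simple check: under a fixed wall-clock budget, any change in tokens-per-second also changes how much data the model sees, so a "modeling" win may just be a larger effective training budget. A minimal sketch (hypothetical numbers and function names, not from the paper):

```python
def tokens_seen(tokens_per_sec: float, wall_clock_sec: float) -> float:
    """Effective training budget under a fixed time budget."""
    return tokens_per_sec * wall_clock_sec

def budget_confounded(baseline_tps: float, tweaked_tps: float,
                      wall_clock_sec: float, tol: float = 0.01) -> bool:
    """Flag a run whose 'modeling' tweak also shifted the data budget by more
    than tol (relative), meaning quality should be compared at matched tokens,
    not matched wall clock."""
    base = tokens_seen(baseline_tps, wall_clock_sec)
    tweaked = tokens_seen(tweaked_tps, wall_clock_sec)
    return abs(tweaked - base) / base > tol

# A 10% throughput bump over a 6-hour run is a confound; a 0.4% jitter is not.
six_hours = 6 * 3600
flagged = budget_confounded(50_000, 55_000, six_hours)
unflagged = budget_confounded(50_000, 50_200, six_hours)
```

When the check flags a run, rerunning the comparison at a matched token count (rather than matched wall clock) separates the modeling effect from the throughput effect.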

Reference

Ashkboos, S., Mirzadeh, I., Alizadeh, K., Sekhavat, M. H., Nabi, M., Farajtabar, M., & Faghri, F. (2024). Computational bottlenecks of training small-scale large language models. arXiv preprint arXiv:2410.19456. https://arxiv.org/abs/2410.19456