Sources: arXiv:2410.19456 · alphaXiv overview
Core contribution
The paper argues that small language model training has its own computational bottlenecks and that standard “bigger-model” intuitions can mislead. FlashAttention, memory behavior, hardware choice, and distributed strategy all interact differently at smaller scales, so raw utilization is often a poor proxy for quality-per-cost.
Why this matters for Parameter Golf
This paper is essential for interpreting local experiments soberly. In a compact-model search loop, apparent modeling gains can actually be schedule or throughput artifacts. Understanding what actually dominates wall-clock time and cost helps separate real research signal from accidental systems wins.
What to import
- Small models are not just scaled-down big models.
- Attention efficiency can matter more than expected.
- Cost-effectiveness and raw utilization are different objectives.
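The last point can be made concrete with a toy ranking. A minimal sketch (all config names and numbers are hypothetical, chosen only to illustrate the distinction): the same set of training setups can rank differently depending on whether you optimize raw utilization or quality-per-cost.

```python
# Hypothetical configs: (model FLOPs utilization, eval quality, cost in USD).
# The numbers are illustrative, not from the paper.
configs = {
    "A100_large_batch": (0.52, 0.71, 120.0),
    "A100_small_batch": (0.38, 0.70, 80.0),
    "consumer_gpu":     (0.30, 0.68, 40.0),
}

# Objective 1: maximize raw utilization (MFU).
by_utilization = max(configs, key=lambda k: configs[k][0])

# Objective 2: maximize quality per dollar spent.
by_quality_per_cost = max(configs, key=lambda k: configs[k][1] / configs[k][2])

print(by_utilization)       # the highest-MFU config
print(by_quality_per_cost)  # the best quality-per-dollar config
```

With these illustrative numbers the two objectives pick different winners, which is the point: a utilization win is not automatically a cost-effectiveness win.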
What not to over-import
The paper does not tell us which architecture to choose, and some hardware-specific conclusions may not transfer directly. Its role in the garden is diagnostic: it tells us which kinds of benchmark wins deserve skepticism and which system knobs may matter more than expected.
Best synthesis links
- Anchors Training economics by grounding cost comparisons in measured bottlenecks rather than raw utilization.
- Supports local benchmark vs official evaluation by clarifying when local speedups are likely to be misleading.
- Connects to Vocabulary Compression because output-side costs can become systems bottlenecks too.
Parameter Golf translation
Use this paper as a filter when judging experiments:
- did quality improve, or did throughput simply change the effective training budget?
- did a runtime tweak masquerade as a modeling win?
- are we over-investing in the wrong systems bottleneck for this scale regime?
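The first two checks above amount to normalizing by the effective training budget before crediting a modeling change. A minimal sketch of that filter (the token rates and durations are hypothetical):

```python
# Hedged sketch: did a "better" run simply see more data in the same
# wall clock? All numbers below are made up for illustration.

def effective_tokens(tokens_per_sec: float, wall_clock_hours: float) -> float:
    """Tokens actually consumed during the run."""
    return tokens_per_sec * wall_clock_hours * 3600

baseline = effective_tokens(tokens_per_sec=40_000, wall_clock_hours=6)
candidate = effective_tokens(tokens_per_sec=52_000, wall_clock_hours=6)

# If the candidate saw ~30% more tokens at equal wall clock, a small
# loss improvement may be a budget artifact, not a modeling win.
budget_ratio = candidate / baseline
print(round(budget_ratio, 2))  # 1.3
```

Only after equalizing (or accounting for) this ratio does a quality delta count as research signal rather than a throughput side effect.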
Related
- Training economics
- Local benchmark vs official evaluation
- Plan Early
- Vocabulary Compression
- Tokenizer efficiency