Sources: arXiv:2410.19456 · alphaXiv overview
Core contribution
The paper argues that small language model training has its own computational bottlenecks and that standard “bigger-model” intuitions can mislead. FlashAttention, memory behavior, hardware choice, and distributed strategy all interact differently at smaller scales, so raw utilization is often a poor proxy for quality-per-cost.
Why this matters for Parameter Golf
This paper is essential for interpreting local experiments soberly. In a compact-model search loop, apparent modeling gains can actually be schedule or throughput artifacts. Understanding what actually dominates wall-clock time and cost helps separate real research signal from accidental systems wins.
What to import
- Small models are not just scaled-down big models.
- Attention efficiency can matter more than expected.
- Cost-effectiveness and raw utilization are different objectives.
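The last point can be made concrete with a toy ranking. A minimal sketch (all config names and numbers are hypothetical, chosen only to illustrate the distinction): the same set of training setups can rank differently depending on whether you optimize raw utilization or quality-per-cost.

```python
# Hypothetical configs: (model FLOPs utilization, eval quality, cost in USD).
# The numbers are illustrative, not from the paper.
configs = {
    "A100_large_batch": (0.52, 0.71, 120.0),
    "A100_small_batch": (0.38, 0.70, 80.0),
    "consumer_gpu":     (0.30, 0.68, 40.0),
}

# Objective 1: maximize raw utilization (MFU).
by_utilization = max(configs, key=lambda k: configs[k][0])

# Objective 2: maximize quality per dollar spent.
by_quality_per_cost = max(configs, key=lambda k: configs[k][1] / configs[k][2])

print(by_utilization)       # the highest-MFU config
print(by_quality_per_cost)  # the best quality-per-dollar config
```

With these illustrative numbers the two objectives pick different winners, which is the point: a utilization win is not automatically a cost-effectiveness win.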
What not to over-import
The paper does not tell us which architecture to choose, and some hardware-specific conclusions may not transfer directly. Its role in the garden is diagnostic: it tells us which kinds of benchmark wins deserve skepticism and which system knobs may matter more than expected.
Best synthesis links
- Anchors Training economics by grounding cost comparisons in measured bottlenecks rather than raw utilization.
- Supports local benchmark vs official evaluation by clarifying when local speedups are likely to be misleading.
- Connects to Vocabulary Compression because output-side costs can become systems bottlenecks too.
Parameter Golf translation
Use this paper as a filter when judging experiments:
- did quality improve, or did throughput simply change the effective training budget?
- did a runtime tweak masquerade as a modeling win?
- are we over-investing in the wrong systems bottleneck for this scale regime?
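The first two checks above amount to normalizing by the effective training budget before crediting a modeling change. A minimal sketch of that filter (the token rates and durations are hypothetical):

```python
# Hedged sketch: did a "better" run simply see more data in the same
# wall clock? All numbers below are made up for illustration.

def effective_tokens(tokens_per_sec: float, wall_clock_hours: float) -> float:
    """Tokens actually consumed during the run."""
    return tokens_per_sec * wall_clock_hours * 3600

baseline = effective_tokens(tokens_per_sec=40_000, wall_clock_hours=6)
candidate = effective_tokens(tokens_per_sec=52_000, wall_clock_hours=6)

# If the candidate saw ~30% more tokens at equal wall clock, a small
# loss improvement may be a budget artifact, not a modeling win.
budget_ratio = candidate / baseline
print(round(budget_ratio, 2))  # 1.3
```

Only after equalizing (or accounting for) this ratio does a quality delta count as research signal rather than a throughput side effect.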
Related
- Training economics
- Local benchmark vs official evaluation
- Plan Early
- Vocabulary Compression
- Tokenizer efficiency