(Godey & Artzi, 2026)

Sources: arXiv:2603.10145 · alphaXiv overview

Core contribution

The paper argues that the LM head is not only an expressivity or memory bottleneck but also a gradient bottleneck: the low-rank hidden-to-vocabulary projection can suppress or distort the training signal before it reaches the rest of the model. In other words, the output side can limit optimization efficiency even when it is not the biggest line item in the artifact budget.
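
To make “gradient bottleneck” concrete: with a linear head, the gradient flowing back into the trunk is dL/dh = Wᵀ · dL/dlogits, so it is confined to the row space of W. The sketch below is illustrative, not the paper’s experiment; the explicitly factorized rank-r head is an assumption chosen to make the subspace constraint visible (a standard full head already caps the backward signal at the hidden dimension d, far below the vocabulary size V).

```python
# Minimal sketch (illustrative, not the paper's setup): a rank-r factorized
# LM head confines the gradient reaching the hidden state to an r-dim subspace.
import torch

d, V, r, batch = 256, 32000, 32, 512           # hidden dim, vocab, head rank, samples

A = torch.randn(r, d) / d ** 0.5               # down-projection of the factorized head
B = torch.randn(V, r) / r ** 0.5               # up-projection to the vocabulary

h = torch.randn(batch, d, requires_grad=True)  # hidden states entering the head
targets = torch.randint(V, (batch,))

logits = h @ A.T @ B.T                         # logits = B A h, a rank-r map
loss = torch.nn.functional.cross_entropy(logits, targets)
loss.backward()

# Every per-example gradient dL/dh = A^T B^T g lies in the row space of A,
# so the stacked gradient matrix has rank at most r, regardless of V.
grads = h.grad                                 # (batch, d)
rank = torch.linalg.matrix_rank(grads)
print(f"effective rank of dL/dh over {batch} examples: {rank} (head rank r = {r})")
```

Running this prints a rank of r = 32 even though the hidden dimension is 256: every update the trunk sees has been squeezed through the head’s rank.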

Why this matters for Parameter Golf

Compact-model work often treats the head as a byte problem and the trunk as the real learning problem; this paper argues the two are entangled. If the head constrains gradient flow, then trunk improvements may plateau, or look weaker than they really are, because the training signal itself is bottlenecked at the output interface.

What to import

  • The LM head is part of the optimization path, not just the final projection.
  • Output-head design can shape how efficiently a compact model learns.
  • Small-model performance may be limited by head geometry before the trunk is obviously saturated.

What not to over-import

This paper is not a direct compression method. It does not tell us which alternative head to ship inside a strict artifact cap. Its value is diagnostic: it upgrades the head from “deployment nuisance” to “possible learning bottleneck,” which changes how we interpret compact-model training results.

Parameter Golf translation

This paper suggests a sharper evaluation question:

  • when a compact run stalls, is the trunk really the problem,
  • or is the head limiting both byte budget and gradient quality?

That makes output-side experimentation more defensible earlier in the search than “compress the trunk harder and only then look at the head.”
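
One way to make that question operational during a run is to watch the signal actually flowing back through the head before blaming the trunk. The diagnostic below is a hypothetical sketch, not from the paper; the module layout (a model exposing its pre-head hidden states) and the participation-ratio effective-rank measure are assumptions.

```python
# Hypothetical diagnostic (not from the paper): when a compact run stalls,
# summarize the gradient entering the LM head before blaming the trunk.
import torch

def head_grad_stats(grad: torch.Tensor) -> dict:
    """Summarize dL/dh at the LM head's input.

    grad: (batch, seq, d) or (batch, d) tensor from a backward hook.
    Returns the gradient norm and a participation-ratio effective rank;
    a low rank suggests the head funnels updates through few directions.
    """
    flat = grad.detach().reshape(-1, grad.shape[-1]).float()
    s = torch.linalg.svdvals(flat)                      # spectrum of batched gradients
    eff_rank = (s.sum() ** 2 / (s ** 2).sum()).item()   # participation ratio
    return {"grad_norm": flat.norm().item(), "grad_eff_rank": eff_rank}

# Wiring (assumes a forward pass that exposes the pre-head hidden states h):
#   h.register_hook(lambda g: print(head_grad_stats(g)))
# If grad_eff_rank sits far below the hidden dimension while deeper layers
# still receive healthy gradient norms, suspect the head before the trunk.
```

The participation ratio (Σs)² / Σs² is just one cheap spectral proxy; any summary of the batched head-input gradients would serve the same triage purpose.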

Godey, N., & Artzi, Y. (2026). Lost in Backpropagation: The LM Head is a Gradient Bottleneck. arXiv Preprint arXiv:2603.10145. https://arxiv.org/abs/2603.10145