(Godey & Artzi, 2026)

Sources: arXiv:2603.10145 · alphaXiv overview

Core contribution

The paper argues that the LM head is not only an expressivity or memory bottleneck but also a gradient bottleneck: the low-rank hidden-to-vocabulary projection can suppress or distort the training signal before it reaches the rest of the model. In other words, the output side can limit optimization efficiency even when it is not the biggest line item in the artifact budget.
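
To make “gradient bottleneck” concrete: with a linear head, the gradient flowing back into the trunk is dL/dh = Wᵀ · dL/dlogits, so it is confined to the row space of W. The sketch below is illustrative, not the paper’s experiment; the explicitly factorized rank-r head is an assumption chosen to make the subspace constraint visible (a standard full head already caps the backward signal at the hidden dimension d, far below the vocabulary size V).

```python
# Minimal sketch (illustrative, not the paper's setup): a rank-r factorized
# LM head confines the gradient reaching the hidden state to an r-dim subspace.
import torch

d, V, r, batch = 256, 32000, 32, 512           # hidden dim, vocab, head rank, samples

A = torch.randn(r, d) / d ** 0.5               # down-projection of the factorized head
B = torch.randn(V, r) / r ** 0.5               # up-projection to the vocabulary

h = torch.randn(batch, d, requires_grad=True)  # hidden states entering the head
targets = torch.randint(V, (batch,))

logits = h @ A.T @ B.T                         # logits = B A h, a rank-r map
loss = torch.nn.functional.cross_entropy(logits, targets)
loss.backward()

# Every per-example gradient dL/dh = A^T B^T g lies in the row space of A,
# so the stacked gradient matrix has rank at most r, regardless of V.
grads = h.grad                                 # (batch, d)
rank = torch.linalg.matrix_rank(grads)
print(f"effective rank of dL/dh over {batch} examples: {rank} (head rank r = {r})")
```

Running this prints a rank of r = 32 even though the hidden dimension is 256: every update the trunk sees has been squeezed through the head’s rank.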

Why this matters for Parameter Golf

Compact-model work often treats the head as a byte problem and the trunk as the real learning problem; this paper argues the two are entangled. If the head constrains gradient flow, then trunk improvements may plateau, or look weaker than they really are, because the training signal itself is bottlenecked at the output interface.

What to import

  • The LM head is part of the optimization path, not just the final projection.
  • Output-head design can shape how efficiently a compact model learns.
  • Small-model performance may be limited by head geometry before the trunk is obviously saturated.

What not to over-import

This paper is not a direct compression method. It does not tell us which alternative head to ship inside a strict artifact cap. Its value is diagnostic: it upgrades the head from “deployment nuisance” to “possible learning bottleneck,” which changes how we interpret compact-model training results.

Parameter Golf translation

This paper suggests a sharper evaluation question:

  • when a compact run stalls, is the trunk really the problem,
  • or is the head limiting both byte budget and gradient quality?

That makes output-side experimentation more defensible earlier in the search than “compress the trunk harder and only then look at the head.”
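
One way to make that question operational during a run is to watch the signal actually flowing back through the head before blaming the trunk. The diagnostic below is a hypothetical sketch, not from the paper; the module layout (a model exposing its pre-head hidden states) and the participation-ratio effective-rank measure are assumptions.

```python
# Hypothetical diagnostic (not from the paper): when a compact run stalls,
# summarize the gradient entering the LM head before blaming the trunk.
import torch

def head_grad_stats(grad: torch.Tensor) -> dict:
    """Summarize dL/dh at the LM head's input.

    grad: (batch, seq, d) or (batch, d) tensor from a backward hook.
    Returns the gradient norm and a participation-ratio effective rank;
    a low rank suggests the head funnels updates through few directions.
    """
    flat = grad.detach().reshape(-1, grad.shape[-1]).float()
    s = torch.linalg.svdvals(flat)                      # spectrum of batched gradients
    eff_rank = (s.sum() ** 2 / (s ** 2).sum()).item()   # participation ratio
    return {"grad_norm": flat.norm().item(), "grad_eff_rank": eff_rank}

# Wiring (assumes a forward pass that exposes the pre-head hidden states h):
#   h.register_hook(lambda g: print(head_grad_stats(g)))
# If grad_eff_rank sits far below the hidden dimension while deeper layers
# still receive healthy gradient norms, suspect the head before the trunk.
```

The participation ratio (Σs)² / Σs² is just one cheap spectral proxy; any summary of the batched head-input gradients would serve the same triage purpose.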

Godey, N., & Artzi, Y. (2026). Lost in Backpropagation: The LM Head is a Gradient Bottleneck. arXiv Preprint arXiv:2603.10145. https://arxiv.org/abs/2603.10145