Sources: arXiv:2603.10145 · alphaXiv overview
Core contribution
The paper argues that the LM head is not only an expressive or memory bottleneck, but a gradient bottleneck: the low-rank hidden-to-vocabulary projection can suppress or distort training signal before it reaches the rest of the model. In other words, the output side can limit optimization efficiency even when it is not the biggest line item in the artifact budget.
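The gradient-bottleneck claim can be made concrete with a minimal numpy sketch (not from the paper; all names and dimensions here are illustrative). If the head is a rank-r factorization, the gradient that reaches the hidden state is confined to an r-dimensional subspace, regardless of the hidden width d:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, r = 64, 1000, 8  # hidden size, vocab size, head rank (illustrative)

# Factored low-rank head: logits = A @ (B @ h), rank <= r
A = rng.normal(size=(V, r))
B = rng.normal(size=(r, d))

g_logits = rng.normal(size=V)  # stand-in for dL/dlogits

# Backprop through the head: dL/dh = (A @ B)^T @ dL/dlogits
g_h = B.T @ (A.T @ g_logits)

# g_h lies entirely in the r-dimensional row space of B: projecting
# out that subspace leaves (numerically) nothing.
Q, _ = np.linalg.qr(B.T)          # orthonormal basis of B's row space
residual = g_h - Q @ (Q.T @ g_h)  # component outside that subspace
print(np.linalg.norm(residual))   # ~0: signal outside the head's rank never reaches the trunk
```

Whatever the loss "wants" to tell the trunk in the remaining d - r directions is discarded at the output interface, which is the sense in which the head sits on the optimization path.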
Why this matters for Parameter Golf
This matters because compact-model work often treats the head as a byte problem and the trunk as the real learning problem. The paper argues the two are entangled: if the head constrains gradient flow, trunk improvements may plateau, or look weaker than they really are, because the training signal itself is bottlenecked at the output interface.
What to import
- The LM head is part of the optimization path, not just the final projection.
- Output-head design can shape how efficiently a compact model learns.
- Small-model performance may be limited by head geometry before the trunk is obviously saturated.
What not to over-import
This paper is not a direct compression method. It does not tell us which alternative head to ship inside a strict artifact cap. Its value is diagnostic: it upgrades the head from “deployment nuisance” to “possible learning bottleneck,” which changes how we interpret compact-model training results.
Best synthesis links
- Gives a stronger training-side foundation to the output-head budget note.
- Complements VQ-Logits by arguing that head redesign can matter for optimization, not only storage.
- Tightens the case for output-head compression and for joint tokenizer/head design.
Parameter Golf translation
This paper suggests a sharper evaluation question:
- when a compact run stalls, is the trunk really the problem,
- or is the head limiting both byte budget and gradient quality?
That makes output-side experimentation defensible earlier in the search, rather than deferring it behind "compress the trunk harder and only then look at the head."
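One hypothetical diagnostic for that question (a sketch under my own assumptions, not a procedure from the paper): take the head matrix, truncate it to rank r via SVD, and measure how much hidden-state gradient energy each rank retains relative to the full head. If the retained fraction is low at the rank the budget forces, the head is plausibly the limiter:

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 64, 1000                 # illustrative hidden size and vocab size

W = rng.normal(size=(V, d))     # a full-rank head, for comparison
g_logits = rng.normal(size=V)   # stand-in for dL/dlogits
g_full = W.T @ g_logits         # hidden-state gradient through the full head

U, s, Vt = np.linalg.svd(W, full_matrices=False)
retained = {}
for r in (4, 16, 64):
    W_r = (U[:, :r] * s[:r]) @ Vt[:r]   # best rank-r approximation of the head
    g_r = W_r.T @ g_logits
    retained[r] = np.linalg.norm(g_r) / np.linalg.norm(g_full)
    print(r, round(retained[r], 3))
```

The retained fraction is nondecreasing in r and reaches 1.0 at full rank; how fast it climbs is a crude proxy for how much gradient quality a given head budget sacrifices.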
Related
- VQ-Logits
- Vocabulary Compression for Low-Compute Environments
- The LM Head Is Part of the Compression Problem
- Output-head compression
- Tokenizer and vocabulary efficiency