Sources: arXiv:2411.06371 · alphaXiv overview
Core contribution
The paper argues that the vocabulary/logits side of language modeling can be a major memory and compute bottleneck, especially in low-compute settings. By grouping vocabulary prediction, it reduces output-layer cost without paying the full quality penalty of naively shrinking the vocabulary.
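The paper's exact grouping scheme isn't reproduced here, but the arithmetic behind the saving is easy to sketch. In a generic two-level output head (the same idea as hierarchical softmax), the model first predicts a group and then a token within that group, so the output projection shrinks from a d×V matrix to roughly d×(G + V/G). The function below is an illustrative parameter count under an equal-sized-groups assumption, not the paper's implementation:

```python
import math

def output_head_params(d_model, vocab_size, num_groups=None):
    """Parameter count of the output projection.

    Flat head: a single d_model x V matrix.
    Grouped head (illustrative two-level scheme, assuming equal-sized
    groups): a d_model x G group classifier plus a d_model x ceil(V/G)
    within-group classifier.
    """
    if num_groups is None:
        return d_model * vocab_size
    per_group = math.ceil(vocab_size / num_groups)
    return d_model * (num_groups + per_group)

d, V = 512, 50_000
flat = output_head_params(d, V)
grouped = output_head_params(d, V, num_groups=224)  # ~sqrt(V) groups
print(flat, grouped, round(flat / grouped, 1))
```

With G near sqrt(V), the head cost drops by roughly two orders of magnitude, which is why the output side is such a direct byte lever for small models.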
Why this matters for Parameter Golf
This paper is easy to underrate because most compact-model discussions obsess over attention, MLPs, and weight quantization. But under a hard artifact cap, embeddings and output projections are often some of the most expensive persistent structures in the whole model. That makes vocabulary-side design an unusually direct byte lever.
What to import
- The output head is a first-class optimization target.
- Vocabulary structure can buy both memory and compute savings.
- Tokenizer and output-head choices are coupled.
What not to over-import
Vocabulary grouping is not free: it changes modeling assumptions and may interact with tokenizer quality in non-obvious ways. The paper does not mean “just shrink the vocab”; it means the output side deserves explicit design rather than passive inheritance.
Best synthesis links
- Reinforces tokenizer and vocabulary efficiency.
- Connects directly to tokenizer efficiency by showing that sequence length is only half the story; logits cost matters too.
- Pairs with ALBERT because factorized embedding parameterization is another way to attack large embedding/output costs.
Parameter Golf translation
This paper motivates asking:
- how much of the artifact is really tied up in embeddings and the LM head?
- could output-side restructuring buy more than another round of trunk compression?
- should tokenizer and output-layer design be tuned jointly rather than sequentially?
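The first question above can be answered with a quick budget estimate. The sketch below uses the common ~12·d² parameters-per-transformer-block rule of thumb for the trunk; that constant is a standard approximation, not a figure from the paper:

```python
def vocab_fraction(d_model, n_layers, vocab_size, tied_embeddings=True):
    """Rough fraction of model parameters held by the vocabulary side.

    Trunk estimate: ~12 * d^2 per transformer block (attention + MLP),
    a common rule of thumb. The vocabulary side is the token embedding
    plus the LM head; tying them halves that cost.
    """
    trunk = 12 * d_model**2 * n_layers
    vocab = d_model * vocab_size * (1 if tied_embeddings else 2)
    return vocab / (trunk + vocab)

# A small model: the vocabulary side can dominate the artifact.
print(round(vocab_fraction(256, 6, 50_000), 2))
# A larger model: the trunk takes over.
print(round(vocab_fraction(2048, 24, 50_000), 2))
```

At small scale the embedding plus head can exceed half the artifact, so output-side restructuring can plausibly buy more than another round of trunk compression; at large scale the lever shrinks.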
Related
- ReTok
- Tokenizer Evaluation Across Scales
- ALBERT
- Tokenizer and vocabulary efficiency
- Training economics
- Tokenizer efficiency