(Vennam et al., 2024)

Sources: arXiv:2411.06371 · alphaXiv overview

Core contribution

The paper argues that the vocabulary / logits side of language modeling can be a major memory and compute bottleneck, especially in low-compute settings. By grouping the vocabulary and predicting over groups rather than over every token directly, it reduces output-layer cost without paying the full quality penalty of a naive vocabulary shrink.
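To make the cost argument concrete, here is a minimal parameter-count sketch of a generic two-level ("grouped") output head. The sizes, group count, and shared within-group projection are illustrative assumptions for this note, not the paper's exact factorization.

```python
# Toy parameter accounting for a grouped output head (illustrative
# assumptions; not the paper's exact method).
d_model = 512        # hidden size (assumed toy config)
vocab = 50_000       # full vocabulary size
n_groups = 250       # vocabulary partitioned into 250 groups
group_size = vocab // n_groups  # 200 tokens per group

# Standard LM head: one d_model x vocab projection.
full_head_params = d_model * vocab

# Grouped head: first project to group logits, then to within-group logits
# (here a single shared within-group projection, an assumption for brevity).
grouped_head_params = d_model * n_groups + d_model * group_size

print(full_head_params)     # 25600000
print(grouped_head_params)  # 230400
```

Even this crude version shrinks the output projection by two orders of magnitude, which is why the output head is worth treating as a design surface rather than a fixed cost.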

Why this matters for Parameter Golf

This paper is easy to underrate because most compact-model discussions obsess over attention, MLPs, and weight quantization. But under a hard artifact cap, embeddings and output projections are often some of the most expensive persistent structures in the whole model. That makes vocabulary-side design an unusually direct byte lever.

What to import

  • The output head is a first-class optimization target.
  • Vocabulary structure can buy both memory and compute savings.
  • Tokenizer and output-head choices are coupled.

What not to over-import

Vocabulary grouping is not free: it changes modeling assumptions and may interact with tokenizer quality in non-obvious ways. The takeaway is not "just shrink the vocab"; it is that the output side deserves explicit design rather than passive inheritance.

Parameter Golf translation

This paper motivates asking:

  • how much of the artifact is really tied up in embeddings and the LM head?
  • could output-side restructuring buy more than another round of trunk compression?
  • should tokenizer and output-layer design be tuned jointly rather than sequentially?
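The first question above can be answered with a back-of-the-envelope tally. This sketch uses an assumed toy transformer config (sizes are mine, not the paper's) to show how quickly embeddings plus an untied LM head dominate a small model's byte budget.

```python
# Rough artifact-byte accounting for a small transformer
# (all config values are illustrative assumptions).
d, layers, vocab = 512, 8, 50_000

embed = vocab * d                  # token embedding table
lm_head = vocab * d                # untied output projection
per_layer = 4 * d * d + 8 * d * d  # attention (Q,K,V,O) + MLP with 4x expansion
trunk = layers * per_layer

total = embed + lm_head + trunk
vocab_share = (embed + lm_head) / total
print(round(vocab_share, 3))  # roughly two thirds of all parameters
```

At this scale the vocabulary side holds around two thirds of the parameters, so another round of trunk compression competes with a much smaller slice of the artifact than output-side restructuring does. Weight tying (sharing the embedding and head matrices) halves the vocabulary share and is the obvious baseline to compare against.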

Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371