Sources: arXiv:2411.06371 · alphaXiv overview
Core contribution
The paper argues that the vocabulary/logits side of language modeling can be a major memory and compute bottleneck, especially in low-compute settings. By grouping vocabulary prediction, it reduces output-layer cost without paying the full quality penalty of naively shrinking the vocabulary.
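The paper's exact grouping scheme isn't reproduced here, but the arithmetic behind the saving is easy to sketch. In a generic two-level output head (the same idea as hierarchical softmax), the model first predicts a group and then a token within that group, so the output projection shrinks from a d×V matrix to roughly d×(G + V/G). The function below is an illustrative parameter count under an equal-sized-groups assumption, not the paper's implementation:

```python
import math

def output_head_params(d_model, vocab_size, num_groups=None):
    """Parameter count of the output projection.

    Flat head: a single d_model x V matrix.
    Grouped head (illustrative two-level scheme, assuming equal-sized
    groups): a d_model x G group classifier plus a d_model x ceil(V/G)
    within-group classifier.
    """
    if num_groups is None:
        return d_model * vocab_size
    per_group = math.ceil(vocab_size / num_groups)
    return d_model * (num_groups + per_group)

d, V = 512, 50_000
flat = output_head_params(d, V)
grouped = output_head_params(d, V, num_groups=224)  # ~sqrt(V) groups
print(flat, grouped, round(flat / grouped, 1))
```

With G near sqrt(V), the head cost drops by roughly two orders of magnitude, which is why the output side is such a direct byte lever for small models.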
Why this matters for Parameter Golf
This paper is easy to underrate because most compact-model discussions obsess over attention, MLPs, and weight quantization. But under a hard artifact cap, embeddings and output projections are often some of the most expensive persistent structures in the whole model. That makes vocabulary-side design an unusually direct byte lever.
What to import
- The output head is a first-class optimization target.
- Vocabulary structure can buy both memory and compute savings.
- Tokenizer and output-head choices are coupled.
What not to over-import
Vocabulary grouping is not free: it changes modeling assumptions and may interact with tokenizer quality in non-obvious ways. The paper does not mean “just shrink the vocab”; it means the output side deserves explicit design rather than passive inheritance.
Best synthesis links
- Reinforces tokenizer and vocabulary efficiency.
- Connects directly to tokenizer efficiency by showing that sequence length is only half the story; logits cost matters too.
- Pairs with ALBERT because factorized embedding parameterization is another way to attack large embedding/output costs.
Parameter Golf translation
This paper motivates asking:
- how much of the artifact is really tied up in embeddings and the LM head?
- could output-side restructuring buy more than another round of trunk compression?
- should tokenizer and output-layer design be tuned jointly rather than sequentially?
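The first question above can be answered with a quick budget estimate. The sketch below uses the common ~12·d² parameters-per-transformer-block rule of thumb for the trunk; that constant is a standard approximation, not a figure from the paper:

```python
def vocab_fraction(d_model, n_layers, vocab_size, tied_embeddings=True):
    """Rough fraction of model parameters held by the vocabulary side.

    Trunk estimate: ~12 * d^2 per transformer block (attention + MLP),
    a common rule of thumb. The vocabulary side is the token embedding
    plus the LM head; tying them halves that cost.
    """
    trunk = 12 * d_model**2 * n_layers
    vocab = d_model * vocab_size * (1 if tied_embeddings else 2)
    return vocab / (trunk + vocab)

# A small model: the vocabulary side can dominate the artifact.
print(round(vocab_fraction(256, 6, 50_000), 2))
# A larger model: the trunk takes over.
print(round(vocab_fraction(2048, 24, 50_000), 2))
```

At small scale the embedding plus head can exceed half the artifact, so output-side restructuring can plausibly buy more than another round of trunk compression; at large scale the lever shrinks.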
Related
- ReTok
- Tokenizer Evaluation Across Scales
- ALBERT
- Tokenizer and vocabulary efficiency
- Training economics
- Tokenizer efficiency