Core idea

In compact language models, the LM head is not a negligible final layer. It can be a major consumer of both stored parameters and per-token compute.

That means a model comparison that focuses only on the transformer backbone can miss the real bottleneck.

Why this matters

Changing the vocabulary size alters at least three things at once:

  • token count
  • output-layer size
  • logits computation cost

A larger vocabulary may shorten sequences but create a more expensive head. A smaller vocabulary may save bytes in the head but force the model to process more tokens. The best point is therefore a whole-system tradeoff, not a one-dimensional tokenizer preference.
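The tradeoff above can be made concrete with back-of-envelope arithmetic. The numbers below are hypothetical (the hidden size, vocabulary sizes, and the assumed 15% sequence shortening are illustrative assumptions, not measurements), but they show how head size and total logits cost pull in opposite directions:

```python
# Hypothetical sizes, for illustration only: hidden dimension d, two
# candidate vocabularies, and an assumed tokens-per-document count.
d = 768  # hidden (embedding) dimension

# name -> (vocab_size, avg tokens to encode the same document)
# Assumption: the larger vocabulary shortens sequences by ~15%.
candidates = {
    "small_vocab": (32_000, 1_000),
    "large_vocab": (128_000, 850),
}

for name, (V, tokens) in candidates.items():
    head_params = V * d          # untied output projection: a V x d matrix
    flops_per_token = 2 * V * d  # one matrix-vector product per position
    total_flops = tokens * flops_per_token
    print(f"{name}: head = {head_params / 1e6:.1f}M params, "
          f"logits cost = {total_flops / 1e9:.2f} GFLOP per document")
```

With these assumed numbers, the large vocabulary roughly quadruples both head storage and total logits compute despite the shorter sequences, so the shortening alone does not pay for the bigger head. Different assumptions can flip the conclusion, which is exactly why this is a whole-system tradeoff.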

Why compact models feel this earlier

Large models can sometimes absorb an inefficient output path because the backbone dominates anyway. Compact models have less slack.

That makes it plausible that:

  • gains from modest backbone improvements are capped by the head
  • tokenizer changes alter the best backbone size
  • output-projection compression becomes a first-order design choice
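The "less slack" claim can be sketched with a standard parameter estimate: a transformer backbone has roughly 12 · layers · d² parameters, while an untied head has V · d. The model shapes below are hypothetical examples, not real configurations:

```python
# Back-of-envelope sketch (assumed shapes, not real model configs).
V = 50_000  # shared vocabulary size for both hypothetical models

def head_fraction(layers: int, d: int) -> float:
    """Fraction of total parameters spent on an untied LM head."""
    backbone = 12 * layers * d * d  # standard rough estimate
    head = V * d
    return head / (backbone + head)

print(f"compact (12 layers, d=512):  head = {head_fraction(12, 512):.0%} of params")
print(f"large   (32 layers, d=4096): head = {head_fraction(32, 4096):.0%} of params")
```

Under these assumptions the head is around 40% of the compact model but only a few percent of the large one, which is why compact models hit this bottleneck first.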

Useful framing

The LM head should be treated as a separate compression target with its own design space:

  • vocabulary size selection
  • tied versus untied structure
  • factorized or low-rank output projections
  • structured compression specific to the output matrix
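As a minimal sketch of the third item, a low-rank head replaces the full V × d projection with two smaller factors B (V × r) and A (r × d). Everything here (sizes, rank, initialization) is an illustrative assumption:

```python
import numpy as np

# Hypothetical sizes for illustration: vocab, hidden dim, low rank.
V, d, r = 50_000, 768, 128

rng = np.random.default_rng(0)
h = rng.standard_normal(d)           # final hidden state at one position

# Full (untied) head: a single V x d projection.
W = rng.standard_normal((V, d)) * 0.01
logits_full = W @ h                  # shape (V,)

# Factorized head: approximate W as B @ A, with B (V x r) and A (r x d).
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((V, r)) * 0.01
logits_lowrank = B @ (A @ h)         # two cheap matvecs, still shape (V,)

full_params = V * d                  # parameters in the full head
lowrank_params = V * r + r * d       # parameters in the factored head
print(f"full: {full_params / 1e6:.1f}M params; "
      f"low-rank: {lowrank_params / 1e6:.1f}M params "
      f"({lowrank_params / full_params:.0%} of full)")
```

At rank 128 the factored head stores roughly 17% of the full head's parameters, and the same factorization cuts the per-token logits matvec cost proportionally. Whether quality survives at a given rank is an empirical question, not something this sketch settles.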

This is the synthesis note behind output-head compression.