Core idea

In compact language models, the LM head is not a negligible final layer. It can be a major consumer of both stored parameters and per-token compute.

That means a model comparison that focuses only on the transformer backbone can miss the real bottleneck.

Why this matters

Changing the vocabulary size alters at least three things at once:

  • token count
  • output-layer size
  • logits computation cost

A larger vocabulary may shorten sequences but create a more expensive head. A smaller vocabulary may save bytes in the head but force the model to process more tokens. The best point is therefore a whole-system tradeoff, not a one-dimensional tokenizer preference.
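The tradeoff above can be made concrete with back-of-envelope arithmetic. The numbers below are hypothetical (the hidden size, vocabulary sizes, and the assumed 15% sequence shortening are illustrative assumptions, not measurements), but they show how head size and total logits cost pull in opposite directions:

```python
# Hypothetical sizes, for illustration only: hidden dimension d, two
# candidate vocabularies, and an assumed tokens-per-document count.
d = 768  # hidden (embedding) dimension

# name -> (vocab_size, avg tokens to encode the same document)
# Assumption: the larger vocabulary shortens sequences by ~15%.
candidates = {
    "small_vocab": (32_000, 1_000),
    "large_vocab": (128_000, 850),
}

for name, (V, tokens) in candidates.items():
    head_params = V * d          # untied output projection: a V x d matrix
    flops_per_token = 2 * V * d  # one matrix-vector product per position
    total_flops = tokens * flops_per_token
    print(f"{name}: head = {head_params / 1e6:.1f}M params, "
          f"logits cost = {total_flops / 1e9:.2f} GFLOP per document")
```

With these assumed numbers, the large vocabulary roughly quadruples both head storage and total logits compute despite the shorter sequences, so the shortening alone does not pay for the bigger head. Different assumptions can flip the conclusion, which is exactly why this is a whole-system tradeoff.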

Why compact models feel this earlier

Large models can sometimes absorb an inefficient output path because the backbone dominates anyway. Compact models have less slack.

That makes it plausible that:

  • gains from modest backbone improvements are capped by the head
  • tokenizer changes alter the best backbone size
  • output-projection compression becomes a first-order design choice
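The "less slack" claim can be sketched with a standard parameter estimate: a transformer backbone has roughly 12 · layers · d² parameters, while an untied head has V · d. The model shapes below are hypothetical examples, not real configurations:

```python
# Back-of-envelope sketch (assumed shapes, not real model configs).
V = 50_000  # shared vocabulary size for both hypothetical models

def head_fraction(layers: int, d: int) -> float:
    """Fraction of total parameters spent on an untied LM head."""
    backbone = 12 * layers * d * d  # standard rough estimate
    head = V * d
    return head / (backbone + head)

print(f"compact (12 layers, d=512):  head = {head_fraction(12, 512):.0%} of params")
print(f"large   (32 layers, d=4096): head = {head_fraction(32, 4096):.0%} of params")
```

Under these assumptions the head is around 40% of the compact model but only a few percent of the large one, which is why compact models hit this bottleneck first.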

Useful framing

The LM head should be treated as a separate compression target with its own design space:

  • vocabulary size selection
  • tied versus untied structure
  • factorized or low-rank output projections
  • structured compression specific to the output matrix
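As a minimal sketch of the third item, a low-rank head replaces the full V × d projection with two smaller factors B (V × r) and A (r × d). Everything here (sizes, rank, initialization) is an illustrative assumption:

```python
import numpy as np

# Hypothetical sizes for illustration: vocab, hidden dim, low rank.
V, d, r = 50_000, 768, 128

rng = np.random.default_rng(0)
h = rng.standard_normal(d)           # final hidden state at one position

# Full (untied) head: a single V x d projection.
W = rng.standard_normal((V, d)) * 0.01
logits_full = W @ h                  # shape (V,)

# Factorized head: approximate W as B @ A, with B (V x r) and A (r x d).
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((V, r)) * 0.01
logits_lowrank = B @ (A @ h)         # two cheap matvecs, still shape (V,)

full_params = V * d                  # parameters in the full head
lowrank_params = V * r + r * d       # parameters in the factored head
print(f"full: {full_params / 1e6:.1f}M params; "
      f"low-rank: {lowrank_params / 1e6:.1f}M params "
      f"({lowrank_params / full_params:.0%} of full)")
```

At rank 128 the factored head stores roughly 17% of the full head's parameters, and the same factorization cuts the per-token logits matvec cost proportionally. Whether quality survives at a given rank is an empirical question, not something this sketch settles.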

This is the synthesis note behind output-head compression.