Hypothesis

In compact language models, the LM head and vocabulary path may account for enough of the byte and compute budget that compressing or restructuring them buys more than a comparable improvement to the transformer backbone would.

Why this is plausible

Three facts point in the same direction:

  • tokenizer choice affects how large the vocabulary and logits path must be
  • compact models can hit vocab and LM-head bottlenecks surprisingly early
  • a backbone improvement is partly wasted if the output path remains the dominant cost center
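To make the size argument concrete, here is a back-of-envelope sketch. All dimensions (512-wide, 8 layers, 50k vocabulary, untied head) are illustrative assumptions, not measurements of any specific model:

```python
# Rough parameter shares for a hypothetical compact decoder-only LM.
# All dimensions below are illustrative assumptions.

def lm_param_counts(d_model, n_layers, vocab_size, tied=False):
    """Split rough parameter counts into backbone vs. vocabulary path."""
    # Per layer: attention (~4 * d^2) + MLP with 4x expansion (~8 * d^2).
    backbone = n_layers * 12 * d_model ** 2
    embed = vocab_size * d_model                 # input embedding table
    head = 0 if tied else vocab_size * d_model   # untied output projection
    vocab_path = embed + head
    return backbone, vocab_path, backbone + vocab_path

# A small model: d=512, 8 layers, 50k vocabulary, untied head.
backbone, vocab_path, total = lm_param_counts(512, 8, 50_000)
print(f"backbone:   {backbone / 1e6:6.1f}M ({backbone / total:.0%})")
print(f"vocab path: {vocab_path / 1e6:6.1f}M ({vocab_path / total:.0%})")
```

Under these assumptions the vocabulary path is roughly two thirds of the model, which is the regime the hypothesis cares about; tying the head to the embedding already halves it.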

This is the natural hypothesis version of "The LM head is part of the compression problem."

Candidate mechanisms

  • smaller or more targeted vocabularies
  • factorized or low-rank output projections
  • compression schemes that treat the LM head separately from the rest of the model
  • tying, clustering, or codebook-like structure in the output path
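One of the mechanisms above, a factorized low-rank output projection, can be sketched in a few lines of NumPy. The dimensions and rank are illustrative assumptions, and this shows only the shape and parameter arithmetic, not a trained head:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, rank = 256, 20_000, 64   # illustrative sizes

# Full head: one d_model x vocab projection matrix.
W_full = rng.standard_normal((d_model, vocab)) * 0.02

# Factorized head: d_model -> rank -> vocab, two smaller matrices.
A = rng.standard_normal((d_model, rank)) * 0.02
B = rng.standard_normal((rank, vocab)) * 0.02

h = rng.standard_normal((1, d_model))    # a single hidden state
logits_full = h @ W_full                 # shape (1, vocab)
logits_fact = (h @ A) @ B                # same shape, fewer parameters

full_params = d_model * vocab
fact_params = rank * (d_model + vocab)
print(f"full head:       {full_params / 1e6:.1f}M params")
print(f"factorized head: {fact_params / 1e6:.1f}M params "
      f"({fact_params / full_params:.0%} of full)")
```

The factorization trades expressivity (logits are constrained to a rank-64 subspace) for roughly a 4x parameter reduction at these sizes; whether that trade is acceptable is exactly what the rare-token risk below is about.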

What would support it

  • the same storage budget yielding better end-task performance when bytes move from the LM head into the backbone, or vice versa
  • tokenizer changes altering the best backbone design more than expected
  • compact models whose main bottleneck turns out to be the logits path rather than hidden-state capacity
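The first bullet suggests a concrete experiment shape: hold total parameters fixed and sweep how the budget splits between vocabulary path and backbone. A minimal sketch, with the budget and dimensions as illustrative assumptions:

```python
# Fixed-budget sweep: under a constant total parameter budget, how many
# backbone layers fit as the vocabulary (and hence LM-head size) shrinks?
# The 100M budget and d=512 are illustrative assumptions.

def layers_under_budget(total_params, d_model, vocab_size, tied=False):
    """Backbone layers affordable after paying for the vocabulary path."""
    vocab_path = vocab_size * d_model * (1 if tied else 2)
    remaining = total_params - vocab_path
    per_layer = 12 * d_model ** 2        # attention + 4x MLP, rough
    return max(remaining // per_layer, 0)

BUDGET, D = 100_000_000, 512
for vocab in (100_000, 50_000, 25_000):
    n = layers_under_budget(BUDGET, D, vocab)
    print(f"vocab {vocab:>7,}: {n} layers fit in a {BUDGET / 1e6:.0f}M budget")
```

At these sizes a 100k untied vocabulary leaves no room for a backbone at all, while halving the vocabulary from 50k to 25k buys several extra layers; the hypothesis predicts that somewhere along this sweep the smaller-head configurations win on end-task quality.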

Main risks

  • token counts rise enough to erase the savings, since a smaller vocabulary segments the same text into more tokens
  • factorized or compressed heads damage rare-token behavior too much
  • gains depend on a tokenizer-data match that does not generalize beyond the training distribution
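The first risk can be checked with per-sequence compute arithmetic: a smaller vocabulary shrinks the logits matmul but lengthens tokenizations, so backbone compute grows. The fertility numbers (tokens per word) and dimensions here are illustrative assumptions; whether the savings survive depends entirely on how much fertility rises:

```python
# Sketch of the token-count risk: shrinking the vocabulary shrinks the
# LM-head matmul, but longer tokenizations inflate backbone compute.
# Fertility rates (tokens per word) are illustrative assumptions.

def seq_flops(n_tokens, d_model, n_layers, vocab_size):
    """Rough forward-pass FLOPs for one sequence (~2 FLOPs per parameter)."""
    backbone = n_tokens * n_layers * 24 * d_model ** 2
    head = n_tokens * 2 * d_model * vocab_size   # logits projection
    return backbone + head

D, L, WORDS = 512, 8, 1_000
big_vocab = seq_flops(int(WORDS * 1.3), D, L, 50_000)    # 1.3 tokens/word
small_vocab = seq_flops(int(WORDS * 1.8), D, L, 8_000)   # 1.8 tokens/word
print(f"50k vocab, 1.3 tok/word: {big_vocab / 1e9:.2f} GFLOPs")
print(f" 8k vocab, 1.8 tok/word: {small_vocab / 1e9:.2f} GFLOPs")
```

Under these particular assumptions the small vocabulary still comes out ahead, but by much less than the head reduction alone would suggest; a steeper fertility increase, or a deeper backbone, would flip the sign.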