(Gee et al., 2024)

Sources: arXiv:2402.09977 · alphaXiv overview

Core contribution

Fast Vocabulary Transfer trains a new domain-specific tokenizer and efficiently initializes the new embeddings by inheriting or averaging embeddings from the original vocabulary. The point is not just smaller vocabularies, but a better domain-matched token inventory that can shorten sequences while also shrinking the embedding and output-head burden.
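The transfer step can be sketched in a few lines. This is a minimal illustration with toy dict-based vocabularies and list-of-lists embeddings, not the paper's implementation; `old_tokenize` stands in for whatever the original tokenizer does to an unseen string.

```python
def mean_rows(rows):
    """Element-wise mean of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fvt_init(new_vocab, old_vocab, old_emb, old_tokenize):
    """Fast Vocabulary Transfer initialization: if a new token already exists
    in the old vocabulary, inherit its embedding row; otherwise set the new
    row to the mean of the pieces the *old* tokenizer splits the token into."""
    new_emb = [None] * len(new_vocab)
    for tok, idx in new_vocab.items():
        if tok in old_vocab:
            # Token survives unchanged: copy its row directly.
            new_emb[idx] = list(old_emb[old_vocab[tok]])
        else:
            # Genuinely new token: average the embeddings of its old pieces.
            pieces = [old_emb[old_vocab[p]] for p in old_tokenize(tok)
                      if p in old_vocab]
            new_emb[idx] = mean_rows(pieces) if pieces else mean_rows(old_emb)
    return new_emb
```

Because every new row starts near a meaningful region of the old embedding space, the adapted model needs far less post-transfer training than a random re-initialization would.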

Why this matters for Parameter Golf

This is strong evidence that tokenizer choice is part of model compression, not a preprocessing footnote. If the model lives under a byte cap, then vocabulary rows and sequence length are both expensive, and domain-fit can improve both at once.

What to import

  • Vocabulary size is a real compression lever. It changes both stored embedding/head bytes and runtime token count.
  • Smaller can be better if domain fit improves. Bigger vocabularies are not automatically more efficient.
  • Tokenizer changes compose with other compression methods. Vocabulary optimization is not an alternative to quantization; it can multiply its gains.

What not to over-import

The paper studies domain adaptation rather than a challenge with a fixed upstream tokenizer ecosystem. It does not prove that arbitrary tokenizer replacement is worth the engineering cost here. The more durable lesson is that vocabulary/head design should be optimized jointly with the rest of the byte budget.

Parameter Golf translation

The local question is not “should we always shrink the tokenizer?” It is:

  • which vocabulary rows are paying for themselves
  • whether a smaller or retargeted token inventory reduces both head bytes and token count
  • whether output-head protection is cheaper than tokenizer redesign, or vice versa
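The first question above is empirical and cheap to answer: tokenize a representative domain corpus and count how often each vocabulary id actually appears. A minimal audit sketch (hypothetical helper, assuming the corpus is already tokenized into id sequences):

```python
from collections import Counter

def dead_rows(tokenized_corpus, vocab_size, min_count=1):
    """Return vocabulary ids whose frequency in the corpus falls below
    min_count: candidate rows whose embedding/head bytes are not paying
    for themselves and could be pruned or retargeted."""
    counts = Counter(tid for seq in tokenized_corpus for tid in seq)
    return sorted(i for i in range(vocab_size) if counts[i] < min_count)
```

Rows flagged this way are where tokenizer redesign and output-head protection trade off directly: pruning them shrinks stored bytes, while keeping them only makes sense if they shorten sequences elsewhere.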

Reference

Gee, L., Zugarini, A., Rigutini, L., & Torroni, P. (2024). Fast Vocabulary Transfer for Language Model Compression. arXiv preprint arXiv:2402.09977. https://arxiv.org/abs/2402.09977