Sources: arXiv:2402.09977 · alphaXiv overview
Core contribution
Fast Vocabulary Transfer trains a new domain-specific tokenizer and efficiently initializes its embeddings: tokens shared with the original vocabulary inherit their embeddings directly, while each new token is initialized as the average of the embeddings of the old-tokenizer sub-tokens that compose it. The point is not just smaller vocabularies, but a better domain-matched token inventory that can shorten sequences while also shrinking the embedding and output-head burden.
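The inherit-or-average rule can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation; the function name and interface (`transfer_embeddings`, a dict-based old vocabulary, a callable old tokenizer) are assumptions for the sketch.

```python
# Sketch of FVT-style embedding initialization (assumed interface, not the
# paper's code). Shared tokens inherit the pretrained embedding row; unseen
# tokens get the mean of the rows for their old-tokenizer sub-pieces.
import numpy as np

def transfer_embeddings(new_vocab, old_vocab, old_emb, old_tokenize):
    """new_vocab: list of token strings for the domain tokenizer.
    old_vocab: dict mapping old token string -> row index in old_emb.
    old_emb: (|V_old|, d) array of pretrained embeddings.
    old_tokenize: callable splitting a string into old-vocab token strings."""
    d = old_emb.shape[1]
    new_emb = np.zeros((len(new_vocab), d), dtype=old_emb.dtype)
    for i, tok in enumerate(new_vocab):
        if tok in old_vocab:
            # shared token: inherit the pretrained row directly
            new_emb[i] = old_emb[old_vocab[tok]]
        else:
            # new token: average the embeddings of its old sub-tokens
            rows = [old_vocab[p] for p in old_tokenize(tok) if p in old_vocab]
            if rows:
                new_emb[i] = old_emb[rows].mean(axis=0)
    return new_emb
```

The same averaging is typically applied to the tied output head, which is what makes the transfer "fast": no training is needed before fine-tuning starts.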
Why this matters for Parameter Golf
This is strong evidence that tokenizer choice is part of model compression, not a preprocessing footnote. If the model lives under a byte cap, then vocabulary rows and sequence length are both expensive, and domain-fit can improve both at once.
What to import
- Vocabulary size is a real compression lever. It changes both stored embedding/head bytes and runtime token count.
- Smaller can be better if domain fit improves. Bigger vocabularies are not automatically more efficient.
- Tokenizer changes compose with other compression methods. Vocabulary optimization is not an alternative to quantization; it can multiply its gains.
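The first bullet is easy to make concrete with rough byte arithmetic. The numbers below are illustrative, not from the paper, and the helper is a hypothetical accounting function, not any library's API.

```python
# Rough byte accounting for the vocabulary lever (illustrative numbers only).
# Tied embedding/LM-head rows cost vocab_size * d_model * bytes_per_weight;
# untied models pay for separate input and output matrices.
def vocab_bytes(vocab_size, d_model, bytes_per_weight=2, tied=True):
    rows = vocab_size * d_model * bytes_per_weight
    return rows if tied else 2 * rows

general = vocab_bytes(50_000, 768)  # 76,800,000 bytes of fp16 rows
domain  = vocab_bytes(16_000, 768)  # 24,576,000 bytes
saved_mb = (general - domain) / 1e6  # ~52 MB back into the byte budget
```

On top of the stored-byte savings, a domain-fit vocabulary also tokenizes domain text into fewer pieces, so the same cap buys shorter sequences at inference time.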
What not to over-import
The paper studies domain adaptation rather than a challenge with a fixed upstream tokenizer ecosystem. It does not prove that arbitrary tokenizer replacement is worth the engineering cost here. The more durable lesson is that vocabulary/head design should be optimized jointly with the rest of the byte budget.
Best synthesis links
- Sharpens Tokenizer-head co-design under a hard cap.
- Complements Vocabulary Compression for Low-Compute Environments by making the tokenizer transfer path more operational.
- Supports The LM head is part of the compression problem by showing that sequence efficiency and head size should be treated together.
Parameter Golf translation
The local question is not “should we always shrink the tokenizer?” It is:
- which vocabulary rows are paying for themselves
- whether a smaller or retargeted token inventory reduces both head bytes and token count
- whether output-head protection is cheaper than tokenizer redesign, or vice versa
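The first question above can be operationalized as a simple row-usage audit over a domain corpus: count how often each vocabulary id actually appears, and flag the rows that never earn their bytes. A minimal sketch with a hypothetical helper name (`dead_rows` is not from the paper):

```python
# Audit which embedding/head rows "pay for themselves" on a domain corpus.
# Ids whose frequency falls below min_count are candidates for pruning or
# merging before (or instead of) a full tokenizer redesign.
from collections import Counter

def dead_rows(token_ids, vocab_size, min_count=1):
    counts = Counter(token_ids)  # Counter returns 0 for unseen ids
    return [i for i in range(vocab_size) if counts[i] < min_count]
```

In practice `token_ids` would be the concatenated tokenization of a representative domain sample; the audit answers the row-level question cheaply before committing to a transfer.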
Related
- Vocabulary Compression for Low-Compute Environments
- Beyond Text Compression
- ReTok
- Tokenizer-head co-design under a hard cap
- Tokenizer and vocabulary efficiency
- The LM head is part of the compression problem