(Gee et al., 2024)

Sources: arXiv:2402.09977 · alphaXiv overview

Core contribution

Fast Vocabulary Transfer trains a new domain-specific tokenizer and efficiently initializes the new embeddings by inheriting or averaging embeddings from the original vocabulary. The point is not just smaller vocabularies, but a better domain-matched token inventory that can shorten sequences while also shrinking the embedding and output-head burden.
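The transfer step can be sketched in a few lines. This is a minimal illustration with toy dict-based vocabularies and list-of-lists embeddings, not the paper's implementation; `old_tokenize` stands in for whatever the original tokenizer does to an unseen string.

```python
def mean_rows(rows):
    """Element-wise mean of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fvt_init(new_vocab, old_vocab, old_emb, old_tokenize):
    """Fast Vocabulary Transfer initialization: if a new token already exists
    in the old vocabulary, inherit its embedding row; otherwise set the new
    row to the mean of the pieces the *old* tokenizer splits the token into."""
    new_emb = [None] * len(new_vocab)
    for tok, idx in new_vocab.items():
        if tok in old_vocab:
            # Token survives unchanged: copy its row directly.
            new_emb[idx] = list(old_emb[old_vocab[tok]])
        else:
            # Genuinely new token: average the embeddings of its old pieces.
            pieces = [old_emb[old_vocab[p]] for p in old_tokenize(tok)
                      if p in old_vocab]
            new_emb[idx] = mean_rows(pieces) if pieces else mean_rows(old_emb)
    return new_emb
```

Because every new row starts near a meaningful region of the old embedding space, the adapted model needs far less post-transfer training than a random re-initialization would.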

Why this matters for Parameter Golf

This is strong evidence that tokenizer choice is part of model compression, not a preprocessing footnote. If the model lives under a byte cap, then vocabulary rows and sequence length are both expensive, and domain-fit can improve both at once.

What to import

  • Vocabulary size is a real compression lever. It changes both stored embedding/head bytes and runtime token count.
  • Smaller can be better if domain fit improves. Bigger vocabularies are not automatically more efficient.
  • Tokenizer changes compose with other compression methods. Vocabulary optimization is not an alternative to quantization; it can multiply its gains.

What not to over-import

The paper studies domain adaptation rather than a challenge with a fixed upstream tokenizer ecosystem. It does not prove that arbitrary tokenizer replacement is worth the engineering cost here. The more durable lesson is that vocabulary/head design should be optimized jointly with the rest of the byte budget.

Parameter Golf translation

The local question is not “should we always shrink the tokenizer?” It is:

  • which vocabulary rows are paying for themselves
  • whether a smaller or retargeted token inventory reduces both head bytes and token count
  • whether output-head protection is cheaper than tokenizer redesign, or vice versa
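The first question above is empirical and cheap to answer: tokenize a representative domain corpus and count how often each vocabulary id actually appears. A minimal audit sketch (hypothetical helper, assuming the corpus is already tokenized into id sequences):

```python
from collections import Counter

def dead_rows(tokenized_corpus, vocab_size, min_count=1):
    """Return vocabulary ids whose frequency in the corpus falls below
    min_count: candidate rows whose embedding/head bytes are not paying
    for themselves and could be pruned or retargeted."""
    counts = Counter(tid for seq in tokenized_corpus for tid in seq)
    return sorted(i for i in range(vocab_size) if counts[i] < min_count)
```

Rows flagged this way are where tokenizer redesign and output-head protection trade off directly: pruning them shrinks stored bytes, while keeping them only makes sense if they shorten sequences elsewhere.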

Reference

Gee, L., Zugarini, A., Rigutini, L., & Torroni, P. (2024). Fast Vocabulary Transfer for Language Model Compression. arXiv preprint arXiv:2402.09977. https://arxiv.org/abs/2402.09977