Sources: arXiv:2410.04335 · alphaXiv overview
Core contribution
ReTok shows that tokenizer replacement does not necessarily require end-to-end retraining. The model's tokenizer is swapped, then only the input embeddings and the LM head are retrained while the transformer body stays frozen; this recovers most of the pretrained model's performance under the new, more efficient tokenization scheme.
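The selective-retraining idea can be sketched as a parameter-name filter, in framework-agnostic Python. The module names below are illustrative placeholders, not ReTok's actual architecture:

```python
# Sketch: after a tokenizer swap, mark only the token-facing parameters
# (input embeddings and LM head) as trainable; freeze the transformer body.
# Parameter names are hypothetical, chosen to resemble common LLM layouts.

def trainable_after_retok(param_names):
    """Return the subset of parameter names to retrain after a tokenizer swap."""
    token_facing = ("embed_tokens", "lm_head")
    return {name for name in param_names
            if name.split(".")[0] in token_facing}

params = [
    "embed_tokens.weight",           # input embeddings: resized to new vocab, retrained
    "layers.0.attn.q_proj.weight",   # transformer body: frozen
    "layers.0.mlp.up_proj.weight",   # transformer body: frozen
    "lm_head.weight",                # output head: resized to new vocab, retrained
]

print(sorted(trainable_after_retok(params)))
# → ['embed_tokens.weight', 'lm_head.weight']
```

In a real framework this filter would decide which parameters get `requires_grad=True` (or its equivalent) before the adaptation run.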
Why this matters for Parameter Golf
This is one of the strongest direct supports for tokenizer and vocabulary efficiency. The paper breaks a common assumption that tokenization is too entangled with the whole network to be a practical lever. For a byte-constrained challenge, that matters because tokenizer decisions affect sequence length, embedding size, and output-head cost all at once.
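The coupling between vocabulary size, embedding size, and output-head cost can be made concrete with a back-of-the-envelope count. The numbers below are hypothetical, not from the paper:

```python
def token_facing_params(vocab_size, d_model, tied=False):
    """Parameters in the input embedding matrix plus the LM head.

    Both matrices are (vocab_size x d_model); with weight tying they
    share storage, so only one copy is counted.
    """
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

# Hypothetical example: growing the vocab from 32k to 128k at d_model=2048.
small = token_facing_params(32_000, 2048)   # 131,072,000 params untied
large = token_facing_params(128_000, 2048)  # 524,288,000 params untied
print(large - small)  # ~393M extra token-facing parameters for the larger vocab
```

For a byte-constrained challenge, this is the trade: a larger vocabulary shortens sequences but the embedding and head grow linearly with it, so the tokenizer choice is a parameter-budget decision, not just a preprocessing one.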
What to import
- Tokenizer changes can be high leverage.
- Embedding and LM-head adaptation may be enough to absorb a tokenizer swap.
- Token count reductions matter most where inputs are long or distribution-shifted.
What not to over-import
ReTok does not mean every tokenizer replacement is easy, or that a better compression ratio automatically implies a better challenge score. The paper's real value is as a feasibility result: tokenizer work can be an engineering-tractable lever rather than a full-model restart.
Best synthesis links
- Connects naturally to Beyond Text Compression, which warns that tokenizer quality must be judged with richer metrics.
- Pairs with Vocabulary Compression because tokenizer and output-head design are tightly coupled.
- Reinforces tokenizer efficiency by showing that tokenization is not merely preprocessing.
Parameter Golf translation
ReTok motivates asking:
- Is a better tokenizer worth more than a small architecture tweak?
- Can the tokenizer/output stack be improved without disturbing the rest of the model family too much?
- How much sequence-length reduction survives once measured in bits per byte rather than bits per token?
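The last question can be checked mechanically: bits per byte is bits per token scaled by tokens per byte, so a tokenizer that cuts token count only wins if per-token loss does not rise by the same factor. A minimal sketch, with made-up loss values:

```python
import math

def bits_per_byte(nll_per_token_nats, n_tokens, n_bytes):
    """Convert mean per-token negative log-likelihood (nats) to bits per byte."""
    bits_per_token = nll_per_token_nats / math.log(2)
    return bits_per_token * (n_tokens / n_bytes)

# Hypothetical: the new tokenizer emits fewer tokens over the same bytes,
# but each token is slightly harder to predict.
old = bits_per_byte(nll_per_token_nats=2.0, n_tokens=250, n_bytes=1000)
new = bits_per_byte(nll_per_token_nats=2.4, n_tokens=200, n_bytes=1000)
print(old, new)  # the byte-normalized comparison is what the challenge scores
```

Here the 20% token reduction outweighs the 20% per-token loss increase, because the two factors multiply rather than cancel exactly; comparing raw per-token losses across tokenizers would get this wrong.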
Related
- Tokenizer Evaluation Across Scales
- Vocabulary Compression for Low-Compute Environments
- ALBERT
- Tokenizer and vocabulary efficiency
- Tokenizer efficiency