(Gu et al., 2024)

Sources: arXiv:2410.04335 · alphaXiv overview

Core contribution

ReTok shows that tokenizer replacement does not necessarily require end-to-end retraining. By swapping in a new, more efficient tokenizer, reinitializing the input embeddings and the LM head, and then retraining only those two components while the rest of the model stays frozen, the paper recovers most of the alignment between the pretrained model and the new tokenization scheme.
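The adaptation recipe above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the model, layer sizes, and vocabulary size are all made up, and only the freeze/unfreeze pattern reflects the ReTok idea of training embeddings and the LM head while the transformer body stays fixed.

```python
import torch
import torch.nn as nn

# Hypothetical miniature decoder-style LM; names and sizes are illustrative.
class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # input embeddings
        self.body = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)         # output head

    def forward(self, ids):
        return self.lm_head(self.body(self.embed(ids)))

new_vocab = 48_000  # size of the replacement tokenizer (illustrative)
model = TinyLM(vocab_size=new_vocab)

# ReTok-style adaptation: freeze everything, then unfreeze only the
# embedding table and the LM head before fine-tuning on the new tokenization.
for p in model.parameters():
    p.requires_grad = False
for p in model.embed.parameters():
    p.requires_grad = True
for p in model.lm_head.parameters():
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

An optimizer built from `(p for p in model.parameters() if p.requires_grad)` would then touch only the vocabulary-facing parameters, which is what makes the swap cheap relative to a full retrain.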

Why this matters for Parameter Golf

This is one of the strongest direct supports for tokenizer and vocabulary efficiency. The paper breaks a common assumption that tokenization is too entangled with the whole network to be a practical lever. For a byte-constrained challenge, that matters because tokenizer decisions affect sequence length, embedding size, and output-head cost all at once.
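The "all at once" point can be made concrete with back-of-envelope arithmetic: vocabulary size multiplies into both the embedding table and the untied output head. The `d_model` and vocabulary sizes below are illustrative numbers, not figures from the paper.

```python
# Back-of-envelope: vocabulary size drives embedding AND output-head cost.
d_model = 1024  # hypothetical hidden size

def vocab_params(vocab_size: int, tied_embeddings: bool = False) -> int:
    """Parameters spent on the token vocabulary (embedding table + LM head)."""
    embed = vocab_size * d_model
    head = 0 if tied_embeddings else vocab_size * d_model
    return embed + head

small = vocab_params(32_000)    # 32k-entry tokenizer
large = vocab_params(128_000)   # 128k-entry tokenizer
extra = large - small           # parameter cost of the bigger vocabulary
```

The larger vocabulary buys shorter sequences but spends roughly 200M extra parameters here, which is exactly the trade a byte-constrained challenge forces you to price explicitly.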

What to import

  • Tokenizer changes can be high leverage.
  • Embedding and LM-head adaptation may be enough to absorb a tokenizer swap.
  • Token count reductions matter most where inputs are long or distribution-shifted.

What not to over-import

ReTok does not mean every tokenizer replacement is easy, or that a better compression ratio automatically implies a better challenge score. The paper's real value is as a feasibility result: tokenizer work can be an engineering-tractable lever rather than a full-model restart.

Parameter Golf translation

ReTok motivates asking:

  • Is a better tokenizer worth more than a small architecture tweak?
  • Can the tokenizer/output stack be improved without disturbing the rest of the model family too much?
  • How much sequence-length reduction survives once measured in bits per byte rather than bits per token?
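The bits-per-byte question above comes down to a unit conversion: normalize the model's total coding cost by the raw byte length of the text instead of by token count. A minimal sketch, with made-up losses and token counts for two hypothetical tokenizers:

```python
import math

def bits_per_byte(nll_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert an average per-token loss (in nats) into bits per byte of raw text."""
    total_bits = nll_nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Hypothetical comparison on the same 1000-byte document:
# tokenizer A: 2.0 nats/token over 250 tokens; B: 2.2 nats/token over 200 tokens.
a = bits_per_byte(2.0, 250, 1000)
b = bits_per_byte(2.2, 200, 1000)
```

Here B wins on bits per byte despite the worse per-token loss, because it compresses the document into fewer tokens; this is why per-token metrics cannot be compared across tokenizers directly.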

Gu, S., Zhao, M., Zhang, B., Wang, L., Li, J., & Liu, G. (2024). ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model. arXiv preprint arXiv:2410.04335. https://arxiv.org/abs/2410.04335