Sources: arXiv:2410.04335 · alphaXiv overview
Core contribution
ReTok shows that tokenizer replacement does not necessarily require end-to-end retraining. The model's tokenizer is swapped, then only the input embeddings and the LM head are retrained while the transformer body stays frozen; this recovers most of the pretrained model's performance under the new, more efficient tokenization scheme.
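The selective-retraining idea can be sketched as a parameter-name filter, in framework-agnostic Python. The module names below are illustrative placeholders, not ReTok's actual architecture:

```python
# Sketch: after a tokenizer swap, mark only the token-facing parameters
# (input embeddings and LM head) as trainable; freeze the transformer body.
# Parameter names are hypothetical, chosen to resemble common LLM layouts.

def trainable_after_retok(param_names):
    """Return the subset of parameter names to retrain after a tokenizer swap."""
    token_facing = ("embed_tokens", "lm_head")
    return {name for name in param_names
            if name.split(".")[0] in token_facing}

params = [
    "embed_tokens.weight",           # input embeddings: resized to new vocab, retrained
    "layers.0.attn.q_proj.weight",   # transformer body: frozen
    "layers.0.mlp.up_proj.weight",   # transformer body: frozen
    "lm_head.weight",                # output head: resized to new vocab, retrained
]

print(sorted(trainable_after_retok(params)))
# → ['embed_tokens.weight', 'lm_head.weight']
```

In a real framework this filter would decide which parameters get `requires_grad=True` (or its equivalent) before the adaptation run.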
Why this matters for Parameter Golf
This is one of the strongest direct supports for tokenizer and vocabulary efficiency. The paper breaks a common assumption that tokenization is too entangled with the whole network to be a practical lever. For a byte-constrained challenge, that matters because tokenizer decisions affect sequence length, embedding size, and output-head cost all at once.
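The coupling between vocabulary size, embedding size, and output-head cost can be made concrete with a back-of-the-envelope count. The numbers below are hypothetical, not from the paper:

```python
def token_facing_params(vocab_size, d_model, tied=False):
    """Parameters in the input embedding matrix plus the LM head.

    Both matrices are (vocab_size x d_model); with weight tying they
    share storage, so only one copy is counted.
    """
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

# Hypothetical example: growing the vocab from 32k to 128k at d_model=2048.
small = token_facing_params(32_000, 2048)   # 131,072,000 params untied
large = token_facing_params(128_000, 2048)  # 524,288,000 params untied
print(large - small)  # ~393M extra token-facing parameters for the larger vocab
```

For a byte-constrained challenge, this is the trade: a larger vocabulary shortens sequences but the embedding and head grow linearly with it, so the tokenizer choice is a parameter-budget decision, not just a preprocessing one.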
What to import
- Tokenizer changes can be high leverage.
- Embedding and LM-head adaptation may be enough to absorb a tokenizer swap.
- Token count reductions matter most where inputs are long or distribution-shifted.
What not to over-import
ReTok does not mean every tokenizer replacement is easy, or that a better compression ratio automatically implies a better challenge score. The paper's real value is as a feasibility result: tokenizer work can be an engineering-tractable lever rather than a full-model restart.
Best synthesis links
- Connects naturally to Beyond Text Compression, which warns that tokenizer quality must be judged with richer metrics.
- Pairs with Vocabulary Compression because tokenizer and output-head design are tightly coupled.
- Reinforces tokenizer efficiency by showing that tokenization is not merely preprocessing.
Parameter Golf translation
ReTok motivates asking:
- Is a better tokenizer worth more than a small architecture tweak?
- Can the tokenizer/output stack be improved without disturbing the rest of the model family too much?
- How much sequence-length reduction survives once measured in bits per byte rather than bits per token?
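The last question can be checked mechanically: bits per byte is bits per token scaled by tokens per byte, so a tokenizer that cuts token count only wins if per-token loss does not rise by the same factor. A minimal sketch, with made-up loss values:

```python
import math

def bits_per_byte(nll_per_token_nats, n_tokens, n_bytes):
    """Convert mean per-token negative log-likelihood (nats) to bits per byte."""
    bits_per_token = nll_per_token_nats / math.log(2)
    return bits_per_token * (n_tokens / n_bytes)

# Hypothetical: the new tokenizer emits fewer tokens over the same bytes,
# but each token is slightly harder to predict.
old = bits_per_byte(nll_per_token_nats=2.0, n_tokens=250, n_bytes=1000)
new = bits_per_byte(nll_per_token_nats=2.4, n_tokens=200, n_bytes=1000)
print(old, new)  # the byte-normalized comparison is what the challenge scores
```

Here the 20% token reduction outweighs the 20% per-token loss increase, because the two factors multiply rather than cancel exactly; comparing raw per-token losses across tokenizers would get this wrong.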
Related
- Tokenizer Evaluation Across Scales
- Vocabulary Compression for Low-Compute Environments
- ALBERT
- Tokenizer and vocabulary efficiency
- Tokenizer efficiency