(Lotz et al., 2025)

Sources: arXiv:2506.03101 · alphaXiv overview

Core contribution

The paper argues that tokenizer quality should not be judged by compression alone. Across scales, and especially outside narrow English-only settings, a richer mix of intrinsic metrics predicts downstream performance better. It also suggests that smaller models can often rank tokenizer options reliably, which is a substantial practical saving in compute.

Why this matters for Parameter Golf

This note guards against a common failure mode in tokenizer research. In this challenge, it would be easy to overfocus on sequence compression and ignore whether a tokenizer actually improves downstream efficiency and quality. The paper argues that this shortcut is often unreliable.

What to import

  • Compression is necessary but insufficient.
  • Tokenizer evaluation deserves its own methodology.
  • Small proxy models can still be informative for ranking tokenizer choices.
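The proxy-model point can be framed as a rank-correlation question: does a cheap small-scale evaluation order tokenizer candidates the same way the expensive target-scale evaluation would? A minimal sketch, where all the scores are invented placeholders rather than numbers from the paper:

```python
# Sketch: does a small proxy model rank tokenizer candidates the same way
# a large model would? All scores below are illustrative assumptions.

def spearman(xs, ys):
    """Spearman rank correlation for score lists without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical eval scores for four tokenizer candidates.
small_model_scores = [0.52, 0.61, 0.48, 0.57]   # cheap proxy runs
large_model_scores = [0.70, 0.78, 0.66, 0.74]   # expensive target runs

rho = spearman(small_model_scores, large_model_scores)
print(rho)  # 1.0 for these scores: the proxy preserves the ranking exactly
```

A correlation near 1 would justify doing candidate selection at small scale; a low correlation would mean the proxy is misleading and the shortcut should not be trusted.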

What not to over-import

The paper does not imply that every intrinsic metric suite is worth the overhead in a fast-moving local loop. The important lesson is epistemic: tokenizer work needs better validation than simple token-count anecdotes.

How it connects

  • Refines tokenizer efficiency by widening the evaluation lens.
  • Complements ReTok, which shows that tokenizer replacement can be practical once a good tokenizer is identified.
  • Pairs with Plan Early because early budget-aware design depends on evaluating the right tokenization objective.

Parameter Golf translation

This paper suggests that tokenizer candidates should be compared on a bundle of questions:

  • how much do they shorten sequences?
  • what do they do to downstream quality at small scale?
  • do their benefits survive the actual scoring metric rather than only token-count summaries?
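
The bundle above can be wired into a tiny comparison harness. A minimal sketch, where the candidate tokenizers, the corpus, and the proxy-quality scores are all invented placeholders (a real run would train a small model per tokenizer to obtain the quality numbers):

```python
# Hypothetical harness: rank tokenizer candidates on more than compression.
# The tokenizers, corpus, and quality scores are illustrative assumptions.

CORPUS = ["the quick brown fox", "tokenizers compress text", "beyond compression"]

def whitespace_tok(text):
    return text.split()

def char_tok(text):
    return list(text)

def avg_tokens_per_char(tok, corpus):
    """Compression proxy: fewer tokens per character is better."""
    toks = sum(len(tok(t)) for t in corpus)
    chars = sum(len(t) for t in corpus)
    return toks / chars

# Stand-in for downstream quality measured with a small proxy model.
proxy_quality = {"whitespace": 0.61, "char": 0.48}

candidates = {"whitespace": whitespace_tok, "char": char_tok}

def rank(candidates, corpus, quality):
    rows = []
    for name, tok in candidates.items():
        rows.append({
            "tokenizer": name,
            "tokens_per_char": avg_tokens_per_char(tok, corpus),
            "proxy_quality": quality[name],
        })
    # Sort by proxy quality first, compression second: compression alone
    # would be the token-count-only shortcut the paper warns against.
    return sorted(rows, key=lambda r: (-r["proxy_quality"], r["tokens_per_char"]))

if __name__ == "__main__":
    for row in rank(candidates, CORPUS, proxy_quality):
        print(row)
```

The design choice worth keeping is the sort key: downstream quality outranks compression, so a candidate that shortens sequences but hurts the actual scoring metric cannot win by default.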
Reference

Lotz, J. F., Lopes, A. V., Peitz, S., Setiawan, H., & Emili, L. (2025). Beyond Text Compression: Evaluating Tokenizers Across Scales. arXiv preprint arXiv:2506.03101. https://arxiv.org/abs/2506.03101