The mistake to avoid

Tokenizer work is easy to treat as preprocessing. In compact-model settings it is part of the core systems design.

Why it matters here

A tokenizer changes:

  • sequence length
  • output-layer structure
  • training throughput
  • evaluation-time cost
  • sometimes the artifact budget itself

That makes it a first-class Parameter Golf lever rather than a cosmetic one.
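The levers above can be made concrete with a toy calculation. All numbers here (hidden size, vocabulary sizes, bytes-per-token rates) are illustrative assumptions, not figures from the text; the point is only how vocabulary size feeds both the output-layer structure and the sequence length at once.

```python
# Toy sketch with assumed numbers: how a tokenizer's vocabulary size shows
# up in two of the levers above for a small decoder-only model.

D_MODEL = 512  # assumed hidden size of a compact backbone


def lm_head_params(vocab_size: int, d_model: int = D_MODEL) -> int:
    """Parameters in the output projection (and, if untied, the embedding)."""
    return vocab_size * d_model


def tokens_per_doc(doc_bytes: int, bytes_per_token: float) -> int:
    """Sequence length implied by a tokenizer's compression rate."""
    return round(doc_bytes / bytes_per_token)


# Two hypothetical tokenizers applied to the same 4 kB document:
small = {"vocab": 8_192, "bytes_per_token": 3.2}
large = {"vocab": 65_536, "bytes_per_token": 4.5}

for name, t in (("small-vocab", small), ("large-vocab", large)):
    print(
        name,
        "head params:", lm_head_params(t["vocab"]),
        "tokens per 4 kB doc:", tokens_per_doc(4_096, t["bytes_per_token"]),
    )
```

The larger vocabulary shortens sequences but grows the output layer by the same mechanism, which is exactly why it cannot be treated as a cosmetic preprocessing choice.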

The key tension

A tokenizer can help in one place while hurting in another:

  • larger vocabularies can reduce token count
  • smaller vocabularies can make the output path cheaper
  • domain-targeted vocabularies can act like cheap model capacity
  • poorly matched vocabularies can waste both compute and bytes

So the question is not “which tokenizer compresses text best?” but “which tokenizer gives the best full-model tradeoff?”
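One way to frame "best full-model tradeoff" is as a per-document cost estimate that charges both the backbone and the output layer for every token. The sketch below is a crude proxy under loudly stated assumptions: the hidden size, the per-token backbone cost, and the compression rates are all invented for illustration.

```python
# Crude full-model cost proxy (all constants are assumed, toy values):
# compare tokenizers by total compute per document, not compression alone.

D_MODEL = 512          # assumed hidden size
BACKBONE_FLOPS = 2e6   # assumed per-token backbone cost, toy constant


def per_doc_cost(doc_bytes: int, bytes_per_token: float, vocab: int) -> float:
    """Tokens per document times (backbone cost + output-projection cost)."""
    n_tokens = doc_bytes / bytes_per_token
    logits_flops = 2 * D_MODEL * vocab  # output projection, per token
    return n_tokens * (BACKBONE_FLOPS + logits_flops)


# Under these toy constants, the tokenizer that compresses better (fewer
# tokens) still loses once its larger output layer is charged per token:
print(per_doc_cost(4_096, 3.2, 8_192))    # small vocab, longer sequences
print(per_doc_cost(4_096, 4.5, 65_536))   # large vocab, shorter sequences
```

Which side wins depends entirely on the constants, which is the tension the bullets above describe: compression alone does not decide it.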

Evidence trail

Gu, S., Zhao, M., Zhang, B., Wang, L., Li, J., & Liu, G. (2024). ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model. arXiv Preprint arXiv:2410.04335. https://arxiv.org/abs/2410.04335
Lotz, J. F., Lopes, A. V., Peitz, S., Setiawan, H., & Emili, L. (2025). Beyond Text Compression: Evaluating Tokenizers Across Scales. arXiv Preprint arXiv:2506.03101. https://arxiv.org/abs/2506.03101
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv Preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371

Working takeaway

Tokenizers are coupled to the backbone through the LM head. That means tokenizer choice should be evaluated together with:

  • the embedding and LM-head parameter budget
  • the sequence lengths it induces on the target data
  • training and evaluation throughput at those lengths
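The LM-head coupling can be made concrete with a toy budget calculation. The hidden size and backbone parameter count below are assumptions chosen for illustration; the takeaway is only that the head's share of a fixed parameter budget grows quickly with vocabulary size.

```python
# Toy illustration (assumed sizes): share of a compact model's parameter
# budget consumed by a tied embedding/LM head at different vocab sizes.

D_MODEL = 512                  # assumed hidden size
BACKBONE_PARAMS = 20_000_000   # assumed non-embedding parameter count


def head_share(vocab: int,
               d_model: int = D_MODEL,
               backbone_params: int = BACKBONE_PARAMS) -> float:
    """Fraction of total parameters held by the tied embedding/LM head."""
    head = vocab * d_model
    return head / (backbone_params + head)


for vocab in (8_192, 32_768, 65_536):
    print(vocab, f"{head_share(vocab):.1%}")
```

At these assumed sizes the head goes from a minority share to the majority of the budget, which is why a vocabulary swap has to be scored against the whole model rather than in isolation.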