The mistake to avoid
Tokenizer work is easy to treat as preprocessing. In compact-model settings it is part of the core systems design.
Why it matters here
A tokenizer changes:
- sequence length
- output-layer structure
- training throughput
- evaluation-time cost
- sometimes the artifact budget itself
That makes it a first-class Parameter Golf lever rather than a cosmetic one.
The key tension
A tokenizer can help in one place while hurting in another:
- larger vocabularies can reduce token count
- smaller vocabularies can make the output path cheaper
- domain-targeted vocabularies can act like cheap model capacity
- poorly matched vocabularies can waste both compute and bytes
So the question is not “which tokenizer compresses text best?” but “which tokenizer gives the best full-model tradeoff?”
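The tension above can be made concrete with a back-of-the-envelope calculator. This is a minimal sketch with illustrative assumptions only: the vocabulary sizes, tokens-per-character rates, and backbone size are hypothetical, and weight tying between the embedding and the LM head is assumed.

```python
# Hypothetical sketch: compare two tokenizers by their *full-model* tradeoff,
# not just by compression. All numbers below are illustrative assumptions.

def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters spent on the input embedding and LM head.

    With weight tying the two share one matrix; untied doubles the cost.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model

def full_model_cost(vocab_size: int, d_model: int, backbone_params: int,
                    tokens_per_char: float, chars: int) -> dict:
    """Rough artifact-budget and sequence-length picture for one tokenizer."""
    emb = embedding_params(vocab_size, d_model)
    return {
        "total_params": backbone_params + emb,
        "embedding_share": emb / (backbone_params + emb),
        "tokens_for_doc": int(tokens_per_char * chars),
    }

# Two hypothetical tokenizers on a 10M-parameter backbone with d_model = 256,
# tokenizing a 10,000-character document:
small = full_model_cost(8_000, 256, 10_000_000, tokens_per_char=0.45, chars=10_000)
large = full_model_cost(50_000, 256, 10_000_000, tokens_per_char=0.30, chars=10_000)

print(small)  # cheaper output path, but longer sequences
print(large)  # shorter sequences, but the embedding dominates the budget
```

Under these made-up numbers, the larger vocabulary cuts the token count by a third while pushing the embedding's share of total parameters past half, which is exactly the kind of whole-model comparison the question calls for.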
Evidence trail
- ReTok shows tokenizer replacement can yield meaningful efficiency gains without retraining the whole model. (Gu et al., 2024)
- Beyond Text Compression shows compression alone is not enough to judge tokenizer quality, especially outside easy English-only settings. (Lotz et al., 2025)
- Vocabulary Compression for Low-Compute Environments shows the logits side can become a major memory and throughput bottleneck. (Vennam et al., 2024)
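The logits bottleneck in the last point is easy to quantify: the output logits tensor scales linearly with vocabulary size. A minimal sketch, with hypothetical batch size, sequence length, and dtype:

```python
# Hypothetical sketch: memory taken by the logits tensor alone.
# Batch size, sequence length, and fp32 precision are illustrative assumptions.

def logits_bytes(batch: int, seq_len: int, vocab_size: int,
                 bytes_per_elem: int = 4) -> int:
    """Bytes occupied by one forward pass's logits (batch x seq x vocab)."""
    return batch * seq_len * vocab_size * bytes_per_elem

# Same batch and sequence length, two vocabularies:
small_v = logits_bytes(batch=8, seq_len=2048, vocab_size=8_000)
large_v = logits_bytes(batch=8, seq_len=2048, vocab_size=50_000)

print(f"8k vocab:  {small_v / 2**20:.0f} MiB of logits")
print(f"50k vocab: {large_v / 2**20:.0f} MiB of logits")
```

On these assumed shapes the 50k vocabulary multiplies logits memory by 6.25x relative to the 8k one, which is why the output path can dominate in low-compute settings even when the backbone is tiny.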
Working takeaway
Tokenizers are coupled to the backbone through the LM head. That means tokenizer choice should be evaluated together with embedding and output-layer size, expected sequence lengths, and the overall parameter budget, not as a standalone compression step.
Related
- Tokenizer and vocabulary efficiency
- The LM head is part of the compression problem
- Output-head compression
Gu, S., Zhao, M., Zhang, B., Wang, L., Li, J., & Liu, G. (2024). ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model. arXiv preprint arXiv:2410.04335. https://arxiv.org/abs/2410.04335
Lotz, J. F., Lopes, A. V., Peitz, S., Setiawan, H., & Emili, L. (2025). Beyond Text Compression: Evaluating Tokenizers Across Scales. arXiv preprint arXiv:2506.03101. https://arxiv.org/abs/2506.03101
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371