The mistake to avoid

Tokenizer work is easy to treat as preprocessing. In compact-model settings it is part of the core systems design.

Why it matters here

A tokenizer changes:

  • sequence length
  • output-layer structure
  • training throughput
  • evaluation-time cost
  • sometimes the artifact budget itself

That makes it a first-class Parameter Golf lever rather than a cosmetic one.
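The levers above can be made concrete with a toy calculation. All numbers here (hidden size, vocabulary sizes, bytes-per-token rates) are illustrative assumptions, not figures from the text; the point is only how vocabulary size feeds both the output-layer structure and the sequence length at once.

```python
# Toy sketch with assumed numbers: how a tokenizer's vocabulary size shows
# up in two of the levers above for a small decoder-only model.

D_MODEL = 512  # assumed hidden size of a compact backbone


def lm_head_params(vocab_size: int, d_model: int = D_MODEL) -> int:
    """Parameters in the output projection (and, if untied, the embedding)."""
    return vocab_size * d_model


def tokens_per_doc(doc_bytes: int, bytes_per_token: float) -> int:
    """Sequence length implied by a tokenizer's compression rate."""
    return round(doc_bytes / bytes_per_token)


# Two hypothetical tokenizers applied to the same 4 kB document:
small = {"vocab": 8_192, "bytes_per_token": 3.2}
large = {"vocab": 65_536, "bytes_per_token": 4.5}

for name, t in (("small-vocab", small), ("large-vocab", large)):
    print(
        name,
        "head params:", lm_head_params(t["vocab"]),
        "tokens per 4 kB doc:", tokens_per_doc(4_096, t["bytes_per_token"]),
    )
```

The larger vocabulary shortens sequences but grows the output layer by the same mechanism, which is exactly why it cannot be treated as a cosmetic preprocessing choice.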

The key tension

A tokenizer can help in one place while hurting in another:

  • larger vocabularies can reduce token count
  • smaller vocabularies can make the output path cheaper
  • domain-targeted vocabularies can act like cheap model capacity
  • poorly matched vocabularies can waste both compute and bytes

So the question is not “which tokenizer compresses text best?” but “which tokenizer gives the best full-model tradeoff?”
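One way to frame "best full-model tradeoff" is as a per-document cost estimate that charges both the backbone and the output layer for every token. The sketch below is a crude proxy under loudly stated assumptions: the hidden size, the per-token backbone cost, and the compression rates are all invented for illustration.

```python
# Crude full-model cost proxy (all constants are assumed, toy values):
# compare tokenizers by total compute per document, not compression alone.

D_MODEL = 512          # assumed hidden size
BACKBONE_FLOPS = 2e6   # assumed per-token backbone cost, toy constant


def per_doc_cost(doc_bytes: int, bytes_per_token: float, vocab: int) -> float:
    """Tokens per document times (backbone cost + output-projection cost)."""
    n_tokens = doc_bytes / bytes_per_token
    logits_flops = 2 * D_MODEL * vocab  # output projection, per token
    return n_tokens * (BACKBONE_FLOPS + logits_flops)


# Under these toy constants, the tokenizer that compresses better (fewer
# tokens) still loses once its larger output layer is charged per token:
print(per_doc_cost(4_096, 3.2, 8_192))    # small vocab, longer sequences
print(per_doc_cost(4_096, 4.5, 65_536))   # large vocab, shorter sequences
```

Which side wins depends entirely on the constants, which is the tension the bullets above describe: compression alone does not decide it.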

Evidence trail

Gu, S., Zhao, M., Zhang, B., Wang, L., Li, J., & Liu, G. (2024). ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model. arXiv Preprint arXiv:2410.04335. https://arxiv.org/abs/2410.04335
Lotz, J. F., Lopes, A. V., Peitz, S., Setiawan, H., & Emili, L. (2025). Beyond Text Compression: Evaluating Tokenizers Across Scales. arXiv Preprint arXiv:2506.03101. https://arxiv.org/abs/2506.03101
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv Preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371

Working takeaway

Tokenizers are coupled to the backbone through the LM head. That means tokenizer choice should be evaluated together with:

  • the embedding and LM-head parameter budget
  • the sequence lengths it induces on the target data
  • training and evaluation throughput at those lengths
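The LM-head coupling can be made concrete with a toy budget calculation. The hidden size and backbone parameter count below are assumptions chosen for illustration; the takeaway is only that the head's share of a fixed parameter budget grows quickly with vocabulary size.

```python
# Toy illustration (assumed sizes): share of a compact model's parameter
# budget consumed by a tied embedding/LM head at different vocab sizes.

D_MODEL = 512                  # assumed hidden size
BACKBONE_PARAMS = 20_000_000   # assumed non-embedding parameter count


def head_share(vocab: int,
               d_model: int = D_MODEL,
               backbone_params: int = BACKBONE_PARAMS) -> float:
    """Fraction of total parameters held by the tied embedding/LM head."""
    head = vocab * d_model
    return head / (backbone_params + head)


for vocab in (8_192, 32_768, 65_536):
    print(vocab, f"{head_share(vocab):.1%}")
```

At these assumed sizes the head goes from a minority share to the majority of the budget, which is why a vocabulary swap has to be scored against the whole model rather than in isolation.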