Tokenizer choice changes sequence length, output-layer structure, and sometimes the artifact itself. In a compact-LLM setting, that makes it a first-order design lever.

Core question

Should a compact model spend more of its limited budget on:

  • representing words and subwords directly through a large vocabulary, or
  • using a smaller vocabulary and paying extra sequence length or modeling burden elsewhere?

The answer affects both compute (sequence length, logits) and bytes (embedding and head storage).

Why this lane is underexplored

Compact-model work often starts from architecture or quantization, but tokenizer choices also determine:

  • average tokens per byte of source text
  • the size and shape of the LM head
  • how much training compute is spent on logits
  • how efficiently rare strings, code, or multilingual fragments are represented

Several papers suggest these effects can offset surprisingly large model-size differences in practice (Gu et al., 2024; Lotz et al., 2025).
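The first bullet above, tokens per byte (often called fertility), translates directly into training and inference compute. A minimal sketch, using an assumed 100 MB corpus and two illustrative fertility rates (not measured values):

```python
# Sketch: how tokens-per-byte (fertility) sets the total token count a
# model must process. Corpus size and fertility rates are illustrative.
corpus_bytes = 100_000_000  # hypothetical 100 MB of training text

for name, tokens_per_byte in [("vocab_32k", 0.28), ("vocab_8k", 0.41)]:
    total_tokens = corpus_bytes * tokens_per_byte
    print(f"{name}: {total_tokens / 1e6:.0f}M tokens per epoch")
```

Under these assumed rates, the smaller vocabulary processes roughly 46% more tokens for the same bytes of text, which is compute the large-vocabulary model never pays.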

Central papers

  • Gu et al. (2024), ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model
  • Lotz et al. (2025), Beyond Text Compression: Evaluating Tokenizers Across Scales

Three subproblems inside this lane

1. Sequence efficiency

A tokenizer that packs more bytes into each token shortens sequences, reducing the attention and MLP work repeated at every position.
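Because attention-score work grows quadratically in sequence length while projections and MLPs grow linearly, a shorter sequence saves more than proportionally on the quadratic term. A rough per-layer FLOP sketch, with illustrative sizes and the usual 2-FLOPs-per-multiply-add convention (not a specific model):

```python
# Sketch: per-layer forward FLOPs as a function of sequence length L and
# hidden size d. Attention scores scale as L^2; everything else as L.
def layer_flops(L, d, mlp_mult=4):
    attn_proj = 4 * L * d * d          # Q, K, V, and output projections
    attn_scores = 2 * L * L * d        # QK^T and attention-weighted V
    mlp = 2 * L * d * (mlp_mult * d)   # up- and down-projection
    return attn_proj + attn_scores + mlp

d = 512
base = layer_flops(2048, d)
short = layer_flops(1536, d)           # 25% fewer tokens
print(f"FLOP reduction from 25% shorter sequence: {1 - short / base:.1%}")
```

With these assumed sizes the reduction exceeds 25%, because the L² attention-score term shrinks faster than the sequence itself.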

2. Output-head burden

A larger vocabulary grows the LM head and the per-token logits cost, which can start to dominate a compact model's budget earlier than many compact-model discussions admit.
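The head's parameter count is simply d_model × vocab, so its share of a small budget can be checked in one line. A sketch with an assumed 50M-parameter budget and illustrative hidden size:

```python
# Sketch: LM head parameters (d_model x vocab) as a share of a
# hypothetical compact model's total budget. All numbers illustrative;
# untied input embeddings would double the vocabulary-dependent cost.
d_model = 512
total_params = 50_000_000  # assumed 50M-parameter budget

for vocab in (8_000, 32_000, 128_000):
    head = d_model * vocab
    print(f"vocab={vocab}: head={head / 1e6:.1f}M "
          f"({head / total_params:.0%} of budget)")
```

Under these assumptions, a 128k vocabulary's head alone exceeds the entire 50M budget, while an 8k vocabulary leaves over 90% of it for the transformer body.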


3. Domain targeting

A vocabulary matched to the data distribution may act like cheap model capacity, especially in specialized or multilingual settings.
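Domain fit can be probed by measuring fertility per domain. A toy sketch: real experiments would compare trained BPE vocabularies, but whitespace and character splitting serve as crude stand-ins here, and the sample strings are invented:

```python
# Sketch: fertility (tokens per byte) of two stand-in tokenizers on
# different domains. Whitespace/char splitting are crude proxies for
# trained vocabularies; samples are illustrative.
samples = {
    "prose": "the model predicts the next token given the context",
    "code":  "for (int i = 0; i < n; ++i) { sum += x[i]; }",
}
tokenizers = {
    "word-level": lambda s: s.split(),
    "char-level": lambda s: list(s),
}
for domain, text in samples.items():
    nbytes = len(text.encode())
    for name, tok in tokenizers.items():
        print(f"{domain:5s} {name}: {len(tok(text)) / nbytes:.2f} tokens/byte")
```

The gap between the two tokenizers differs by domain, which is the sense in which a matched vocabulary behaves like cheap capacity: it buys shorter sequences exactly where the data lives.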

Parameter Golf implications

  • vocabulary size is both a modeling knob and a storage knob
  • tokenizer quality should be judged with downstream efficiency metrics, not only text-compression ratio
  • the output projection deserves separate conceptual treatment from the rest of model quantization
  • architecture comparisons can be misleading if one tokenizer is quietly making the logits path much cheaper
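The storage-knob point in the first bullet is concrete: the vocabulary-dependent layers have a byte cost that scales with both vocab size and weight bit-width. A sketch with illustrative sizes, assuming untied embeddings and head:

```python
# Sketch: on-disk bytes for the vocabulary-dependent layers (embedding
# table + untied LM head) at different weight bit-widths. Sizes are
# illustrative assumptions, not a specific model.
d_model, vocab = 512, 32_000
params = 2 * vocab * d_model  # embeddings + untied head

for bits in (16, 8, 4):
    size_mb = params * bits / 8 / 1e6
    print(f"{bits}-bit: {size_mb:.1f} MB for vocabulary layers")
```

This is why the fourth bullet matters: a quietly smaller vocabulary shrinks both the logits compute and these stored bytes, and can masquerade as an architectural win.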

Most relevant questions

  • when does a smaller or better-targeted vocab beat a modest architecture change?
  • can the LM head be compressed or factorized enough to rescue a larger vocabulary?
  • does bits-per-byte scoring reward token count reduction enough to justify larger vocab storage?
  • when do tokenizer gains survive once the whole artifact is compressed?
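The bits-per-byte question has a clean arithmetic core: bpb = (tokens per byte) × (nats per token) / ln 2, so the metric rewards token-count reduction and per-token loss jointly. A sketch with invented, illustrative values:

```python
# Sketch: bits-per-byte couples fertility and per-token loss:
#   bpb = tokens_per_byte * nll_per_token / ln(2)
# The fertility and loss values below are illustrative, not measured.
import math

def bits_per_byte(tokens_per_byte, nll_per_token):
    return tokens_per_byte * nll_per_token / math.log(2)

small_vocab = bits_per_byte(0.41, 2.0)  # more tokens, easier per-token task
large_vocab = bits_per_byte(0.28, 2.8)  # fewer tokens, harder per-token task
print(f"small vocab: {small_vocab:.3f} bpb, large vocab: {large_vocab:.3f} bpb")
```

Under these assumed numbers the large vocabulary wins on bpb despite higher per-token loss, but whether that win justifies its extra stored bytes is exactly the question above: it depends on how the artifact's size is scored.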

References

Gu, S., Zhao, M., Zhang, B., Wang, L., Li, J., & Liu, G. (2024). ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model. arXiv preprint arXiv:2410.04335. https://arxiv.org/abs/2410.04335
Lotz, J. F., Lopes, A. V., Peitz, S., Setiawan, H., & Emili, L. (2025). Beyond Text Compression: Evaluating Tokenizers Across Scales. arXiv preprint arXiv:2506.03101. https://arxiv.org/abs/2506.03101