Tokenizer choice changes sequence length, output-layer structure, and sometimes the byte size of the shipped artifact itself. In a compact-LLM setting, that makes it a first-order design lever.
Core question
Should a compact model spend more of its limited budget on:
- representing words and subwords directly through a large vocabulary, or
- using a smaller vocabulary and paying extra sequence length or modeling burden elsewhere?
The answer affects both compute and bytes.
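The trade can be put in back-of-envelope form. A sketch, with entirely hypothetical vocab sizes, fertilities, and dimensions chosen only to make the two costs visible:

```python
# Back-of-envelope sketch of the vocab-size trade-off. All numbers
# are made up for illustration, not measured from any model.

def vocab_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters spent on the embedding table plus the LM head.

    With weight tying the two share one matrix; untied doubles it.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model

def seq_cost(tokens_per_byte: float, n_bytes: int) -> int:
    """Tokens processed for a fixed amount of raw text."""
    return round(tokens_per_byte * n_bytes)

# A 32k vocab vs an 8k vocab at d_model = 512, untied head:
big   = vocab_params(32_000, 512, tied=False)   # ~32.8M params
small = vocab_params(8_000, 512, tied=False)    #  ~8.2M params

# Suppose the smaller vocab tokenizes less efficiently
# (illustrative fertility numbers):
long_seq  = seq_cost(0.40, 1_000_000)   # 400k tokens for 1 MB of text
short_seq = seq_cost(0.28, 1_000_000)   # 280k tokens for the same text

print(big - small)            # params freed by shrinking the vocab
print(long_seq / short_seq)   # ~1.43x more per-token work with small vocab
```

The same text costs either parameters (large vocab) or tokens (small vocab); the question is which currency the budget can better afford.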
Why this lane is underexplored
Compact-model work often starts from architecture or quantization, but tokenizer choices also determine:
- average tokens per byte of source text
- the size and shape of the LM head
- how much training compute is spent on logits
- how efficiently rare strings, code, or multilingual fragments are represented
Several papers suggest these effects can offset surprisingly large model-size differences in practice (Gu et al., 2024; Lotz et al., 2025).
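The first two quantities in the list are cheap to instrument. A minimal harness, assuming only that a tokenizer is exposed as a callable from text to token ids (the toy whitespace tokenizer is a stand-in, not a real one):

```python
# Minimal bookkeeping harness for the quantities above: tokens per
# byte of source text, and the shape of the LM head a vocab implies.

from typing import Callable, List

def tokens_per_byte(tokenize: Callable[[str], List[int]], text: str) -> float:
    n_bytes = len(text.encode("utf-8"))
    return len(tokenize(text)) / n_bytes

def lm_head_shape(vocab_size: int, d_model: int) -> tuple:
    # The output projection is a (d_model x vocab_size) matrix,
    # so its cost scales linearly with the vocabulary.
    return (d_model, vocab_size)

# Toy stand-in: one token per whitespace-separated word.
toy = lambda s: list(range(len(s.split())))

print(tokens_per_byte(toy, "hello tokenizer world"))  # 3 tokens / 21 bytes
print(lm_head_shape(8_000, 512))                      # (512, 8000)
```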
Three subproblems inside this lane
1. Sequence efficiency
A tokenizer can shorten the total sequence for the same text, reducing the attention and MLP work that is repeated at every layer.
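Standard transformer FLOP accounting (constants omitted) makes the leverage concrete: attention scales quadratically in sequence length and the MLP linearly, so shorter sequences cut both terms. The token counts below are illustrative:

```python
# Rough per-layer FLOP model: attention is quadratic in sequence
# length, the MLP linear, so shrinking n helps both terms.

def layer_flops(n_tokens: int, d_model: int, ff_mult: int = 4) -> int:
    attn = n_tokens * n_tokens * d_model           # QK^T and AV products
    mlp = n_tokens * ff_mult * d_model * d_model   # up + down projections
    return attn + mlp

# Same text, two hypothetical tokenizers: 400 vs 280 tokens at d_model = 512.
print(layer_flops(400, 512) / layer_flops(280, 512))
```

At these (made-up) settings the longer tokenization costs roughly 1.5x the per-layer compute for identical underlying text.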
2. Output-head burden
A larger vocabulary grows the LM head and the per-token logits cost, which can dominate total compute at smaller scales than many compact-model discussions acknowledge.
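A sketch of when that crossover happens, using the standard ~2 FLOPs-per-parameter-per-token approximation; the 10M-parameter body and 32k vocab are hypothetical:

```python
# When does the logits path dominate? Per-token cost of the LM head
# is ~2 * d_model * vocab_size multiply-adds; the transformer body
# costs roughly 2 FLOPs per parameter per token.

def head_flops_per_token(d_model: int, vocab_size: int) -> int:
    return 2 * d_model * vocab_size

def body_flops_per_token(n_params_body: int) -> int:
    return 2 * n_params_body

# A 10M-parameter body with a 32k vocab at d_model = 512:
head = head_flops_per_token(512, 32_000)   # ~32.8M FLOPs/token
body = body_flops_per_token(10_000_000)    #   20M FLOPs/token
print(head / body)  # the head alone outweighs the entire body
```

At this (made-up) scale the head is already ~1.6x the body, which is exactly the regime where compact models live.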
3. Domain targeting
A vocabulary matched to the data distribution may act like cheap model capacity, especially in specialized or multilingual settings.
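A toy illustration of the effect, using a greedy longest-match tokenizer and two hypothetical vocabularies (both vocab contents are invented; real tokenizers use BPE or unigram-LM merges, not hand-picked pieces):

```python
# Toy domain-targeting demo: the same greedy longest-match tokenizer,
# scored by token count on a code-like snippet, with a generic vs a
# code-targeted vocabulary. Both vocabularies are made up.

def greedy_tokenize(text: str, vocab: set, max_len: int = 12) -> list:
    """Longest-match segmentation; single characters as fallback."""
    out, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in vocab:
                out.append(piece)
                i += l
                break
    return out

snippet = "def forward(self, x):"
generic = {"de", "f ", "or", "wa", "rd", "se", "lf"}
domain  = {"def ", "forward", "(self, ", "x):"}

print(len(greedy_tokenize(snippet, generic)))  # many short pieces
print(len(greedy_tokenize(snippet, domain)))   # few in-domain pieces
```

The domain vocabulary covers the snippet in a handful of tokens where the generic one fragments it, which is the "cheap capacity" intuition in miniature.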
Parameter Golf implications
- vocabulary size is both a modeling knob and a storage knob
- tokenizer quality should be judged with downstream efficiency metrics, not only text-compression ratio
- the output projection deserves separate conceptual treatment from the rest of model quantization
- architecture comparisons can be misleading if one tokenizer is quietly making the logits path much cheaper
Most relevant questions
- when does a smaller or better-targeted vocab beat a modest architecture change?
- can the LM head be compressed or factorized enough to rescue a larger vocabulary?
- does bits-per-byte scoring reward token count reduction enough to justify larger vocab storage?
- when do tokenizer gains survive once the whole artifact is compressed?
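On the bits-per-byte question: the metric normalizes by raw text size rather than token count, so a tokenizer that emits fewer tokens at the same per-token loss scores strictly better. A sketch with illustrative numbers (in practice per-token loss tends to rise as tokens get longer, which is exactly what the metric arbitrates):

```python
import math

# bits-per-byte = total cross-entropy in bits, divided by raw bytes.
# Fewer tokens at equal per-token loss -> lower (better) bpb.

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    total_bits = mean_nll_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

# Same 1 MB of text, same hypothetical per-token loss, two fertilities:
print(bits_per_byte(2.0, 400_000, 1_000_000))  # longer tokenization
print(bits_per_byte(2.0, 280_000, 1_000_000))  # shorter -> lower bpb
```

Whether that bpb gain justifies the extra vocab storage once the artifact is compressed is precisely the open question above.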
Related
- Output-head compression
- Tokenizer efficiency
- The LM head is part of the compression problem
- Training economics and small-model bottlenecks