Tokenizer choice changes sequence length, output-layer structure, and sometimes the byte size of the shipped artifact itself. In a compact-LLM setting, that makes it a first-order design lever.
Core question
Should a compact model spend more of its limited budget on:
- representing words and subwords directly through a large vocabulary, or
- using a smaller vocabulary and paying extra sequence length or modeling burden elsewhere?
The answer affects both compute and bytes.
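The trade can be put in back-of-envelope form. A sketch, with entirely hypothetical vocab sizes, fertilities, and dimensions chosen only to make the two costs visible:

```python
# Back-of-envelope sketch of the vocab-size trade-off. All numbers
# are made up for illustration, not measured from any model.

def vocab_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters spent on the embedding table plus the LM head.

    With weight tying the two share one matrix; untied doubles it.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model

def seq_cost(tokens_per_byte: float, n_bytes: int) -> int:
    """Tokens processed for a fixed amount of raw text."""
    return round(tokens_per_byte * n_bytes)

# A 32k vocab vs an 8k vocab at d_model = 512, untied head:
big   = vocab_params(32_000, 512, tied=False)   # ~32.8M params
small = vocab_params(8_000, 512, tied=False)    #  ~8.2M params

# Suppose the smaller vocab tokenizes less efficiently
# (illustrative fertility numbers):
long_seq  = seq_cost(0.40, 1_000_000)   # 400k tokens for 1 MB of text
short_seq = seq_cost(0.28, 1_000_000)   # 280k tokens for the same text

print(big - small)            # params freed by shrinking the vocab
print(long_seq / short_seq)   # ~1.43x more per-token work with small vocab
```

The same text costs either parameters (large vocab) or tokens (small vocab); the question is which currency the budget can better afford.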
Why this lane is underexplored
Compact-model work often starts from architecture or quantization, but tokenizer choices also determine:
- average tokens per byte of source text
- the size and shape of the LM head
- how much training compute is spent on logits
- how efficiently rare strings, code, or multilingual fragments are represented
Several papers suggest these effects can offset surprisingly large model-size differences in practice (Gu et al., 2024; Lotz et al., 2025).
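The first two quantities in the list are cheap to instrument. A minimal harness, assuming only that a tokenizer is exposed as a callable from text to token ids (the toy whitespace tokenizer is a stand-in, not a real one):

```python
# Minimal bookkeeping harness for the quantities above: tokens per
# byte of source text, and the shape of the LM head a vocab implies.

from typing import Callable, List

def tokens_per_byte(tokenize: Callable[[str], List[int]], text: str) -> float:
    n_bytes = len(text.encode("utf-8"))
    return len(tokenize(text)) / n_bytes

def lm_head_shape(vocab_size: int, d_model: int) -> tuple:
    # The output projection is a (d_model x vocab_size) matrix,
    # so its cost scales linearly with the vocabulary.
    return (d_model, vocab_size)

# Toy stand-in: one token per whitespace-separated word.
toy = lambda s: list(range(len(s.split())))

print(tokens_per_byte(toy, "hello tokenizer world"))  # 3 tokens / 21 bytes
print(lm_head_shape(8_000, 512))                      # (512, 8000)
```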
Three subproblems inside this lane
1. Sequence efficiency
A tokenizer can shorten the total sequence for the same text, reducing the attention and MLP work that is repeated at every layer.
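Standard transformer FLOP accounting (constants omitted) makes the leverage concrete: attention scales quadratically in sequence length and the MLP linearly, so shorter sequences cut both terms. The token counts below are illustrative:

```python
# Rough per-layer FLOP model: attention is quadratic in sequence
# length, the MLP linear, so shrinking n helps both terms.

def layer_flops(n_tokens: int, d_model: int, ff_mult: int = 4) -> int:
    attn = n_tokens * n_tokens * d_model           # QK^T and AV products
    mlp = n_tokens * ff_mult * d_model * d_model   # up + down projections
    return attn + mlp

# Same text, two hypothetical tokenizers: 400 vs 280 tokens at d_model = 512.
print(layer_flops(400, 512) / layer_flops(280, 512))
```

At these (made-up) settings the longer tokenization costs roughly 1.5x the per-layer compute for identical underlying text.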
2. Output-head burden
A larger vocabulary grows the LM head and the per-token logits cost, which can dominate total compute at smaller scales than many compact-model discussions acknowledge.
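A sketch of when that crossover happens, using the standard ~2 FLOPs-per-parameter-per-token approximation; the 10M-parameter body and 32k vocab are hypothetical:

```python
# When does the logits path dominate? Per-token cost of the LM head
# is ~2 * d_model * vocab_size multiply-adds; the transformer body
# costs roughly 2 FLOPs per parameter per token.

def head_flops_per_token(d_model: int, vocab_size: int) -> int:
    return 2 * d_model * vocab_size

def body_flops_per_token(n_params_body: int) -> int:
    return 2 * n_params_body

# A 10M-parameter body with a 32k vocab at d_model = 512:
head = head_flops_per_token(512, 32_000)   # ~32.8M FLOPs/token
body = body_flops_per_token(10_000_000)    #   20M FLOPs/token
print(head / body)  # the head alone outweighs the entire body
```

At this (made-up) scale the head is already ~1.6x the body, which is exactly the regime where compact models live.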
3. Domain targeting
A vocabulary matched to the data distribution may act like cheap model capacity, especially in specialized or multilingual settings.
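A toy illustration of the effect, using a greedy longest-match tokenizer and two hypothetical vocabularies (both vocab contents are invented; real tokenizers use BPE or unigram-LM merges, not hand-picked pieces):

```python
# Toy domain-targeting demo: the same greedy longest-match tokenizer,
# scored by token count on a code-like snippet, with a generic vs a
# code-targeted vocabulary. Both vocabularies are made up.

def greedy_tokenize(text: str, vocab: set, max_len: int = 12) -> list:
    """Longest-match segmentation; single characters as fallback."""
    out, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in vocab:
                out.append(piece)
                i += l
                break
    return out

snippet = "def forward(self, x):"
generic = {"de", "f ", "or", "wa", "rd", "se", "lf"}
domain  = {"def ", "forward", "(self, ", "x):"}

print(len(greedy_tokenize(snippet, generic)))  # many short pieces
print(len(greedy_tokenize(snippet, domain)))   # few in-domain pieces
```

The domain vocabulary covers the snippet in a handful of tokens where the generic one fragments it, which is the "cheap capacity" intuition in miniature.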
Parameter Golf implications
- vocabulary size is both a modeling knob and a storage knob
- tokenizer quality should be judged with downstream efficiency metrics, not only text-compression ratio
- the output projection deserves separate conceptual treatment from the rest of model quantization
- architecture comparisons can be misleading if one tokenizer is quietly making the logits path much cheaper
Most relevant questions
- when does a smaller or better-targeted vocab beat a modest architecture change?
- can the LM head be compressed or factorized enough to rescue a larger vocabulary?
- does bits-per-byte scoring reward token count reduction enough to justify larger vocab storage?
- when do tokenizer gains survive once the whole artifact is compressed?
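On the bits-per-byte question: the metric normalizes by raw text size rather than token count, so a tokenizer that emits fewer tokens at the same per-token loss scores strictly better. A sketch with illustrative numbers (in practice per-token loss tends to rise as tokens get longer, which is exactly what the metric arbitrates):

```python
import math

# bits-per-byte = total cross-entropy in bits, divided by raw bytes.
# Fewer tokens at equal per-token loss -> lower (better) bpb.

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    total_bits = mean_nll_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

# Same 1 MB of text, same hypothetical per-token loss, two fertilities:
print(bits_per_byte(2.0, 400_000, 1_000_000))  # longer tokenization
print(bits_per_byte(2.0, 280_000, 1_000_000))  # shorter -> lower bpb
```

Whether that bpb gain justifies the extra vocab storage once the artifact is compressed is precisely the open question above.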
Related
- Output-head compression
- Tokenizer efficiency
- The LM head is part of the compression problem
- Training economics and small-model bottlenecks