Hypothesis
A compact model may achieve a better size-quality trade-off under a 16 MB cap by deliberately shrinking the tokenizer and LM-head footprint and spending the saved bytes on a wider shared-depth backbone rather than on a larger static vocabulary path.
The concrete bet is not just “use a smaller vocab.” It is:
- choose a tokenizer that is cheaper to store and cheaper to score
- accept some increase in sequence length if needed
- reinvest the recovered bytes into a stronger recurrent/shared core
- retune only the embedding / head alignment aggressively, in the spirit of ReTok (Gu et al., 2024)
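The byte exchange behind this bet can be made concrete with rough arithmetic. A minimal sketch, assuming 1 byte per parameter after quantization and hypothetical shapes (none of these numbers come from an actual baseline):

```python
# Illustrative accounting for the vocab-to-trunk byte exchange.
# All sizes are hypothetical; assumes 1 byte per parameter after quantization.

def embedding_bytes(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Bytes for the embedding table; doubled if the LM head is untied."""
    factor = 1 if tied else 2
    return factor * vocab_size * d_model

def block_bytes(d_model: int, mlp_mult: int = 4) -> int:
    """Rough bytes for one transformer block: attention (4*d^2) + MLP (2*mlp_mult*d^2)."""
    return (4 + 2 * mlp_mult) * d_model * d_model

d = 256
baseline_vocab, smaller_vocab = 32_768, 8_192

saved = embedding_bytes(baseline_vocab, d) - embedding_bytes(smaller_vocab, d)
print(f"bytes freed by shrinking the vocab: {saved:,}")                 # 6,291,456
print(f"shared blocks that buys at d={d}: {saved // block_bytes(d)}")   # 8
```

Even with a tied head, the smaller vocabulary frees roughly 6 MB here, enough to pay for several extra applications' worth of shared-block width inside a 16 MB artifact.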
Mechanism sketch
A falsifiable version would look like:
- a somewhat smaller or more domain-matched vocabulary than the current baseline
- tied or lightly factorized embedding / LM-head weights
- a wider shared block or fewer unique blocks in the trunk
- optional tiny phase-specific norms or gates, not full unshared depth
This makes the vocabulary path a budget donor and the recurrent trunk the budget recipient.
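The donor/recipient structure can be sketched in a few lines. This is a toy, with scalars standing in for weight matrices; the names and the residual update rule are assumptions chosen only to show where the unique bytes live:

```python
# Sketch: one shared block reused `depth` times, with a tiny per-step gate.
# The only per-step unique parameters are the gates -- a few floats,
# not full unshared layers. All numeric values are stand-ins.

class SharedTrunk:
    def __init__(self, depth: int):
        self.shared_weight = 0.9            # stands in for one wide shared block
        self.step_gates = [1.0] * depth     # tiny phase-specific parameters

    def unique_param_count(self) -> int:
        """Unique storage: one shared block plus one gate per step."""
        return 1 + len(self.step_gates)

    def forward(self, x: float) -> float:
        for gate in self.step_gates:
            # residual update through the same shared block at every step
            x = x + gate * (self.shared_weight * x)
        return x

trunk = SharedTrunk(depth=4)
y = trunk.forward(1.0)   # four reuses of one block: 1.9 ** 4
```

The point of the sketch is the storage asymmetry: depth grows the gate list by one float per step, while the shared block is stored once no matter how often it is applied.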
Why this might work
Three separate literatures point in the same direction:
- Vocabulary Compression argues that logits and output memory become a bottleneck surprisingly early (Vennam et al., 2024)
- ReTok argues that tokenizer changes can be absorbed mostly through embeddings and the LM head rather than a full architecture reset (Gu et al., 2024)
- Relaxed Recursive Transformers suggests that saved bytes are often better spent on a stronger shared backbone than on many unique layers (Bae et al., 2024)
The new connection is that these are not separate decisions. If the head is one of the largest unique parameter sinks, reducing it may be the cleanest way to afford a better shared trunk.
Evidence threads
- Tokenizer and vocabulary efficiency frames the head as part of the artifact problem, not only a preprocessing choice.
- Training economics argues small models hit logits and sequence-length bottlenecks earlier than expected.
- Recursive and shared-parameter architectures give a natural place to spend recovered bytes: width, stability, or tiny per-step specialization.
What would falsify it
This idea should be considered weaker if any of the following happen at matched artifact sizes:
- the smaller or retargeted vocabulary increases sequence length enough to erase the gain
- head-size savings are smaller than expected after actual compression
- the wider shared trunk does not convert the freed bytes into lower post-roundtrip val_bpb
- most of the benefit comes from tokenizer-specific overfitting rather than a durable head-to-trunk exchange
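The first falsifier can be checked with simple accounting before any training run. A sketch with a fixed token context window and hypothetical tokens-per-byte ratios (both ratios would come from measuring the actual tokenizers):

```python
# Quantifying the first falsifier: a smaller vocabulary tokenizes the same
# bytes into more tokens, which shrinks effective context per window.
# All ratios here are illustrative assumptions, not measurements.

def effective_context_bytes(context_tokens: int, tokens_per_byte: float) -> float:
    """How many input bytes fit in a fixed token context window."""
    return context_tokens / tokens_per_byte

CONTEXT = 2048
base_tpb, retargeted_tpb = 0.25, 0.28   # hypothetical: ~12% token inflation

base_ctx = effective_context_bytes(CONTEXT, base_tpb)        # 8192.0 bytes
new_ctx = effective_context_bytes(CONTEXT, retargeted_tpb)   # ~7314.3 bytes

shrink = 1 - new_ctx / base_ctx   # ~10.7% less context per window
```

If the measured context shrink (and the matching per-byte compute inflation) outweighs the quality bought by the wider trunk, the hypothesis fails on this axis regardless of how the bytes were reinvested.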
Why it matters under the 16 MB cap
The head, embeddings, and tokenizer assets are among the most obvious places where a compact model stores large unique objects. Shared depth, by contrast, turns extra capacity into reusable computation.
If this hypothesis is right, a serious submission should stop asking only “how do we quantize the current head better?” and start asking “how many of those head bytes should exist at all?”
Related
- Output-head compression
- Recursive width scaling
- Tokenizer and vocabulary efficiency
- Recursive and shared-parameter architectures
- The LM head is part of the compression problem