Hypothesis

A compact model may achieve a better size-quality trade-off under a 16 MB cap by deliberately shrinking the tokenizer / LM-head burden and spending the saved bytes on a wider shared-depth backbone rather than on a larger static vocabulary path.

The concrete bet is not just “use a smaller vocab.” It is:

  • choose a tokenizer that is cheaper to store and cheaper to score
  • accept some increase in sequence length if needed
  • reinvest the recovered bytes into a stronger recurrent/shared core
  • retune only the embedding / head alignment aggressively, in the spirit of ReTok (Gu et al., 2024)
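The byte arithmetic behind this bet can be sketched directly. The sketch below is illustrative only: the vocabulary sizes, model width, and int8 storage assumption are hypothetical, not a measured baseline.

```python
# Hypothetical byte budget for the vocab path vs. the shared trunk.
# int8 storage assumed: one byte per parameter.
BYTES_PER_PARAM = 1

def vocab_path_bytes(vocab_size: int, d_model: int, tied: bool) -> int:
    """Bytes stored for the embedding table plus the LM head."""
    tables = 1 if tied else 2  # tying stores the matrix once
    return tables * vocab_size * d_model * BYTES_PER_PARAM

def trunk_bytes(d_model: int, unique_blocks: int) -> int:
    """Rough transformer-block cost: ~12 * d_model^2 params per unique block."""
    return 12 * d_model * d_model * unique_blocks * BYTES_PER_PARAM

# Baseline-style config: 16k vocab, untied head, d_model = 256.
before = vocab_path_bytes(16_000, 256, tied=False)   # ~8.2 MB of a 16 MB cap
# The bet: 4k vocab, tied head; the difference funds the trunk.
after = vocab_path_bytes(4_000, 256, tied=True)      # ~1.0 MB
saved = before - after                               # ~7.2 MB recovered
# At this width, the savings cover roughly nine extra unique blocks:
extra_blocks = saved // trunk_bytes(256, 1)
```

The exact ratios shift with quantization and the real vocabulary, but the direction of the exchange is the point: the vocab path scales linearly in vocab size, the trunk quadratically in width.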

Mechanism sketch

A falsifiable version would look like:

  • a somewhat smaller or more domain-matched vocabulary than the current baseline
  • tied or lightly factorized embedding / LM-head weights
  • a wider shared block or fewer unique blocks in the trunk
  • optional tiny phase-specific norms or gates, not full unshared depth

This makes the vocabulary path a budget donor and the recurrent trunk the budget recipient.
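The donor/recipient framing can be made concrete with a parameter-count sketch. All sizes and the factorization rank below are hypothetical choices for illustration:

```python
# Parameter counts for the mechanism: a factorized vocab path donates
# capacity to the shared trunk. All sizes below are hypothetical.

def full_head_params(vocab: int, d: int) -> int:
    """Unfactorized LM head: a dense vocab x d matrix."""
    return vocab * d

def factorized_head_params(vocab: int, d: int, rank: int) -> int:
    """Lightly factorized head: vocab x rank bottleneck, then rank x d."""
    return vocab * rank + rank * d

def phase_gate_params(d: int, n_phases: int) -> int:
    """Tiny phase-specific norms: a scale and shift per phase, ~2*d each."""
    return 2 * d * n_phases

vocab, d, rank = 16_000, 256, 64
donated = full_head_params(vocab, d) - factorized_head_params(vocab, d, rank)
gate_cost = phase_gate_params(d, n_phases=4)
# The per-phase gates cost three orders of magnitude less than the
# donation, so nearly all of the freed capacity reaches the trunk.
```

This is why the sketch insists on "tiny phase-specific norms or gates, not full unshared depth": an unshared block would consume the donation outright, while gates cost ~2*d parameters each.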

Why this might work

Three separate literatures point in the same direction:

  • tokenizer replacement with retrained embedding / head alignment (ReTok; Gu et al., 2024)
  • vocabulary compression for low-compute environments (Vennam et al., 2024)
  • effective parameter sharing in recursive, shared-depth transformers (Bae et al., 2024)

The new connection is that these are not separate decisions. If the head is one of the largest unique parameter sinks, reducing it may be the cleanest way to afford a better shared trunk.

Evidence threads

What would falsify it

This idea should be considered weaker if any of the following happen at matched artifact sizes:

  1. the smaller or retargeted vocabulary increases sequence length enough to erase the gain
  2. head-size savings are smaller than expected after actual compression
  3. the wider shared trunk does not convert the freed bytes into lower post-roundtrip val_bpb
  4. most of the benefit comes from tokenizer-specific overfitting rather than a durable head-to-trunk exchange
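Falsifier 1 reduces to simple arithmetic: post-roundtrip val_bpb is per-token loss times tokens per byte, so any per-token gain must outpace sequence-length inflation. A worked example with hypothetical numbers, showing a case where the falsifier fires:

```python
import math

def val_bpb(loss_nats_per_token: float, tokens_per_byte: float) -> float:
    """Bits per byte = (nats/token converted to bits/token) * tokens/byte."""
    return loss_nats_per_token / math.log(2) * tokens_per_byte

# Baseline tokenizer: 1.10 nats/token, 0.25 tokens per byte of raw text.
baseline = val_bpb(1.10, 0.25)
# Smaller vocab: the wider trunk improves loss to 1.00 nats/token,
# but the coarser tokenizer now spends 0.30 tokens per byte.
candidate = val_bpb(1.00, 0.30)
# The 1.2x sequence-length inflation outweighs the loss improvement,
# so the candidate's bpb comes out higher: the hypothesis loses here.
```

The break-even condition is just loss_new / loss_old < tpb_old / tpb_new; measuring tokens-per-byte on the actual validation corpus before training is therefore a cheap pre-screen for candidate tokenizers.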

Why it matters under the 16 MB cap

The head, embeddings, and tokenizer assets are among the most obvious places where a compact model stores large unique objects. Shared depth, by contrast, turns extra capacity into reusable computation.

If this hypothesis is right, a serious submission should stop asking only “how do we quantize the current head better?” and start asking “how many of those head bytes should exist at all?”

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Gu, S., Zhao, M., Zhang, B., Wang, L., Li, J., & Liu, G. (2024). ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model. arXiv Preprint arXiv:2410.04335. https://arxiv.org/abs/2410.04335
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv Preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371