Hypothesis
A compact model may achieve a better size-quality trade-off under a 16 MB cap by deliberately shrinking the tokenizer and LM-head footprint and spending the saved bytes on a wider shared-depth backbone rather than on a larger static vocabulary path.
The concrete bet is not just “use a smaller vocab.” It is:
- choose a tokenizer that is cheaper to store and cheaper to score
- accept some increase in sequence length if needed
- reinvest the recovered bytes into a stronger recurrent/shared core
- retune only the embedding / head alignment aggressively, in the spirit of ReTok (Gu et al., 2024)
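The byte exchange behind this bet can be made concrete with rough arithmetic. A minimal sketch, assuming 1 byte per parameter after quantization and hypothetical shapes (none of these numbers come from an actual baseline):

```python
# Illustrative accounting for the vocab-to-trunk byte exchange.
# All sizes are hypothetical; assumes 1 byte per parameter after quantization.

def embedding_bytes(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Bytes for the embedding table; doubled if the LM head is untied."""
    factor = 1 if tied else 2
    return factor * vocab_size * d_model

def block_bytes(d_model: int, mlp_mult: int = 4) -> int:
    """Rough bytes for one transformer block: attention (4*d^2) + MLP (2*mlp_mult*d^2)."""
    return (4 + 2 * mlp_mult) * d_model * d_model

d = 256
baseline_vocab, smaller_vocab = 32_768, 8_192

saved = embedding_bytes(baseline_vocab, d) - embedding_bytes(smaller_vocab, d)
print(f"bytes freed by shrinking the vocab: {saved:,}")                 # 6,291,456
print(f"shared blocks that buys at d={d}: {saved // block_bytes(d)}")   # 8
```

Even with a tied head, the smaller vocabulary frees roughly 6 MB here, enough to pay for several extra applications' worth of shared-block width inside a 16 MB artifact.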
Mechanism sketch
A falsifiable version would look like:
- a somewhat smaller or more domain-matched vocabulary than the current baseline
- tied or lightly factorized embedding / LM-head weights
- a wider shared block or fewer unique blocks in the trunk
- optional tiny phase-specific norms or gates, not full unshared depth
This makes the vocabulary path a budget donor and the recurrent trunk the budget recipient.
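The donor/recipient structure can be sketched in a few lines. This is a toy, with scalars standing in for weight matrices; the names and the residual update rule are assumptions chosen only to show where the unique bytes live:

```python
# Sketch: one shared block reused `depth` times, with a tiny per-step gate.
# The only per-step unique parameters are the gates -- a few floats,
# not full unshared layers. All numeric values are stand-ins.

class SharedTrunk:
    def __init__(self, depth: int):
        self.shared_weight = 0.9            # stands in for one wide shared block
        self.step_gates = [1.0] * depth     # tiny phase-specific parameters

    def unique_param_count(self) -> int:
        """Unique storage: one shared block plus one gate per step."""
        return 1 + len(self.step_gates)

    def forward(self, x: float) -> float:
        for gate in self.step_gates:
            # residual update through the same shared block at every step
            x = x + gate * (self.shared_weight * x)
        return x

trunk = SharedTrunk(depth=4)
y = trunk.forward(1.0)   # four reuses of one block: 1.9 ** 4
```

The point of the sketch is the storage asymmetry: depth grows the gate list by one float per step, while the shared block is stored once no matter how often it is applied.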
Why this might work
Three separate literatures point in the same direction:
- Vocabulary Compression argues that logits and output memory become a bottleneck surprisingly early (Vennam et al., 2024)
- ReTok argues that tokenizer changes can be absorbed mostly through embeddings and the LM head rather than a full architecture reset (Gu et al., 2024)
- Relaxed Recursive Transformers suggests that saved bytes are often better spent on a stronger shared backbone than on many unique layers (Bae et al., 2024)
The new connection is that these are not separate decisions. If the head is one of the largest unique parameter sinks, reducing it may be the cleanest way to afford a better shared trunk.
Evidence threads
- Tokenizer and vocabulary efficiency frames the head as part of the artifact problem, not only a preprocessing choice.
- Training economics argues small models hit logits and sequence-length bottlenecks earlier than expected.
- Recursive and shared-parameter architectures give a natural place to spend recovered bytes: width, stability, or tiny per-step specialization.
What would falsify it
This idea should be considered weaker if any of the following happen at matched artifact sizes:
- the smaller or retargeted vocabulary increases sequence length enough to erase the gain
- head-size savings are smaller than expected after actual compression
- the wider shared trunk does not convert the freed bytes into lower post-roundtrip val_bpb
- most of the benefit comes from tokenizer-specific overfitting rather than a durable head-to-trunk exchange
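The first falsifier can be checked with simple accounting before any training run. A sketch with a fixed token context window and hypothetical tokens-per-byte ratios (both ratios would come from measuring the actual tokenizers):

```python
# Quantifying the first falsifier: a smaller vocabulary tokenizes the same
# bytes into more tokens, which shrinks effective context per window.
# All ratios here are illustrative assumptions, not measurements.

def effective_context_bytes(context_tokens: int, tokens_per_byte: float) -> float:
    """How many input bytes fit in a fixed token context window."""
    return context_tokens / tokens_per_byte

CONTEXT = 2048
base_tpb, retargeted_tpb = 0.25, 0.28   # hypothetical: ~12% token inflation

base_ctx = effective_context_bytes(CONTEXT, base_tpb)        # 8192.0 bytes
new_ctx = effective_context_bytes(CONTEXT, retargeted_tpb)   # ~7314.3 bytes

shrink = 1 - new_ctx / base_ctx   # ~10.7% less context per window
```

If the measured context shrink (and the matching per-byte compute inflation) outweighs the quality bought by the wider trunk, the hypothesis fails on this axis regardless of how the bytes were reinvested.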
Why it matters under the 16 MB cap
The head, embeddings, and tokenizer assets are among the most obvious places where a compact model stores large unique objects. Shared depth, by contrast, turns extra capacity into reusable computation.
If this hypothesis is right, a serious submission should stop asking only “how do we quantize the current head better?” and start asking “how many of those head bytes should exist at all?”
Related
- Output-head compression
- Recursive width scaling
- Tokenizer and vocabulary efficiency
- Recursive and shared-parameter architectures
- The LM head is part of the compression problem