Hypothesis
Most of the post-compression damage in the embedding / LM-head path may come from a small subset of vocabulary rows rather than being spread evenly across the matrix.
If so, the right move is not to protect the entire head, but to identify and preserve only rows associated with tokens that are simultaneously:
- high frequency or high loss contribution
- high entropy / easily confused in context
- unusually sensitive to quantization
Mechanism sketch
A disciplined version of the idea would:
- quantize the full embedding / head normally
- compute a token-row sensitivity score from validation loss change, row error, or gradient statistics
- keep only the top-ranked rows in a higher-precision residual path
- spend the saved bytes on a larger trunk, better norms, or simply more headroom under the cap
This is essentially pQuant's selective-precision logic applied at the vocabulary-row level rather than at the whole-tensor level.
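A minimal sketch of that loop, assuming a symmetric per-row int4 quantizer and frequency-weighted reconstruction error as the sensitivity score. Function names and shapes are illustrative; a real pass would substitute validation-loss deltas or gradient statistics for the score:

```python
import torch

def quantize_rows(W: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-row quantization, returned already dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (W / scale).round().clamp(-qmax - 1, qmax) * scale

def select_rescue_rows(W, W_hat, token_freq, k=256):
    """Rank rows by frequency-weighted reconstruction error."""
    row_err = (W - W_hat).pow(2).sum(dim=1)  # per-row squared damage
    return (row_err * token_freq).topk(k).indices

# Toy shapes; a real run uses the trained head and corpus unigram counts.
W = torch.randn(32_000, 512)
freq = torch.rand(32_000)
W_hat = quantize_rows(W, bits=4)

rescue = select_rescue_rows(W, W_hat, freq, k=256)
residual = (W - W_hat)[rescue].half()  # fp16 corrections, stored with indices
W_hat[rescue] += residual.float()      # applied once at load time
```

Only the residual tensor and the index list cost extra bytes in the artifact; the corrected matrix is reconstructed at load time.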
Why this might work
The supporting evidence comes from three directions:
- pQuant argues that parameter sensitivity is extremely uneven, so uniform precision is wasteful (Zhang et al., 2026)
- Vocabulary Compression argues the output side becomes a major burden early, so even small head savings are meaningful (Vennam et al., 2024)
- Beyond Text Compression argues tokenizer quality is not captured by compression ratio alone, implying some tokens matter much more than average (Lotz et al., 2025)
The new connection is that tokenizer difficulty and quantization sensitivity may align: if a small set of tokens already dominates predictive uncertainty, their rows may also be the ones worth rescuing.
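That alignment is testable before building anything, as a rank correlation between per-row quantization damage and per-token predictive entropy. A sketch with both inputs faked for shape; real values would come from the quantized head and held-out text:

```python
import torch

def spearman(x: torch.Tensor, y: torch.Tensor) -> float:
    """Rank correlation (ties ignored) between two 1-D tensors."""
    def ranks(v):
        r = torch.empty_like(v)
        r[v.argsort()] = torch.arange(v.numel(), dtype=v.dtype)
        return r
    a, b = ranks(x), ranks(y)
    a, b = a - a.mean(), b - b.mean()
    return (a @ b / (a.norm() * b.norm())).item()

# Hypothetical inputs: per-row quantization damage and per-token mean
# predictive entropy measured on held-out text.
row_err = torch.rand(32_000)
token_entropy = torch.rand(32_000)
print(spearman(row_err, token_entropy))
```

A strongly positive value would mean the hard tokens and the fragile rows largely coincide, which is exactly the condition the hypothesis needs.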
Evidence threads
- Tokenizer and vocabulary efficiency: vocabulary design is a first-order budget choice.
- Quantization and outliers: selective precision usually beats democratic precision when sensitivity is skewed.
- The LM head is part of the compression problem: head-side exceptions are more plausible than protecting arbitrary backbone tensors.
What would falsify it
This idea should be rejected if:
- quantization harm is diffuse across the vocabulary rather than concentrated in a small row set (see the concentration check after this list)
- the selected rows are unstable across data slices, making the rescue set brittle (see the overlap check after this list)
- row metadata and exception handling cost too many bytes relative to the gain
- protecting rows helps token-level loss but does not improve final bits-per-byte on the scored artifact
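The first two falsifiers are cheap to measure before committing any bytes. A sketch using squared row error as the damage proxy; a real test would score rows by per-token loss deltas on held-out slices:

```python
import torch

def damage_concentration(W, W_hat, top_frac=0.01):
    """Fraction of total quantization error carried by the worst rows.
    A value near top_frac means harm is diffuse (falsifier #1)."""
    row_err = (W - W_hat).pow(2).sum(dim=1)
    k = max(1, int(top_frac * row_err.numel()))
    return (row_err.topk(k).values.sum() / row_err.sum()).item()

def rescue_set_overlap(score_a, score_b, k=256):
    """Jaccard overlap of top-k rows scored on two disjoint data slices.
    Low overlap means the rescue set is brittle (falsifier #2)."""
    a = set(score_a.topk(k).indices.tolist())
    b = set(score_b.topk(k).indices.tolist())
    return len(a & b) / len(a | b)
```

If the top 1% of rows carries only about 1% of the error, harm is diffuse and the idea dies; if the overlap across slices is low, the rescue set is brittle and it dies too.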
Why it matters under the 16 MB cap
Protecting even a few hundred vocabulary rows can be dramatically cheaper than raising precision for the full head or embedding matrix.
That matters because the cap punishes broad generosity. If head damage is truly concentrated, row-level rescue could be one of the highest-leverage uses of leftover bytes in the entire submission.
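Back-of-envelope arithmetic makes the asymmetry concrete. The vocabulary size and head width below are hypothetical, picked only to be plausible under the cap:

```python
# Hypothetical shapes for a model living under a 16 MB cap.
vocab, dim = 16_384, 256

def mib(n: int) -> float:
    return n / 2**20

# Rescue path: fp16 residuals for 256 rows plus int32 row indices.
rescued = 256
rescue_bytes = rescued * dim * 2 + rescued * 4
print(f"row rescue:      {mib(rescue_bytes):.2f} MiB")  # ~0.13 MiB

# Alternative: lift the entire head from 4-bit to 8-bit uniformly.
uniform_bytes = vocab * dim * 4 // 8                    # 4 extra bits/weight
print(f"uniform upgrade: {mib(uniform_bytes):.2f} MiB") # ~2.00 MiB
```

Under these assumed shapes that is roughly a 16x gap for the same protective intent, which is the kind of difference that decides what the leftover bytes buy.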
Related
- Sparse outlier preservation
- Output-head compression
- Tokenizer and vocabulary efficiency
- Quantization and outliers
- pQuant