Sources: arXiv:2505.10202 · alphaXiv overview
Core contribution
VQ-Logits attacks the output bottleneck directly by replacing the full vocabulary-sized logits projection with a compact vector-quantized codebook. The key claim is that the output head can often be compressed more structurally than a plain low-rank or tied-embedding treatment suggests, because many vocabulary items can share a smaller predictive basis.
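A minimal sketch of the idea, with made-up sizes and a random code assignment (the paper learns the codebook and the token-to-code mapping; none of the names or dimensions below are from the paper): instead of a V×d output matrix, keep a small K×d codebook plus a V-entry index table, compute K inner products, and scatter them to vocabulary positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not the paper's settings.
d, V, K = 64, 50_000, 256   # hidden dim, vocab size, codebook size

codebook = rng.standard_normal((K, d)).astype(np.float32)  # K x d shared basis
code_of_token = rng.integers(0, K, size=V)                 # vocab -> code index

def vq_logits(h):
    """Logits for one hidden state h of shape (d,).

    Only K inner products are computed; every vocabulary item
    inherits the logit of its assigned code vector.
    """
    code_logits = codebook @ h           # (K,)
    return code_logits[code_of_token]    # scatter out to (V,)

h = rng.standard_normal(d).astype(np.float32)
logits = vq_logits(h)

# The head stores K*d floats plus V small indices,
# instead of V*d floats for a full projection.
```

The structural point is visible in the shapes: the matmul cost and the float storage scale with K, not V, and the only V-sized object left is an integer lookup table.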
Why this matters for Parameter Golf
This is one of the clearest papers on the shelf for output-head compression. It makes the output side feel like a real artifact-budget lever rather than a theoretical annoyance. In a tiny model, saving head bytes can be as meaningful as shaving another fraction of a bit from the trunk.
What to import
- The logits projection is a compressible structure, not just a fixed tax.
- A compact codebook may buy better head tradeoffs than naive low-rank factorization alone.
- Vocabulary semantics and output parameterization should be designed together.
What not to over-import
The paper does not prove that codebook-style output compression wins after all bookkeeping, mapping, and implementation overheads are counted inside a strict challenge artifact. The stable lesson is that the head deserves its own structured compression search, not that any specific VQ scheme is automatically byte-optimal.
Best synthesis links
- Directly extends the LM head budget note.
- Complements Vocabulary Compression for Low-Compute Environments by making the output bottleneck more explicit and more compressible.
- Pairs with The LM Head is a Gradient Bottleneck, which argues that the head is also an optimization bottleneck, not just a storage one.
Parameter Golf translation
VQ-Logits motivates asking:
- how many bytes are tied up in the head versus the trunk,
- whether head restructuring can buy more than another round of backbone compression,
- and whether tokenizer changes should be evaluated jointly with output-head compression schemes.
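The first two questions can be made concrete with back-of-envelope byte accounting. All sizes below are illustrative assumptions (hidden dim, vocab, rank, and codebook size are invented for the sketch), comparing a full head, a naive low-rank factorization, and a VQ-style codebook head at fp16:

```python
import math

# Hypothetical sizes for the sketch; none come from the paper.
d, V = 512, 32_000        # hidden dim, vocab size
r = 64                    # rank for a low-rank factorized head
K = 1024                  # codebook size for a VQ-style head
bytes_per_param = 2       # fp16

# Full head: V x d projection.
full_head = V * d * bytes_per_param

# Low-rank head: V x r and r x d factors.
low_rank = (V * r + r * d) * bytes_per_param

# VQ head: K x d codebook in fp16, plus a V-entry index table
# stored as packed log2(K)-bit code assignments.
vq_head = K * d * bytes_per_param + math.ceil(V * math.log2(K) / 8)

for name, b in [("full", full_head), ("low-rank", low_rank), ("vq", vq_head)]:
    print(f"{name:9s} {b / 1e6:6.2f} MB")
```

Under these assumptions the codebook head is the smallest of the three, which is exactly why the note above insists that the mapping-table and bookkeeping bytes be counted inside the artifact before declaring a winner.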
Related
- The LM Head is a Gradient Bottleneck
- Vocabulary Compression for Low-Compute Environments
- ALBERT
- The LM Head Is Part of the Compression Problem
- Output-head compression
- Tokenizer and vocabulary efficiency