Moonshot

Treat the LM head as a mostly derived object.

Instead of storing a full row for most vocabulary items, store:

  • shared lexical or semantic factors
  • token descriptors
  • a tiny row generator
  • explicit residual rows only for the most valuable tokens

Why this is outside the current prior

Current tokenizer/vocabulary work still mostly assumes the head is a table to shrink, tie, or quantize. This moonshot asks whether the head should be reconstructed on demand from a much smaller set of primitives.

Mechanism sketch

A token row could be generated from:

  • byte or morpheme factors
  • prototype family IDs
  • small correction scalars
  • optional residual rows for top-saliency tokens

That would shrink the stored head substantially and would force the head's structure into an explicitly compressible form.
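The row-generation idea above can be sketched in a few lines. This is a minimal illustration, not a proposed implementation: the factor table, prototype table, shapes, and the `generate_row` signature are all hypothetical stand-ins for whatever byte/morpheme factors and family IDs a real design would use.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8              # hidden size (toy)
n_factors = 16     # shared lexical/semantic factors
n_prototypes = 4   # prototype families

# Shared primitives: stored once, reused across the whole vocabulary.
factors = rng.normal(size=(n_factors, d)).astype(np.float32)
prototypes = rng.normal(size=(n_prototypes, d)).astype(np.float32)

def generate_row(factor_ids, proto_id, scale, residual=None):
    """Reconstruct one head row from shared primitives.

    factor_ids: indices of byte/morpheme factors for this token
    proto_id:   which prototype family the token belongs to
    scale:      small per-token correction scalar
    residual:   explicit correction row, stored only for high-saliency tokens
    """
    row = scale * (factors[factor_ids].mean(axis=0) + prototypes[proto_id])
    if residual is not None:
        row = row + residual
    return row

# An ordinary token's descriptor is a few integers and one float,
# instead of d explicit floats.
row = generate_row(factor_ids=[2, 5, 11], proto_id=1, scale=0.9)
assert row.shape == (d,)
```

The point of the sketch is the storage asymmetry: only the handful of shared tables plus per-token descriptors persist, and full rows exist only transiently (or explicitly, for the saliency head of the distribution).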

Why it might matter for Parameter Golf

The output head can be one of the steepest byte sinks in compact models. If most rows are structurally redundant, storing them all explicitly may be one of the worst bargains in the artifact.
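Some back-of-envelope arithmetic makes the byte-sink claim concrete. The vocabulary size, hidden size, and precision below are illustrative assumptions, not figures from the text:

```python
# Assumed: 32k-token vocab, hidden size 2048, fp16 (2 bytes/param).
vocab, hidden, bytes_per_param = 32_768, 2_048, 2
head_bytes = vocab * hidden * bytes_per_param
print(f"{head_bytes:,} bytes")  # 134,217,728 bytes, i.e. 128 MiB
```

At these (assumed) sizes the head alone is 128 MiB; against a compact model of a few hundred million fp16 parameters, that is a double-digit percentage of the whole artifact spent on one table.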

Cheapest falsifier

  • regenerate only a subset of low-frequency rows first
  • keep top-saliency rows explicit
  • compare equal-byte tradeoff versus ordinary head quantization

Kill it if regenerated rows degrade calibration too broadly, or if the generator's own parameter and compute cost erases the byte savings.
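A toy version of this falsifier fits in a short script. Here a low-rank factorization stands in for the row generator (a real generator would use byte or morpheme factors instead), saliency is a random stand-in for token frequency or value, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d, rank = 256, 32, 4

head = rng.normal(size=(vocab, d)).astype(np.float32)
saliency = rng.random(vocab)             # stand-in for frequency/value
top = np.argsort(saliency)[-32:]         # keep these rows explicit
tail = np.setdiff1d(np.arange(vocab), top)

# "Generator": best rank-r approximation, fit on the tail rows only.
u, s, vt = np.linalg.svd(head[tail], full_matrices=False)
regen = (u[:, :rank] * s[:rank]) @ vt[:rank]

# Equal-byte bookkeeping: explicit rows + factorized tail vs the full table.
stored = top.size * d + tail.size * rank + rank * d
full = vocab * d
err = np.linalg.norm(head[tail] - regen) / np.linalg.norm(head[tail])
print(f"params: {stored}/{full}, tail relative error: {err:.3f}")
```

In a real run the comparison would be against an ordinary quantized head at the same byte budget, and the metric would be calibration and task quality rather than row reconstruction error; the script only shows the shape of the experiment.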

What would make it real

  • explicit rows needed only for a small saliency tail
  • strong quality retention at a sharply lower head byte budget
  • tokenizer/head co-design that benefits from, rather than conflicts with, the approach