Moonshot

Treat the LM head as a mostly derived object.

Instead of storing a full row for most vocabulary items, store:

  • shared lexical or semantic factors
  • token descriptors
  • a tiny row generator
  • explicit residual rows only for the most valuable tokens

Why this is outside the current prior

Current tokenizer/vocabulary work still mostly assumes the head is a table to shrink, tie, or quantize. This moonshot asks whether the head should be reconstructed on demand from a much smaller set of primitives.

Mechanism sketch

A token row could be generated from:

  • byte or morpheme factors
  • prototype family IDs
  • small correction scalars
  • optional residual rows for top-saliency tokens

That would shrink the stored head substantially and would force the head's structure into an explicitly compressible form.
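The row-generation idea above can be sketched in a few lines. This is a minimal illustration, not a proposed implementation: the factor table, prototype table, shapes, and the `generate_row` signature are all hypothetical stand-ins for whatever byte/morpheme factors and family IDs a real design would use.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8              # hidden size (toy)
n_factors = 16     # shared lexical/semantic factors
n_prototypes = 4   # prototype families

# Shared primitives: stored once, reused across the whole vocabulary.
factors = rng.normal(size=(n_factors, d)).astype(np.float32)
prototypes = rng.normal(size=(n_prototypes, d)).astype(np.float32)

def generate_row(factor_ids, proto_id, scale, residual=None):
    """Reconstruct one head row from shared primitives.

    factor_ids: indices of byte/morpheme factors for this token
    proto_id:   which prototype family the token belongs to
    scale:      small per-token correction scalar
    residual:   explicit correction row, stored only for high-saliency tokens
    """
    row = scale * (factors[factor_ids].mean(axis=0) + prototypes[proto_id])
    if residual is not None:
        row = row + residual
    return row

# An ordinary token's descriptor is a few integers and one float,
# instead of d explicit floats.
row = generate_row(factor_ids=[2, 5, 11], proto_id=1, scale=0.9)
assert row.shape == (d,)
```

The point of the sketch is the storage asymmetry: only the handful of shared tables plus per-token descriptors persist, and full rows exist only transiently (or explicitly, for the saliency head of the distribution).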

Why it might matter for Parameter Golf

The output head can be one of the steepest byte sinks in compact models. If most rows are structurally redundant, storing them all explicitly may be one of the worst bargains in the artifact.
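Some back-of-envelope arithmetic makes the byte-sink claim concrete. The vocabulary size, hidden size, and precision below are illustrative assumptions, not figures from the text:

```python
# Assumed: 32k-token vocab, hidden size 2048, fp16 (2 bytes/param).
vocab, hidden, bytes_per_param = 32_768, 2_048, 2
head_bytes = vocab * hidden * bytes_per_param
print(f"{head_bytes:,} bytes")  # 134,217,728 bytes, i.e. 128 MiB
```

At these (assumed) sizes the head alone is 128 MiB; against a compact model of a few hundred million fp16 parameters, that is a double-digit percentage of the whole artifact spent on one table.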

Cheapest falsifier

  • regenerate only a subset of low-frequency rows first
  • keep top-saliency rows explicit
  • compare equal-byte tradeoff versus ordinary head quantization

Kill it if regenerated rows degrade calibration too broadly, or if the generator's own parameter and compute cost erases the byte savings.
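A toy version of this falsifier fits in a short script. Here a low-rank factorization stands in for the row generator (a real generator would use byte or morpheme factors instead), saliency is a random stand-in for token frequency or value, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d, rank = 256, 32, 4

head = rng.normal(size=(vocab, d)).astype(np.float32)
saliency = rng.random(vocab)             # stand-in for frequency/value
top = np.argsort(saliency)[-32:]         # keep these rows explicit
tail = np.setdiff1d(np.arange(vocab), top)

# "Generator": best rank-r approximation, fit on the tail rows only.
u, s, vt = np.linalg.svd(head[tail], full_matrices=False)
regen = (u[:, :rank] * s[:rank]) @ vt[:rank]

# Equal-byte bookkeeping: explicit rows + factorized tail vs the full table.
stored = top.size * d + tail.size * rank + rank * d
full = vocab * d
err = np.linalg.norm(head[tail] - regen) / np.linalg.norm(head[tail])
print(f"params: {stored}/{full}, tail relative error: {err:.3f}")
```

In a real run the comparison would be against an ordinary quantized head at the same byte budget, and the metric would be calibration and task quality rather than row reconstruction error; the script only shows the shape of the experiment.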

What would make it real

  • explicit rows needed only for a small saliency tail
  • strong quality retention at a sharply lower head byte budget
  • tokenizer/head co-design that benefits from, rather than conflicts with, the approach