Moonshot
Let one shared block behave like many conceptual layers by carrying a tiny persistent role state across passes.
Instead of storing distinct per-layer parameters, store a tiny internal program state that marks which role the current pass plays, e.g.:
- lexical pass
- structural pass
- consolidation pass
- logits-sharpening pass
Why this is outside the current prior
Current shared-depth work usually keeps specialization in small stored adapters, LoRA slices, or norm scales. This moonshot moves specialization into dynamic state, not static parameter deltas.
That is a stronger departure from the normal transformer prior.
Mechanism sketch
- one shared backbone block
- one tiny recurrent role memory
- role memory updates every pass
- block reads token state + role state together
- maybe only tiny norm/gating parameters are stored explicitly
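The mechanism above can be sketched in a few lines. This is a minimal toy, not an implementation: the sizes, the tanh nonlinearity, and the mean-pooled summary feeding the role update are all placeholder assumptions; the point is only the wiring (one shared weight block, one tiny recurrent role memory that the block reads alongside token state).

```python
import numpy as np

rng = np.random.default_rng(0)

D, R, N_PASSES = 64, 8, 4  # token dim, role-state dim, pass count (hypothetical sizes)

# One shared backbone block: a single weight matrix reused on every pass.
W_block = rng.standard_normal((D + R, D)) / np.sqrt(D + R)
# Tiny recurrent role memory: its own small update matrix.
W_role = rng.standard_normal((R + D, R)) / np.sqrt(R + D)

def forward(tokens: np.ndarray) -> np.ndarray:
    """Run the same block N_PASSES times, threading a role state through."""
    role = np.zeros(R)  # persistent role state, reset once per sequence
    x = tokens
    for _ in range(N_PASSES):
        # Block reads token state + role state together.
        joint = np.concatenate(
            [x, np.broadcast_to(role, (x.shape[0], R))], axis=-1
        )
        x = x + np.tanh(joint @ W_block)  # residual update with shared weights
        # Role memory updates every pass from a pooled summary of the tokens.
        summary = x.mean(axis=0)
        role = np.tanh(np.concatenate([role, summary]) @ W_role)
    return x

out = forward(rng.standard_normal((10, D)))
```

Note that the only stored objects are `W_block` and `W_role`; everything role-specific lives in the `role` vector, which exists only at evaluation time.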
Why it might matter for Parameter Golf
If specialization can live in state transitions instead of stored unique weights, then the artifact can stay tiny while effective depth behavior remains diverse.
This is especially attractive when evaluation-time compute is cheaper than permanently storing more unique layers.
Cheapest falsifier
- inspect whether repeated passes actually separate into distinct role behavior
- test whether role state collapses into one mode
- compare against fixed pass-index conditioning
Kill it if role state adds complexity without producing distinct useful phases.
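The collapse test can be made concrete with a cheap diagnostic: log the role state after each pass and measure mean pairwise cosine similarity. This is a sketch under the assumption that role states are plain vectors; the threshold for "collapsed" is a judgment call, not part of the idea.

```python
import numpy as np

def role_collapse_score(role_states: np.ndarray, eps: float = 1e-8) -> float:
    """Mean pairwise cosine similarity of per-pass role states.

    role_states: (n_passes, role_dim) array logged during one forward run.
    Near 1.0 -> the role state has collapsed into one mode; lower values
    suggest the passes occupy genuinely distinct roles.
    """
    norms = np.linalg.norm(role_states, axis=1, keepdims=True)
    normed = role_states / (norms + eps)
    sims = normed @ normed.T
    n = len(role_states)
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(off_diag.mean())

# Synthetic extremes for illustration:
# collapsed: every pass lands on (almost) the same state.
collapsed = np.ones((4, 8)) + 1e-3 * np.random.default_rng(0).standard_normal((4, 8))
# differentiated: orthogonal one-hot roles.
distinct = np.eye(4, 8)
```

Running the same score against the fixed pass-index-conditioning baseline gives the comparison in the third bullet for free: if learned role states separate no better than a hard-coded pass index, the recurrent memory is not earning its complexity.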
What would make it real
- clear behavioral differentiation across passes
- better post-roundtrip quality than static shared-depth at equal bytes
- only tiny extra stored state machinery