Moonshot

Let one shared block behave like many conceptual layers by carrying a tiny persistent role state across passes.

Instead of storing distinct layer parameters, store a tiny internal program state that tells the shared block which role it is currently playing, e.g.:

  • lexical pass
  • structural pass
  • consolidation pass
  • logits-sharpening pass

Why this is outside the current prior

Current shared-depth work usually keeps specialization in small stored adapters, LoRA slices, or norm scales. This moonshot moves specialization into dynamic state, not static parameter deltas.

That is a stronger departure from the normal transformer prior.

Mechanism sketch

  • one shared backbone block
  • one tiny recurrent role memory
  • role memory updates every pass
  • block reads token state + role state together
  • maybe only tiny norm/gating parameters are stored explicitly
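The mechanism above can be sketched concretely. This is a minimal NumPy illustration, not a proposed implementation: the sizes, the gating read-out `G`, and the mean-pooled role update are all assumptions chosen for compactness. One weight matrix `W` is reused every pass; only the tiny role state `r` (plus the small `U`, `V`, `G` machinery) differentiates the passes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: token width D, role-state width R (R << D).
D, R = 16, 4

# One shared backbone block: a single weight matrix reused every pass.
W = rng.normal(0, 0.1, (D, D))

# Tiny stored machinery: role-state recurrence and a gating read-out.
U = rng.normal(0, 0.5, (R, R))   # role state -> next role state
V = rng.normal(0, 0.5, (R, D))   # pooled token state -> role update
G = rng.normal(0, 0.5, (D, R))   # role state -> per-channel gate

def pass_step(x, r):
    """One pass of the shared block, conditioned on role state r.

    x: (T, D) token states; r: (R,) role state.
    The block weights never change; only r evolves between passes.
    """
    gate = 1.0 / (1.0 + np.exp(-(G @ r)))          # role-dependent channel gate
    h = np.tanh(x @ W.T) * gate                    # shared block, gated per pass
    r_next = np.tanh(U @ r + V @ h.mean(axis=0))   # recurrent role-memory update
    return x + h, r_next                           # residual token update

x = rng.normal(0, 1, (8, D))
r = np.zeros(R)
roles = []
for _ in range(4):                                 # four passes through one block
    x, r = pass_step(x, r)
    roles.append(r.copy())
```

Note the byte accounting this implies: the stored artifact is `W` plus the small `U`, `V`, `G` matrices, while all pass-to-pass specialization lives in the runtime trajectory of `r`.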

Why it might matter for Parameter Golf

If specialization can live in state transitions instead of stored unique weights, then the artifact can stay tiny while effective depth behavior remains diverse.

This is especially attractive when evaluation-time compute is cheaper than permanently storing more unique layers.

Cheapest falsifier

  • inspect whether repeated passes actually separate into distinct role behavior
  • test whether role state collapses into one mode
  • compare against fixed pass-index conditioning

Kill it if role state adds complexity without producing distinct useful phases.
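The collapse check is cheap to run: collect the role-state vector after each pass and measure whether the passes stay separated or converge to one mode. A minimal sketch, where `mean_pairwise_cosine` and the 0.95 cut-off are hypothetical choices, not tuned values:

```python
import numpy as np

def mean_pairwise_cosine(states):
    """Mean pairwise cosine similarity across per-pass role states.

    states: (P, R) array, one role-state vector per pass.
    Values near 1.0 indicate the passes collapsed into one mode.
    """
    normed = states / np.linalg.norm(states, axis=1, keepdims=True)
    sims = normed @ normed.T
    iu = np.triu_indices(len(states), k=1)  # upper triangle: distinct pairs only
    return float(sims[iu].mean())

# Hypothetical trajectories: one differentiated, one collapsed.
differentiated = np.eye(4)  # four orthogonal role states: fully separated
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1)) + 0.01 * np.arange(4)[:, None]

collapse_threshold = 0.95   # assumed cut-off for declaring mode collapse
diverse_score = mean_pairwise_cosine(differentiated)
collapsed_score = mean_pairwise_cosine(collapsed)
```

The same statistic works for the pass-index-conditioning baseline: if fixed conditioning separates the passes as well as the recurrent role state does, the extra state machinery is not earning its complexity.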

What would make it real

  • clear behavioral differentiation across passes
  • better post-roundtrip quality than static shared-depth at equal bytes
  • only tiny extra stored state machinery