Hypothesis
A compact recurrent model may outperform a larger static artifact if it can allocate extra shared-block passes only to uncertain tokens or positions, rather than applying the same depth everywhere.
This is stricter than generic iterative refinement because it predicts that where extra compute is spent matters as much as how much is spent.
Mechanism sketch
A testable version would use:
- one shared or mostly shared backbone block
- a cheap uncertainty signal such as logit margin, entropy, or residual norm
- one bounded extra refinement pass for only the hardest positions or segments
- hard caps on extra compute so runtime remains challenge-legal
The idea is basically “sparse MoE routing, but over extra passes in time rather than extra parameters”: route the hardest tokens to another trip through the shared block instead of to another expert.
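The mechanism can be sketched in a few lines. This is a minimal illustration, not a tested implementation: `shared_block` and `head` stand in for the reused backbone block and output head, `budget_frac` is an assumed knob for the hard compute cap, and entropy is used as the uncertainty signal.

```python
import numpy as np

def entropy(logits):
    # Softmax entropy per position: a cheap uncertainty signal.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def refine_uncertain(hidden, logits, shared_block, head, budget_frac=0.25):
    """One bounded extra pass of the shared block, for the hardest positions only.

    budget_frac hard-caps how many positions get the extra pass, so the
    worst-case runtime stays bounded (challenge-legal).
    """
    seq_len = hidden.shape[0]
    k = max(1, int(budget_frac * seq_len))      # hard cap on extra compute
    hardest = np.argsort(entropy(logits))[-k:]  # top-k most uncertain positions
    refined = hidden.copy()
    refined[hardest] = shared_block(hidden[hardest])  # reuse the same weights
    new_logits = logits.copy()
    new_logits[hardest] = head(refined[hardest])
    return refined, new_logits
```

Logit margin or residual norm would slot into the same place as `entropy`; the routing rule itself is just a sort and a cap.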
Why this might work
This page combines evidence that usually lives in different lanes:
- Inference Scaling Laws suggests more test-time compute can substitute for some stored capability (Wu et al., 2024)
- MoEUT suggests sparse extra capacity can make shared-depth models much more competitive (Csordás et al., 2024)
- Computational Bottlenecks of Training SLMs suggests compact models should care about where compute is actually spent, not just total FLOPs (Ashkboos et al., 2024)
The new connection is that recurrence provides the reusable block, while uncertainty routing decides where extra passes are worth paying for.
Evidence threads
- Evaluation-time compute and inference scaling already frames compute as a substitute for stored bytes.
- Recursive and shared-parameter architectures provides the natural shared block to reuse.
- Compute-for-storage exchange makes the budget argument explicit.
What would falsify it
This idea should lose priority if:
- simple uniform extra depth beats token-adaptive refinement once wall-clock is controlled
- uncertainty estimates are too noisy to identify where extra passes help
- routing logic or masking overhead cancels the compute savings
- the improvement appears only on cherry-picked examples rather than realistic evaluation mixes
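The first falsifier implies a concrete harness: run both variants, match wall-clock, compare quality. A minimal sketch, where `uniform_fn`, `adaptive_fn`, and `score` are hypothetical stand-ins for the two model variants and the evaluation metric:

```python
import time

def median_wall_clock(fn, x, repeats=5):
    # Median over several runs: crude, but resistant to one-off timing outliers.
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

def compare_at_matched_time(uniform_fn, adaptive_fn, inputs, score):
    """Report (wall-clock, quality) for each variant.

    The adaptive idea loses priority if, at roughly equal wall-clock,
    uniform extra depth scores at least as well.
    """
    return {
        "uniform": (median_wall_clock(uniform_fn, inputs),
                    score(uniform_fn(inputs))),
        "adaptive": (median_wall_clock(adaptive_fn, inputs),
                     score(adaptive_fn(inputs))),
    }
```

The point of measuring wall-clock rather than FLOPs is that routing logic and masking overhead (the third falsifier) show up in the former but not the latter.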
Why it matters under the 16 MB cap
The main attraction is that it spends almost no extra bytes. The artifact stores one compact recurrent core and a tiny routing rule, then buys extra quality with bounded inference-time computation.
If the challenge cap keeps squeezing stored parameters harder than runtime, this kind of targeted compute may become more attractive than another round of clever weight packing.
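To make the byte argument concrete, a back-of-envelope accounting (every count here is an assumption for illustration, not a measured artifact):

```python
# Hypothetical byte accounting under the 16 MB artifact cap.
CAP_BYTES = 16 * 1024 * 1024

core_params = 3_900_000   # assumed size of the compact recurrent core
bytes_per_param = 4       # fp32 storage (assumed; quantization would shrink this)
core_bytes = core_params * bytes_per_param

# The routing rule stores only an uncertainty threshold and a budget fraction:
routing_bytes = 2 * 4     # two fp32 scalars

total = core_bytes + routing_bytes
overhead = routing_bytes / CAP_BYTES  # the routing rule is byte-free in practice
```

Under these assumptions the routing rule costs 8 bytes against a ~16.8 MB cap: the quality it buys comes entirely from bounded inference-time compute, not from stored parameters.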
Related
- Iterative refinement over stored depth
- Recurrent wide architecture
- Evaluation-time compute and inference scaling
- Recursive and shared-parameter architectures
- Compute-for-storage exchange