Hypothesis

A compact recurrent model may outperform a larger static artifact if it can allocate extra shared-block passes only to uncertain tokens or positions, rather than applying the same depth everywhere.

This is stricter than generic iterative refinement because it predicts that where extra compute is spent matters as much as how much is spent.

Mechanism sketch

A testable version would use:

  • one shared or mostly shared backbone block
  • a cheap uncertainty signal such as logit margin, entropy, or residual norm
  • one bounded extra refinement pass for only the hardest positions or segments
  • hard caps on extra compute so runtime remains challenge-legal
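The four ingredients above fit in a few lines. A minimal sketch, with everything hypothetical scaffolding: `shared_block` stands in for the real backbone block, entropy over softmaxed logits is the cheap uncertainty signal, and `budget_frac` is the hard cap that keeps the extra compute bounded.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Cheap per-position uncertainty signal (nats).
    return -sum(p * math.log(p) for p in probs if p > 0)

def shared_block(state):
    # Stand-in for the shared backbone block: a state -> state map
    # that reuses the same stored weights on every pass.
    return [0.5 * x + 0.1 for x in state]

def refine_uncertain(logits_per_pos, states, budget_frac=0.25):
    """Apply one bounded extra pass of the shared block to only the
    highest-entropy positions. budget_frac is the hard compute cap."""
    ent = [entropy(softmax(l)) for l in logits_per_pos]
    k = max(1, int(budget_frac * len(states)))            # runtime stays bounded
    hardest = sorted(range(len(states)), key=lambda i: -ent[i])[:k]
    for i in hardest:
        states[i] = shared_block(states[i])               # extra pass, zero extra bytes
    return states, sorted(hardest)

logits = [[5.0, 0.0, 0.0],    # confident
          [0.1, 0.0, 0.05],   # near-uniform logits -> uncertain
          [4.0, 0.1, 0.0],
          [3.0, 0.2, 0.1]]
states, chosen = refine_uncertain(logits, [[1.0] for _ in range(4)])
print(chosen)  # only the near-uniform position gets the extra pass
```

Note the routing decision itself is a sort plus an index set, which is the overhead that falsification criterion 3 below worries about; in a real implementation it would be a batched top-k and a masked forward pass.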

The idea is basically “sparse MoE thinking, but in time instead of parameter count.”

Why this might work

This page combines evidence that usually lives in different lanes.

The new connection is that recurrence provides the reusable block, while uncertainty routing decides where extra passes are worth paying for.

Evidence threads

What would falsify it

This idea should lose priority if:

  1. simple uniform extra depth beats token-adaptive refinement once wall-clock time is controlled
  2. uncertainty estimates are too noisy to identify where extra passes help
  3. routing logic or masking overhead cancels the compute savings
  4. the improvement appears only on cherry-picked examples rather than realistic evaluation mixes

Why it matters under the 16 MB cap

The main attraction is that it spends almost no extra bytes. The artifact stores one compact recurrent core and a tiny routing rule, then buys extra quality with bounded inference-time computation.
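A back-of-envelope sketch of that claim. The dimensions and fp16 storage are illustrative assumptions, not challenge numbers, and the routing rule's bytes are treated as negligible: sharing one block across all passes divides stored parameters by the depth.

```python
def block_params(d_model, d_ff):
    # Weight matrices of one transformer block, biases/norms omitted:
    # four d_model x d_model attention projections + a two-matrix MLP.
    return 4 * d_model * d_model + 2 * d_model * d_ff

d_model, d_ff, depth = 512, 2048, 8      # illustrative sizes only
static_bytes = depth * block_params(d_model, d_ff) * 2   # fp16: 2 bytes/param
shared_bytes = 1 * block_params(d_model, d_ff) * 2       # one reused block

print(f"distinct layers: {static_bytes / 2**20:.0f} MiB")  # 48 MiB
print(f"shared block:    {shared_bytes / 2**20:.0f} MiB")  # 6 MiB
```

Under these toy numbers the eight-layer static stack is far over a 16 MB cap while the shared block fits with room to spare, which is the whole attraction.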

If the challenge cap keeps squeezing stored parameters harder than runtime, this kind of targeted compute may become more attractive than another round of clever weight packing.

References

Ashkboos, S., Mirzadeh, I., Alizadeh, K., Sekhavat, M. H., Nabi, M., Farajtabar, M., & Faghri, F. (2024). Computational Bottlenecks of Training Small-scale Large Language Models. arXiv Preprint arXiv:2410.19456. https://arxiv.org/abs/2410.19456
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Wu, Y., Sun, Z., Li, S., Welleck, S., & Yang, Y. (2024). Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models. arXiv Preprint arXiv:2408.00724. https://arxiv.org/abs/2408.00724