Hypothesis
A compact recurrent model may outperform a larger static artifact if it can allocate extra shared-block passes only to uncertain tokens or positions, rather than applying the same depth everywhere.
This is stricter than generic iterative refinement because it predicts that where extra compute is spent matters as much as how much is spent.
Mechanism sketch
A testable version would use:
- one shared or mostly shared backbone block
- a cheap uncertainty signal such as logit margin, entropy, or residual norm
- one bounded extra refinement pass for only the hardest positions or segments
- hard caps on extra compute so runtime remains challenge-legal
The idea is basically “sparse MoE routing, but over extra passes in time rather than extra parameters”: route the hardest tokens to another trip through the shared block instead of to another expert.
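The mechanism can be sketched in a few lines. This is a minimal illustration, not a tested implementation: `shared_block` and `head` stand in for the reused backbone block and output head, `budget_frac` is an assumed knob for the hard compute cap, and entropy is used as the uncertainty signal.

```python
import numpy as np

def entropy(logits):
    # Softmax entropy per position: a cheap uncertainty signal.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def refine_uncertain(hidden, logits, shared_block, head, budget_frac=0.25):
    """One bounded extra pass of the shared block, for the hardest positions only.

    budget_frac hard-caps how many positions get the extra pass, so the
    worst-case runtime stays bounded (challenge-legal).
    """
    seq_len = hidden.shape[0]
    k = max(1, int(budget_frac * seq_len))      # hard cap on extra compute
    hardest = np.argsort(entropy(logits))[-k:]  # top-k most uncertain positions
    refined = hidden.copy()
    refined[hardest] = shared_block(hidden[hardest])  # reuse the same weights
    new_logits = logits.copy()
    new_logits[hardest] = head(refined[hardest])
    return refined, new_logits
```

Logit margin or residual norm would slot into the same place as `entropy`; the routing rule itself is just a sort and a cap.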
Why this might work
This page combines evidence that usually lives in different lanes:
- Inference Scaling Laws suggests more test-time compute can substitute for some stored capability (Wu et al., 2024)
- MoEUT suggests sparse extra capacity can make shared-depth models much more competitive (Csordás et al., 2024)
- Computational Bottlenecks of Training SLMs suggests compact models should care about where compute is actually spent, not just total FLOPs (Ashkboos et al., 2024)
The new connection is that recurrence provides the reusable block, while uncertainty routing decides where extra passes are worth paying for.
Evidence threads
- Evaluation-time compute and inference scaling already frames compute as a substitute for stored bytes.
- Recursive and shared-parameter architectures provides the natural shared block to reuse.
- Compute-for-storage exchange makes the budget argument explicit.
What would falsify it
This idea should lose priority if:
- simple uniform extra depth beats token-adaptive refinement once wall-clock is controlled
- uncertainty estimates are too noisy to identify where extra passes help
- routing logic or masking overhead cancels the compute savings
- the improvement appears only on cherry-picked examples rather than realistic evaluation mixes
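The first falsifier implies a concrete harness: run both variants, match wall-clock, compare quality. A minimal sketch, where `uniform_fn`, `adaptive_fn`, and `score` are hypothetical stand-ins for the two model variants and the evaluation metric:

```python
import time

def median_wall_clock(fn, x, repeats=5):
    # Median over several runs: crude, but resistant to one-off timing outliers.
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

def compare_at_matched_time(uniform_fn, adaptive_fn, inputs, score):
    """Report (wall-clock, quality) for each variant.

    The adaptive idea loses priority if, at roughly equal wall-clock,
    uniform extra depth scores at least as well.
    """
    return {
        "uniform": (median_wall_clock(uniform_fn, inputs),
                    score(uniform_fn(inputs))),
        "adaptive": (median_wall_clock(adaptive_fn, inputs),
                     score(adaptive_fn(inputs))),
    }
```

The point of measuring wall-clock rather than FLOPs is that routing logic and masking overhead (the third falsifier) show up in the former but not the latter.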
Why it matters under the 16 MB cap
The main attraction is that it spends almost no extra bytes. The artifact stores one compact recurrent core and a tiny routing rule, then buys extra quality with bounded inference-time computation.
If the challenge cap keeps squeezing stored parameters harder than runtime, this kind of targeted compute may become more attractive than another round of clever weight packing.
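To make the byte argument concrete, a back-of-envelope accounting (every count here is an assumption for illustration, not a measured artifact):

```python
# Hypothetical byte accounting under the 16 MB artifact cap.
CAP_BYTES = 16 * 1024 * 1024

core_params = 3_900_000   # assumed size of the compact recurrent core
bytes_per_param = 4       # fp32 storage (assumed; quantization would shrink this)
core_bytes = core_params * bytes_per_param

# The routing rule stores only an uncertainty threshold and a budget fraction:
routing_bytes = 2 * 4     # two fp32 scalars

total = core_bytes + routing_bytes
overhead = routing_bytes / CAP_BYTES  # the routing rule is byte-free in practice
```

Under these assumptions the routing rule costs 8 bytes against a ~16.8 MB cap: the quality it buys comes entirely from bounded inference-time compute, not from stored parameters.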
Related
- Iterative refinement over stored depth
- Recurrent wide architecture
- Evaluation-time compute and inference scaling
- Recursive and shared-parameter architectures
- Compute-for-storage exchange