(David-Hay & Wolf, 2024)

Sources: arXiv:2401.12819 · alphaXiv overview

Core contribution

Dynamic Layer Tying trains a controller that repeatedly decides whether each layer should remain independent or copy the weights of an earlier layer. The important claim is not just that sharing can work, but that the pattern of which layers deserve uniqueness can be discovered during training instead of fixed by hand.
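The core data structure can be sketched as a "tying map": for each layer, an index pointing at the (earlier or same) layer whose weights it uses. The names below are illustrative, not the paper's implementation, and the controller that chooses the map is omitted; this only shows how a tying decision collapses the set of unique layers.

```python
# Minimal sketch: tie[i] is the index of the layer whose weights layer i uses;
# tie[i] == i means the layer keeps its own independent parameters.
# Illustrative names only -- not the paper's actual code.

def resolve(tie, i):
    """Follow the tying map until reaching a layer that points at itself."""
    while tie[i] != i:
        i = tie[i]
    return i

def effective_layers(tie):
    """Indices of layers that actually own parameters under this tying map."""
    # A valid map only ever points backward (or at itself).
    assert all(j <= i for i, j in enumerate(tie))
    return sorted({resolve(tie, i) for i in range(len(tie))})

# 8-layer stack: layers 0 and 1 stay untied, layers 2-7 all reuse layer 2.
tie = [0, 1, 2, 2, 2, 2, 2, 2]
print(effective_layers(tie))  # → [0, 1, 2]  (3 unique layers instead of 8)
```

The controller in the paper revisits these decisions during training, so the map is a moving target rather than a fixed architecture choice.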

Why this matters for Parameter Golf

The most interesting implication is that depth uniqueness may be much sparser than we assume. If only a small minority of layers truly need their own parameters, then a hard-cap artifact may benefit more from a learned or adaptive tying pattern than from paying for a fully unique stack.

What to import

  • Full uniqueness is probably overbought. Many layers may contribute little that could not be recovered by reuse.
  • The tying pattern is itself an optimization target. We should not assume equally spaced sharing is optimal.
  • Shared training can regularize. Reuse is not only a memory trick; it can also improve optimization.
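Treating the tying pattern as an optimization target can be made concrete even without the paper's RL controller. As a lighter stand-in, the sketch below does random search over tying maps under a hard parameter budget; `score` is a placeholder for a real proxy (e.g. validation loss after a short training run), and all names are assumptions for illustration.

```python
# Random search over tying maps under a hard cap on unique layers.
# The paper uses an RL controller trained alongside the network; this is a
# deliberately simpler stand-in to show the search space, not the method.

import random

def random_tie(n_layers, rng):
    """Each layer either stays independent or points at a random earlier layer."""
    return [rng.randint(0, i) for i in range(n_layers)]

def _resolve(tie, i):
    while tie[i] != i:
        i = tie[i]
    return i

def unique_count(tie):
    return len({_resolve(tie, i) for i in range(len(tie))})

def score(tie):
    # Placeholder quality proxy: pretend early-layer uniqueness matters most.
    return sum(1.0 / (i + 1) for i in range(len(tie)) if tie[i] == i)

def search(n_layers=12, budget=4, trials=200, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        tie = random_tie(n_layers, rng)
        if unique_count(tie) <= budget:  # respect the hard parameter cap
            if best is None or score(tie) > score(best):
                best = tie
    return best
```

Swapping `score` for a short-training-run proxy turns this into exactly the kind of local loop the next section says we can iterate quickly.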

What not to over-import

The paper trains from scratch with a dynamic controller, which is heavier machinery than the kind of local experiment loop we can iterate quickly. Its exact RL apparatus is less likely to transfer than the broader lesson that adaptive sharing patterns may outperform fixed, hand-designed repetition.

Parameter Golf translation

A practical translation is to search not only over “shared vs unshared,” but over where uniqueness is most valuable:

  • a few untied early layers
  • a mostly shared middle trunk
  • tiny late-stage specialization

That structure could be cheaper than either a fully unique stack or a brutally uniform recurrent core.
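The savings from that three-zone structure are easy to put on the back of an envelope. The per-layer size below is an assumption (roughly a GPT-2-medium-like block); only the ratios between the three patterns matter.

```python
# Back-of-envelope unique-parameter counts for the three patterns above.
# PER_LAYER is an illustrative assumption (~12.6M params per transformer block);
# the comparison, not the absolute numbers, is the point.

PER_LAYER = 12_600_000
DEPTH = 24

def params(unique_layers):
    return unique_layers * PER_LAYER

fully_unique = params(DEPTH)        # every layer owns its weights
uniform_core = params(1)            # one block recurred 24 times
zoned        = params(2 + 1 + 1)    # 2 untied early + 1 shared trunk + 1 late

for name, p in [("fully unique", fully_unique),
                ("uniform core", uniform_core),
                ("zoned tying", zoned)]:
    print(f"{name:>13}: {p / 1e6:7.1f}M unique params")
```

The zoned pattern buys early-layer and late-layer specialization for a small multiple of the uniform core's budget, while staying far below the fully unique stack.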

David-Hay, T., & Wolf, L. (2024). Dynamic Layer Tying for Parameter-Efficient Transformers. arXiv preprint arXiv:2401.12819. https://arxiv.org/abs/2401.12819