(David-Hay & Wolf, 2024)

Sources: arXiv:2401.12819 · alphaXiv overview

Core contribution

Dynamic Layer Tying trains a controller that repeatedly decides whether each layer should remain independent or copy the weights of an earlier layer. The important claim is not just that sharing can work, but that the pattern of which layers deserve uniqueness can be discovered during training instead of fixed by hand.
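The core data structure can be sketched as a "tying map": for each layer, an index pointing at the (earlier or same) layer whose weights it uses. The names below are illustrative, not the paper's implementation, and the controller that chooses the map is omitted; this only shows how a tying decision collapses the set of unique layers.

```python
# Minimal sketch: tie[i] is the index of the layer whose weights layer i uses;
# tie[i] == i means the layer keeps its own independent parameters.
# Illustrative names only -- not the paper's actual code.

def resolve(tie, i):
    """Follow the tying map until reaching a layer that points at itself."""
    while tie[i] != i:
        i = tie[i]
    return i

def effective_layers(tie):
    """Indices of layers that actually own parameters under this tying map."""
    # A valid map only ever points backward (or at itself).
    assert all(j <= i for i, j in enumerate(tie))
    return sorted({resolve(tie, i) for i in range(len(tie))})

# 8-layer stack: layers 0 and 1 stay untied, layers 2-7 all reuse layer 2.
tie = [0, 1, 2, 2, 2, 2, 2, 2]
print(effective_layers(tie))  # → [0, 1, 2]  (3 unique layers instead of 8)
```

The controller in the paper revisits these decisions during training, so the map is a moving target rather than a fixed architecture choice.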

Why this matters for Parameter Golf

The most interesting implication is that depth uniqueness may be much sparser than we assume. If only a small minority of layers truly need their own parameters, then a hard-cap artifact may benefit more from a learned or adaptive tying pattern than from paying for a fully unique stack.

What to import

  • Full uniqueness is probably overbought. Many layers may contribute little that could not be recovered by reuse.
  • The tying pattern is itself an optimization target. We should not assume equally spaced sharing is optimal.
  • Shared training can regularize. Reuse is not only a memory trick; it can also improve optimization.
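Treating the tying pattern as an optimization target can be made concrete even without the paper's RL controller. As a lighter stand-in, the sketch below does random search over tying maps under a hard parameter budget; `score` is a placeholder for a real proxy (e.g. validation loss after a short training run), and all names are assumptions for illustration.

```python
# Random search over tying maps under a hard cap on unique layers.
# The paper uses an RL controller trained alongside the network; this is a
# deliberately simpler stand-in to show the search space, not the method.

import random

def random_tie(n_layers, rng):
    """Each layer either stays independent or points at a random earlier layer."""
    return [rng.randint(0, i) for i in range(n_layers)]

def _resolve(tie, i):
    while tie[i] != i:
        i = tie[i]
    return i

def unique_count(tie):
    return len({_resolve(tie, i) for i in range(len(tie))})

def score(tie):
    # Placeholder quality proxy: pretend early-layer uniqueness matters most.
    return sum(1.0 / (i + 1) for i in range(len(tie)) if tie[i] == i)

def search(n_layers=12, budget=4, trials=200, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        tie = random_tie(n_layers, rng)
        if unique_count(tie) <= budget:  # respect the hard parameter cap
            if best is None or score(tie) > score(best):
                best = tie
    return best
```

Swapping `score` for a short-training-run proxy turns this into exactly the kind of local loop the next section says we can iterate quickly.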

What not to over-import

The paper trains from scratch with a dynamic controller, which is heavier machinery than the kind of local experiment loop we can iterate quickly. Its exact RL apparatus is less likely to transfer than the broader lesson that adaptive sharing patterns may outperform fixed, hand-designed repetition.

Parameter Golf translation

A practical translation is to search not only over “shared vs unshared,” but over where uniqueness is most valuable:

  • a few untied early layers
  • a mostly shared middle trunk
  • tiny late-stage specialization

That structure could be cheaper than either a fully unique stack or a brutally uniform recurrent core.
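The savings from that three-zone structure are easy to put on the back of an envelope. The per-layer size below is an assumption (roughly a GPT-2-medium-like block); only the ratios between the three patterns matter.

```python
# Back-of-envelope unique-parameter counts for the three patterns above.
# PER_LAYER is an illustrative assumption (~12.6M params per transformer block);
# the comparison, not the absolute numbers, is the point.

PER_LAYER = 12_600_000
DEPTH = 24

def params(unique_layers):
    return unique_layers * PER_LAYER

fully_unique = params(DEPTH)        # every layer owns its weights
uniform_core = params(1)            # one block recurred 24 times
zoned        = params(2 + 1 + 1)    # 2 untied early + 1 shared trunk + 1 late

for name, p in [("fully unique", fully_unique),
                ("uniform core", uniform_core),
                ("zoned tying", zoned)]:
    print(f"{name:>13}: {p / 1e6:7.1f}M unique params")
```

The zoned pattern buys early-layer and late-layer specialization for a small multiple of the uniform core's budget, while staying far below the fully unique stack.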

David-Hay, T., & Wolf, L. (2024). Dynamic Layer Tying for Parameter-Efficient Transformers. arXiv preprint arXiv:2401.12819. https://arxiv.org/abs/2401.12819