FiPS: Learning Parameter Sharing with Tensor Decompositions and Sparsity (Üyük et al., 2024)

Sources: arXiv:2411.09816 · alphaXiv overview

Core contribution

FiPS argues that parameter sharing should be learned at a finer grain than whole-layer tying. By combining tensor decomposition and sparsity, it tries to preserve useful shared structure while still letting different parts of the network specialize.
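To make the idea concrete, here is a minimal sketch of fine-grained sharing: several layers reconstruct their weights from one shared basis, while a sparse, layer-specific factor provides the specialization. All shapes, names (`shared_basis`, `layer_weight`), and the density choice are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n_layers = 64, 16, 4          # hidden dim, shared-basis rank, layer count
density = 0.25                      # fraction of nonzeros kept per layer

# One basis shared by every layer (hypothetical shapes, not the paper's exact ones).
shared_basis = rng.normal(size=(d, k))

def layer_weight(layer_id):
    """Reconstruct a d x d weight from the shared basis and a sparse,
    layer-specific factor: W_l = B @ S_l, with S_l mostly zero."""
    factor = rng.normal(size=(k, d))
    mask = rng.random(size=(k, d)) < density   # sparse specialization
    return shared_basis @ (factor * mask)

weights = [layer_weight(i) for i in range(n_layers)]

# Rough storage comparison: fully independent layers vs. basis + nonzeros.
dense_params = n_layers * d * d
shared_params = d * k + n_layers * int(density * k * d)
print(dense_params, shared_params)
```

Even in this toy setting, the shared variant stores a fraction of the values while still giving each layer a distinct effective weight matrix.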

Why this matters for Parameter Golf

This paper is valuable less as a literal recipe and more as a search-space expansion. It shows that the real choice is not limited to:

  • full independent layers, or
  • one shared recurrent block

There is a richer middle ground where the model shares bases, factors, or subspaces, and that flexibility is exactly what could make recursive width scaling more robust under a hard artifact cap.

What to import

  • Sharing should be structured, not binary.
  • Sparsity is a natural companion to reuse. Shared structure and selective specialization can coexist.
  • Nearby functions may want nearby factors. Grouped or local sharing may be easier to exploit than arbitrary global tying.
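The "nearby functions, nearby factors" point can be sketched as a grouping rule: neighboring layers reuse one basis, distant layers get a different one. The function name `grouped_bases`, the group size, and the shapes are hypothetical; only the grouping idea comes from the notes above.

```python
import numpy as np

def grouped_bases(n_layers, group_size, d, k, seed=0):
    """One shared basis per group of neighboring layers: layers in the
    same group reuse a basis, layers in different groups do not."""
    rng = np.random.default_rng(seed)
    n_groups = -(-n_layers // group_size)          # ceiling division
    bases = [rng.normal(size=(d, k)) for _ in range(n_groups)]
    assignment = [layer // group_size for layer in range(n_layers)]
    return bases, assignment

# 8 layers, groups of 2: adjacent layers share, distant layers specialize.
bases, assignment = grouped_bases(n_layers=8, group_size=2, d=64, k=16)
print(assignment)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The grouping here stores 4 bases instead of 8 independent ones or a single fully tied one, sitting between the two extremes listed earlier.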

What not to over-import

Tensor decompositions can introduce implementation and metadata complexity, and success in a paper setting does not mean the same factorization is the best use of bytes in a submission artifact. The main transferable lesson is to stop equating compression with strict tying.

Parameter Golf translation

FiPS suggests exploring designs like:

  • shared MLP bases with lightweight depth-specific factors
  • grouped sharing where neighboring steps share more than distant steps
  • partial tying that spends a small amount of bytes on specialization rather than on fully separate blocks
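The third design above can be checked with back-of-envelope byte accounting: one shared block plus a tiny low-rank delta per recursion step, compared against fully separate blocks. The dimensions, rank, step count, and fp16 storage assumption are all illustrative, not taken from the paper.

```python
# Hypothetical byte accounting for partial tying under a hard artifact cap.
BYTES_PER_PARAM = 2                  # assume fp16 storage

def block_params(d):
    """Parameter count of one dense d x d block."""
    return d * d

def low_rank_delta_params(d, r):
    """Per-step specialization: two d x r factors forming a rank-r update."""
    return 2 * d * r

d, steps, rank = 512, 6, 8

# Fully separate blocks: one dense block per step.
separate_bytes = steps * block_params(d) * BYTES_PER_PARAM

# Partial tying: one shared block plus a cheap delta per step.
partial_bytes = (block_params(d) + steps * low_rank_delta_params(d, rank)) * BYTES_PER_PARAM

print(separate_bytes, partial_bytes)
```

Under these made-up numbers, the specialization deltas cost a small slice of the budget while most bytes buy the shared block, which is the trade the bullet list argues for.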

Reference

Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning parameter sharing with tensor decompositions and sparsity. arXiv preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816