Sources: arXiv:1909.11942 · alphaXiv overview
Core contribution
ALBERT introduces two especially durable compression ideas:
- cross-layer parameter sharing, which reduces duplicated depth parameters
- factorized embedding parameterization, which decouples vocabulary size from hidden size, lowering the cost of large vocabularies without shrinking the model's hidden dimension
Although the paper is framed around BERT-style pretraining, both ideas remain highly relevant to any setting where stored weights, especially embedding and depth weights, are disproportionately expensive.
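The factorization idea is easiest to see as arithmetic. A minimal sketch, using BERT-base-like numbers (V=30000, H=768) and ALBERT's E=128 as illustrative values, not a claim about any specific checkpoint:

```python
def embedding_params(V, H, E=None):
    """Stored embedding parameters: V*H unfactorized, V*E + E*H factorized."""
    if E is None:
        return V * H          # one big V x H lookup table
    return V * E + E * H      # small V x E table plus an E x H projection

V, H, E = 30000, 768, 128

unfactorized = embedding_params(V, H)       # 23,040,000 parameters
factorized = embedding_params(V, H, E)      # 3,938,304 parameters
print(unfactorized, factorized, round(unfactorized / factorized, 2))
```

Because V dominates, shrinking only the table width E buys almost a 6x reduction here while the trunk still sees full-width H vectors after projection.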
Why this matters for Parameter Golf
ALBERT is one of the cleanest older precedents for the central artifact-cap intuition in this garden: the best way to save parameters is often not to make every dimension smaller. Instead, identify which parts of the network are structurally repetitive or over-parameterized and compress those first.
What to import
- Depth parameters can be shared much more aggressively than naive design suggests.
- Embedding cost deserves separate treatment. Factorization is often a better lever than blunt vocabulary shrinkage.
- Stored capacity and representational capacity are not the same thing.
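The last point can be made concrete with cross-layer sharing: one weight matrix is stored once but applied at every layer, so effective depth exceeds stored depth. A toy pure-Python sketch with hypothetical 2x2 weights (not ALBERT's actual block, which shares full transformer layers):

```python
def matvec(W, x):
    # Plain nested-list matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def shared_stack(W, x, depth):
    """Apply the single shared layer `depth` times (ALBERT-style sharing)."""
    for _ in range(depth):
        x = relu(matvec(W, x))
    return x

W = [[0.5, -0.2], [0.1, 0.3]]   # the only stored layer weights
x = [1.0, 2.0]

stored_params = sum(len(row) for row in W)   # 4, independent of depth
y = shared_stack(W, x, depth=12)             # 12 effective layers of compute
print(stored_params, y)
```

Stored capacity stays at one layer's worth of bytes no matter how deep the unrolled computation is; whether the resulting representational capacity suffices is exactly what ALBERT's experiments probe.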
What not to over-import
ALBERT is an encoder paper, not a direct recipe for an autoregressive Parameter Golf submission. It does not settle whether the same sharing pattern is optimal for decoder-only language models. Its main value here is conceptual and architectural, not prescriptive.
Best synthesis links
- Serves as an early precursor to Relaxed Recursive Transformers and Fine-grained Parameter Sharing.
- Bridges recursive sharing with tokenizer and vocabulary efficiency because it attacks both depth duplication and embedding cost.
- Supports recursive layer sharing and tokenizer efficiency from a parameter-budget perspective.
Parameter Golf translation
ALBERT suggests two recurring questions:
- Are we paying too many bytes for repeated depth-specific weights that could be shared or lightly relaxed?
- Are embeddings and output-side matrices being compressed with enough care relative to the trunk?
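Both questions can be answered with back-of-envelope budgeting before any training run. A sketch for a hypothetical small decoder-only config, using the standard ~12*H^2 approximation for per-layer attention plus MLP weights; all sizes are assumptions, and this is not ALBERT's prescribed recipe for decoders:

```python
def budget(V, H, layers, shared=False, E=None):
    """Approximate stored weight count for a decoder-only trunk + embeddings."""
    per_layer = 12 * H * H                      # attention + MLP weights, roughly
    trunk = per_layer * (1 if shared else layers)
    emb = V * H if E is None else V * E + E * H  # optionally factorized
    return trunk + emb

V, H, L = 32000, 512, 12
baseline = budget(V, H, L)                      # 54,132,736
golfed = budget(V, H, L, shared=True, E=64)     # 5,226,496
print(baseline, golfed)
```

Running both levers together takes this toy budget down roughly 10x, which is the kind of headroom the two questions above are meant to surface.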
Related
- Universal Transformers
- Relaxed Recursive Transformers
- Fine-grained Parameter Sharing
- Vocabulary Compression for Low-Compute Environments
- Recursive and shared-parameter architectures
- Tokenizer and vocabulary efficiency
- Recursive width scaling