10 items with this tag.
moonshots
Moonshot hypothesis that many apparently different weight tensors could be stored as one canonical prototype plus cheap transport maps, rather than as separate weights.
papers
Paper note on using reinforcement learning during training to decide which transformer layers should share weights and which should remain independent.
hypotheses
Hypothesis that tiny per-depth conditioning can recover much of the specialization lost by strict parameter sharing.
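A minimal sketch of what "tiny per-depth conditioning" could look like, assuming FiLM-style per-depth scale and shift vectors (that specific mechanism is an illustrative choice, not something the note specifies): one shared block is reused at every depth, and each depth pays only 2*H extra parameters instead of a full unique H*H block.

```python
import numpy as np

rng = np.random.default_rng(0)
H, DEPTH = 64, 6

# One shared block reused at every depth (strict parameter sharing)...
W_shared = rng.standard_normal((H, H)) / np.sqrt(H)

# ...plus tiny per-depth conditioning: a learned scale and shift per depth
# (FiLM-style; 2*H params per depth vs H*H for a fully unique block).
scales = np.ones((DEPTH, H))
shifts = np.zeros((DEPTH, H))

def forward(x):
    for d in range(DEPTH):
        h = np.tanh(x @ W_shared)
        x = scales[d] * h + shifts[d]   # cheap depth-specific modulation
    return x

x = rng.standard_normal(H)
y = forward(x)
print(y.shape)
```

The point of the arithmetic: all six depths of conditioning together (2 * 64 * 6 = 768 params) cost less than one extra unique 64x64 block (4096 params), which is the capacity-for-bytes trade the hypothesis is about.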
hypotheses
Hypothesis that storing fewer unique layers and spending the savings on width or lightweight per-layer adaptation is a better artifact trade than many fully unique blocks.
ideas
Hypothesis that shrinking tokenizer and LM-head burden, then reinvesting the saved bytes into a wider shared backbone, beats spending the same budget on a larger static head.
lanes
Why parameter sharing may be the cleanest way to buy width, extra compute, or light specialization under a hard artifact cap.
notes
Synthesis note on why shared-depth transformer designs are attractive under a hard artifact budget, and where they usually break.
papers
Paper note on cross-layer parameter sharing and factorized embeddings as two clean ways to reduce stored parameters without simply shrinking hidden capacity.
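Back-of-envelope arithmetic for the two techniques in this note, with illustrative (not paper-exact) sizes for vocab V, hidden width H, small embedding dim E, depth L, and a rough per-layer cost P:

```python
# Illustrative sizes; the per-layer cost P is an order-of-magnitude stand-in
# for attention + MLP parameters, not an exact architecture count.
V, H, E, L = 30_000, 768, 128, 12

# Factorized embeddings: the V*H token table becomes V*E + E*H.
full_embed = V * H                  # 23,040,000
factored_embed = V * E + E * H      #  3,938,304

# Cross-layer sharing: L unique blocks of P params become one shared block.
P = 12 * H * H
unique_layers = L * P
shared_layers = P

print(f"embedding: {full_embed:,} -> {factored_embed:,}")
print(f"layers:    {unique_layers:,} -> {shared_layers:,}")
```

Neither move shrinks the hidden width H itself, which is why the note frames them as ways to cut stored parameters without simply shrinking hidden capacity.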
papers
Paper note on learning structured parameter sharing with tensor decompositions and sparsity instead of treating sharing as all-or-nothing layer tying.
papers
Paper note on converting pretrained transformers into recursive/shared-parameter models with lightweight depth-specific relaxation.
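A hedged sketch of one plausible conversion recipe along these lines (the prototype-from-layer-mean initialization and the low-rank form of the relaxation are assumptions for illustration, not the paper's exact method): tie all layers to a shared prototype, then keep a small per-depth low-rank delta initialized from each layer's residual, to be fine-tuned afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
H, L, R = 32, 4, 2   # hidden size, depth, delta rank (all illustrative)

# Stand-in for pretrained per-layer weights (in practice: a loaded checkpoint).
pretrained = [rng.standard_normal((H, H)) / np.sqrt(H) for _ in range(L)]

# Step 1: tie all layers to one shared prototype (here: the layer-wise mean).
W_shared = np.mean(pretrained, axis=0)

# Step 2: lightweight depth-specific relaxation as low-rank deltas,
# initialized toward each layer's residual from the prototype.
deltas = []
for W in pretrained:
    U, S, Vt = np.linalg.svd(W - W_shared)
    A = U[:, :R] * S[:R]         # (H, R)
    B = Vt[:R, :]                # (R, H)
    deltas.append((A, B))        # effective layer d: W_shared + A @ B

# Storage: L unique blocks vs one shared block plus tiny per-depth deltas.
unique_cost = L * H * H
shared_cost = H * H + L * 2 * H * R
print(unique_cost, shared_cost)
```

The stored artifact shrinks from L*H*H to H*H + L*2*H*R, and the deltas are the "lightweight depth-specific relaxation" that keeps depths from collapsing into one identical function.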