14 items with this tag.
ideas
Original, falsifiable research bets that combine multiple papers and lanes into concrete Parameter Golf directions.
hypotheses
Hypothesis that a smaller recurrent model with bounded extra evaluation-time refinement can beat a larger static artifact under the same storage cap.
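A minimal sketch of the storage comparison this hypothesis implies, assuming a PyTorch-style setup; the module names, sizes, and the fixed refinement budget `k_steps` are illustrative assumptions, not taken from any specific paper.

```python
import torch
import torch.nn as nn

class RecurrentRefiner(nn.Module):
    """One stored block, re-applied k_steps times at evaluation time.

    Storage cost is that of a single block; the extra refinement steps
    spend compute rather than bytes. k_steps is the bounded budget."""
    def __init__(self, d_model: int = 256, k_steps: int = 4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.k_steps = k_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-apply the same stored weights k_steps times (weight-shared depth).
        for _ in range(self.k_steps):
            x = x + self.block(self.norm(x))
        return x

small_recurrent = RecurrentRefiner(d_model=256, k_steps=4)
# A "larger static" baseline with 4 unique blocks stores roughly 4x the bytes
# for the same effective depth.
larger_static = nn.Sequential(*[RecurrentRefiner(d_model=256, k_steps=1) for _ in range(4)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"recurrent params: {count(small_recurrent):,}  static params: {count(larger_static):,}")
```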
hypotheses
Hypothesis that compressing or restructuring the LM head can beat modest backbone improvements in compact language models.
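One concrete form of head restructuring is a low-rank factorization of the output projection; the sizes and bottleneck rank below are illustrative assumptions.

```python
import torch.nn as nn

d_model, vocab = 512, 32_000

# Dense LM head: d_model * vocab stored weights.
dense_head = nn.Linear(d_model, vocab, bias=False)

# Low-rank replacement: factor through a small bottleneck r << d_model,
# cutting stored head parameters by roughly a factor of d_model / r.
r = 64
factored_head = nn.Sequential(
    nn.Linear(d_model, r, bias=False),
    nn.Linear(r, vocab, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense_head), count(factored_head))  # 16,384,000 vs 2,080,768
```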
hypotheses
Hypothesis that tiny per-depth conditioning can recover much of the specialization lost by strict parameter sharing.
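A sketch of the cheapest version of this idea: one shared block plus a per-depth scale vector, so conditioning costs only depth × d_model extra parameters. Module names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SharedBlockWithDepthConditioning(nn.Module):
    def __init__(self, d_model: int = 256, n_steps: int = 8):
        super().__init__()
        # One block's weights are stored and reused at every depth step.
        self.shared_block = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Tiny per-depth conditioning: one scale vector per step,
        # n_steps * d_model extra parameters in total.
        self.depth_scales = nn.Parameter(torch.ones(n_steps, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for scale in self.depth_scales:
            # The same weights see a differently-scaled input at each depth,
            # which is the specialization this hypothesis tries to recover.
            x = x + self.shared_block(x * scale)
        return x

m = SharedBlockWithDepthConditioning()
extra = m.depth_scales.numel()
total = sum(p.numel() for p in m.parameters())
print(f"conditioning params: {extra:,} of {total:,} total")
```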
hypotheses
Concrete architecture hypothesis: use aggressive depth sharing to buy much more width, then spend leftover bytes on stability and selective precision.
hypotheses
Hypothesis that storing fewer unique layers and spending the savings on width or lightweight per-layer adaptation is a better artifact trade than storing many fully unique blocks.
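A back-of-the-envelope version of the trade in the last two entries, assuming a standard transformer block costs roughly 12·d² weights (attention plus MLP, ignoring norms and biases); every number below is illustrative.

```python
# Rough per-block parameter count for a standard transformer block.
def block_params(d_model: int) -> int:
    return 12 * d_model * d_model

# Baseline: 12 unique blocks at d_model = 512.
baseline = 12 * block_params(512)       # ~37.7M stored weights

# Shared-depth alternative: store 2 unique blocks (cycled 6x at run time)
# and reinvest the savings in width under the same storage cap.
d_wider = 1254                          # ~512 * sqrt(6), so 2 blocks fit the same budget
shared = 2 * block_params(d_wider)

print(f"baseline: {baseline:,}  shared + wider: {shared:,}")
```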
hypotheses
Hypothesis that extra RMSNorm before projections improves post-roundtrip quality by stabilizing low-bit training and export.
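A sketch of where the extra normalization would sit, with RMSNorm written out by hand so the example does not depend on a recent PyTorch version; the placement and sizes are assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class ProjectionWithPreNorm(nn.Module):
    """Extra RMSNorm directly in front of a projection that will later be
    quantized, keeping the projection's input scale bounded so low-bit
    training and export see fewer activation outliers."""
    def __init__(self, d_in: int = 512, d_out: int = 512):
        super().__init__()
        self.pre_norm = RMSNorm(d_in)   # the "extra" norm this hypothesis tests
        self.proj = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.pre_norm(x))

x = torch.randn(2, 16, 512) * 50.0       # deliberately badly scaled activations
print(x.std(), ProjectionWithPreNorm()(x).std())  # output scale stays tame after the norm
```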
hypotheses
Hypothesis that protecting a tiny subset of highly sensitive parameters buys disproportionately large quality gains under a strict artifact cap.
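A toy illustration of the byte trade: quantize everything to int8 except the small fraction of weights with the largest magnitude (a cheap stand-in for "sensitivity"), which stay in fp16. The 1% threshold and the magnitude proxy are assumptions.

```python
import torch

def mixed_precision_pack(w: torch.Tensor, keep_frac: float = 0.01):
    """Quantize a weight matrix to int8 except for the keep_frac most
    sensitive entries (largest magnitude as a proxy), kept in fp16."""
    flat = w.flatten()
    k = max(1, int(keep_frac * flat.numel()))
    protected_idx = flat.abs().topk(k).indices          # indices of protected weights

    scale = flat.abs().max() / 127.0
    q = torch.clamp((flat / scale).round(), -127, 127).to(torch.int8)

    protected_vals = flat[protected_idx].to(torch.float16)
    bytes_used = q.numel() * 1 + protected_vals.numel() * 2 + protected_idx.numel() * 4
    return q, scale, protected_idx, protected_vals, bytes_used

w = torch.randn(512, 512)
*_, bytes_used = mixed_precision_pack(w)
print(f"{bytes_used / (w.numel() * 2):.2%} of the fp16 size")   # roughly half the fp16 bytes
```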
hypotheses
Synthesis hypothesis that the strongest compact artifacts will combine shared depth, activation discipline, selective precision, and cheap specialization rather than relying on one trick alone.
ideas
Hypothesis that most head-side quantization damage is concentrated in a tiny set of difficult token rows, making row-level protection a better byte trade than uniform head precision.
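A sketch of the row-level mechanism: score each token row of the head by its round-trip quantization damage, then keep only the worst rows in fp16. The 2% fraction and per-row int8 scheme are assumptions; in practice the "difficult" rows would be identified from actual logit degradation.

```python
import torch

vocab, d_model = 32_000, 512
head = torch.randn(vocab, d_model)

# Per-row int8 round trip of the LM head.
scales = head.abs().amax(dim=1, keepdim=True) / 127.0
roundtrip = torch.clamp((head / scales).round(), -127, 127) * scales

# Score each token row by how much the round trip damaged it, then keep the
# worst ~2% of rows in fp16 on the side instead of raising precision everywhere.
row_damage = (head - roundtrip).norm(dim=1)
k = int(0.02 * vocab)
protected = row_damage.topk(k).indices
fp16_rows = head[protected].to(torch.float16)

int8_bytes = head.numel()                                # 1 byte per weight
extra_bytes = fp16_rows.numel() * 2 + protected.numel() * 4
print(f"row protection adds {extra_bytes / int8_bytes:.1%} over the int8 head")
```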
ideas
Hypothesis that one small learned codebook bank shared across repeated blocks can beat per-matrix quantization by amortizing metadata and aligning compression with shared-depth structure.
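A minimal sketch of one codebook shared across all repeated blocks, fit with a few Lloyd (k-means) steps over weight sub-vectors; the group size, codebook size, and iteration count are assumptions.

```python
import torch

def build_shared_codebook(weights, group: int = 8, n_codes: int = 256, iters: int = 10):
    """Fit one codebook of n_codes group-sized sub-vectors over the weights
    of all repeated blocks at once, so codebook metadata is paid for once."""
    chunks = torch.cat([w.reshape(-1, group) for w in weights], dim=0)
    codes = chunks[torch.randperm(len(chunks))[:n_codes]].clone()    # k-means init
    for _ in range(iters):                                           # a few Lloyd steps
        assign = torch.cdist(chunks, codes).argmin(dim=1)
        for c in range(n_codes):
            members = chunks[assign == c]
            if len(members):
                codes[c] = members.mean(dim=0)
    return codes

def encode(w: torch.Tensor, codes: torch.Tensor, group: int = 8):
    chunks = w.reshape(-1, group)
    return torch.cdist(chunks, codes).argmin(dim=1)   # 1 byte per group if n_codes <= 256

# Two unique blocks, reused many times at run time, share a single codebook.
blocks = [torch.randn(512, 512), torch.randn(512, 512)]
codes = build_shared_codebook(blocks)
indices = [encode(w, codes) for w in blocks]
codebook_bytes = codes.numel() * 2                    # fp16 codebook, stored once
index_bytes = sum(i.numel() for i in indices)         # one uint8 index per group
print(codebook_bytes, index_bytes)
```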
ideas
Hypothesis that shrinking tokenizer and LM-head burden, then reinvesting the saved bytes into a wider shared backbone, beats spending the same budget on a larger static head.
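A rough byte accounting of this reinvestment at int8 storage (1 byte per weight), reusing the ~12·d² per-block estimate from the earlier sketch; vocabulary sizes and widths are illustrative assumptions.

```python
def embed_and_head_bytes(vocab: int, d_model: int) -> int:
    return 2 * vocab * d_model             # input embedding + (untied) LM head

def shared_backbone_bytes(d_model: int, n_unique_blocks: int = 2) -> int:
    return n_unique_blocks * 12 * d_model * d_model

# Baseline: 32k-token vocabulary at d_model = 512.
baseline = embed_and_head_bytes(32_000, 512) + shared_backbone_bytes(512)

# Alternative: shrink the vocabulary to 8k and reinvest the saved bytes in width.
alt_d = 984                                # chosen so the totals roughly match
alternative = embed_and_head_bytes(8_000, alt_d) + shared_backbone_bytes(alt_d)

print(f"baseline: {baseline:,} bytes   smaller vocab + wider backbone: {alternative:,} bytes")
```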
ideas
Hypothesis that shared-depth models can recover most layer-role specialization using only per-step RMSNorm and tiny channel gates, with almost no byte cost.
ideas
Hypothesis that a compact shared-depth model should spend extra inference-time passes only on uncertain positions, turning compute into quality more efficiently than storing more static depth.
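A sketch of the gating logic, assuming entropy of the current next-token distribution as the uncertainty signal; the threshold, sizes, and step budget are assumptions, and this toy version computes the extra pass densely and merges it with a mask rather than gathering only the uncertain positions.

```python
import torch
import torch.nn as nn

class UncertaintyGatedRefiner(nn.Module):
    """One shared block; positions whose next-token distribution is still
    high-entropy get extra refinement passes, confident positions do not."""
    def __init__(self, d_model: int = 256, vocab: int = 1000,
                 entropy_threshold: float = 4.0, max_extra_steps: int = 3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.head = nn.Linear(d_model, vocab, bias=False)
        self.entropy_threshold = entropy_threshold
        self.max_extra_steps = max_extra_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.block(x)                                # one pass for every position
        for _ in range(self.max_extra_steps):
            probs = self.head(x).softmax(dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
            uncertain = entropy > self.entropy_threshold     # (batch, seq) mask
            if not uncertain.any():
                break
            # Toy version: compute the pass everywhere, keep it only where uncertain.
            # A real implementation would gather just the uncertain positions.
            refined = x + self.block(x)
            x = torch.where(uncertain.unsqueeze(-1), refined, x)
        return x

model = UncertaintyGatedRefiner()
hidden = torch.randn(2, 16, 256)
print(model(hidden).shape)
```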