2 items with this tag.
papers
Paper note on why training compact models has a different systems-bottleneck profile than intuitions from large-model training suggest.
papers
Paper note on reducing output-layer memory and logits cost by restructuring vocabulary prediction.