Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation