How to Train 10,000-Layer Vanilla Convolutional Neural Networks
QUESTION
- Can we train architectures that are currently untrainable?
- Can we eliminate the need to search over hyperparameters?
- Can we disentangle trainability, expressivity, and generalization?
MOTIVATION: train the untrainable, eliminate hyperparameters, disentangle contributions to success
- Signal propagation in deep networks: predicts trainability by examining whether correlations between inputs survive with depth (see the correlation sketch after this list)
- Mean field analysis: predicts when gradients vanish or explode with depth
- Dynamical isometry: ensures a well-conditioned input-output Jacobian (see the Jacobian sketch below)
- Delta-orthogonal initialization: guarantees survival of Fourier modes and enables training of 10,000-layer vanilla CNNs
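To make the signal-propagation claim concrete, here is a minimal sketch (not the paper's code) that tracks the correlation between two inputs as they pass through a random fully-connected tanh network, a simplification of the paper's CNN setting; the width, depth, and (sigma_w, sigma_b) values are illustrative assumptions:

```python
# Minimal sketch, assuming a fully-connected tanh net with i.i.d.
# Gaussian weights; all sizes and variances are illustrative.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 1024, 200
sigma_w, sigma_b = 1.5, 0.05

# Two inputs with initial correlation of roughly 0.6.
x = rng.standard_normal(width)
y = 0.6 * x + 0.8 * rng.standard_normal(width)

def corr(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

h1, h2 = x, y
for layer in range(1, depth + 1):
    W = sigma_w / np.sqrt(width) * rng.standard_normal((width, width))
    b = sigma_b * rng.standard_normal(width)
    h1, h2 = np.tanh(W @ h1 + b), np.tanh(W @ h2 + b)
    if layer % 50 == 0:
        print(f"layer {layer:4d}  corr = {corr(h1, h2):.4f}")
```

If the correlation reaches its fixed point long before the final layer, all inputs look alike and the network is untrainable at that depth; sweeping sigma_w above and below criticality exposes the ordered and chaotic phases that the mean field analysis separates.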
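Dynamical isometry can be probed the same way: the hedged sketch below accumulates the end-to-end Jacobian of a toy tanh net and compares how spread out its singular values are under Gaussian vs. orthogonal weights (helper names and sizes are assumptions, not the paper's code):

```python
# Hedged sketch: conditioning of the input-output Jacobian of a deep
# tanh net under two weight initializations. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
width, depth = 256, 50

def random_orthogonal(n):
    # Orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def jacobian_svals(make_W):
    h = rng.standard_normal(width)
    J = np.eye(width)
    for _ in range(depth):
        W = make_W()
        h = np.tanh(W @ h)
        D = np.diag(1.0 - h ** 2)  # tanh'(pre) = 1 - tanh(pre)**2
        J = D @ W @ J              # chain rule, layer by layer
    return np.linalg.svd(J, compute_uv=False)

s_gauss = jacobian_svals(lambda: rng.standard_normal((width, width)) / np.sqrt(width))
s_orth = jacobian_svals(lambda: random_orthogonal(width))
print("Gaussian   sigma_max/sigma_min:", s_gauss[0] / s_gauss[-1])
print("orthogonal sigma_max/sigma_min:", s_orth[0] / s_orth[-1])
```

A spectrum clustered near a single value means gradient signal in every direction is preserved, which is the property the delta-orthogonal construction carries over to convolutions.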
CONCLUSION
- Developed a mean field theory to understand signal propagation in deep CNNs
- Developed a connection between Fourier modes and generalization
- Two new initialization methods:
  - Random orthogonal kernels
  - Delta-orthogonal kernels (see the sketch after this list)
- Trained a 10,000-layer tanh CNN w/o batch norm or residual connections, and w/o reduction in test accuracy
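A minimal sketch of a delta-orthogonal kernel in the spirit of the paper: an orthogonal matrix sits at the spatial center of the kernel and every other tap is zero, so the convolution's transfer function is the same orthogonal matrix at every Fourier mode. The (k, k, c_in, c_out) layout and the helper name are illustrative assumptions:

```python
# Minimal sketch, assuming kernel layout (k, k, c_in, c_out) with
# c_out >= c_in; the function name is hypothetical.
import numpy as np

def delta_orthogonal(k, c_in, c_out, rng=None):
    assert c_out >= c_in, "delta-orthogonal needs c_out >= c_in"
    if rng is None:
        rng = np.random.default_rng()
    # Matrix with orthonormal columns via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((c_out, c_in)))
    q *= np.sign(np.diag(r))
    kernel = np.zeros((k, k, c_in, c_out))
    kernel[k // 2, k // 2] = q.T  # all weight at the spatial center
    return kernel

w = delta_orthogonal(3, 64, 128, np.random.default_rng(0))
print(w.shape)  # (3, 3, 64, 128)
```

Because every spatial tap except the center is zero, a stack of such layers composes orthogonal maps exactly, which is what lets signal and gradients survive 10,000 layers.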