The Disharmony Between BN and ReLU Causes Gradient Explosion, but is Offset by the Correlation Between Activations

04/23/2023
by Inyoung Paik, et al.

Deep neural networks based on batch normalization and ReLU-like activation functions can experience instability during the early stages of training due to the high gradients induced by a temporary gradient explosion. We explain how ReLU reduces variance more than expected, and how batch normalization amplifies the gradient while recovering that variance, causing the gradient to explode even though forward propagation remains stable. Additionally, we discuss how the dynamics of a deep neural network change during training and how the correlation between inputs can alleviate this problem. Lastly, we propose an adaptive learning rate algorithm inspired by second-order optimization, which outperforms existing learning rate scaling methods in large-batch training and can also replace WarmUp in small-batch training.
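
The following is a minimal numerical sketch (not taken from the paper) of the variance-reduction effect the abstract refers to: for a standard normal pre-activation, the output variance of ReLU is 1/2 - 1/(2π) ≈ 0.34, noticeably less than the 1/2 one might naively expect, so a subsequent batch normalization layer must rescale by a correspondingly larger factor.

```python
import numpy as np

# Illustrative sketch: measure the variance of ReLU applied to a
# zero-mean, unit-variance Gaussian input, as would be produced by a
# preceding batch normalization layer.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

relu_out = np.maximum(x, 0.0)

print("empirical variance after ReLU :", relu_out.var())
print("closed form 1/2 - 1/(2*pi)    :", 0.5 - 1.0 / (2.0 * np.pi))  # ~0.341
```

Because this shrunken variance is renormalized at every layer, the per-layer rescaling factors compound in the backward pass, which is the mechanism the paper identifies as the source of the early-training gradient explosion.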
