Stable Weight Decay Regularization

11/23/2020
by Zeke Xie, et al.

Weight decay is a popular regularization technique for training deep neural networks. Modern deep learning libraries mainly use L_2 regularization as the default implementation of weight decay. <cit.> demonstrated that L_2 regularization is not identical to weight decay for adaptive gradient methods, such as Adaptive Momentum Estimation (Adam), and proposed Adam with Decoupled Weight Decay (AdamW). However, we found that the popular implementations of weight decay in modern deep learning libraries, including L_2 regularization and decoupled weight decay, usually damage performance. First, L_2 regularization yields unstable weight decay for all optimizers that use momentum, such as stochastic gradient descent (SGD). Second, decoupled weight decay is highly unstable for all adaptive gradient methods. We further propose the Stable Weight Decay (SWD) method to fix the unstable weight decay problem from a dynamical perspective. The proposed SWD method makes significant improvements over L_2 regularization and decoupled weight decay in our experiments. Simply fixing weight decay in Adam via SWD, with no extra hyperparameter, usually outperforms complex Adam variants that carry more hyperparameters.
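The distinction between the two implementations discussed above is easiest to see in code. The sketch below contrasts them inside a single Adam step: in the L_2 variant the decay term is folded into the gradient and is therefore rescaled by the adaptive denominator, whereas in the decoupled (AdamW-style) variant the decay acts directly on the weights. This is an illustrative sketch only; the function name, defaults, and NumPy usage are assumptions of ours, and it does not show the paper's SWD method.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, wd=1e-2, mode="decoupled"):
    """One Adam update illustrating two common weight-decay variants.

    mode="l2":        decay enters the gradient before the moment estimates
                      (classic L_2 regularization).
    mode="decoupled": decay is applied directly to the weights, bypassing
                      the adaptive denominator (AdamW-style).
    Illustrative sketch only, not the paper's SWD implementation.
    """
    beta1, beta2 = betas
    if mode == "l2":
        grad = grad + wd * theta              # decay gets rescaled by 1/sqrt(v) below
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if mode == "decoupled":
        theta = theta - lr * wd * theta       # decay applied outside the adaptive step
    return theta, m, v
```

The only difference between the two branches is where the `wd * theta` term enters the update; the abstract's claim is that neither placement keeps weight decay stable during training, which is the problem SWD is designed to fix.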
