Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent
In this paper, we study the stochastic gradient descent (SGD) method for nonconvex statistical optimization problems from a diffusion approximation point of view. Using the theory of large deviations for random dynamical systems, we prove that, in the small stepsize regime and in the presence of omnidirectional noise, SGD started from a local minimizer (resp. saddle point) escapes in a number of iterations that depends exponentially (resp. linearly) on the inverse stepsize. We take deep neural networks as an example to study this phenomenon. Based on a new analysis of the mixing rate of multidimensional Ornstein-Uhlenbeck processes, our theory substantiates a very recent empirical observation by Keskar et al. (2016) that using large batch sizes in synchronous training of deep neural networks leads to poor generalization.
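To make the escape-time phenomenon concrete, below is a minimal simulation sketch (not the paper's construction): SGD with isotropic ("omnidirectional") Gaussian gradient noise is started at a strict saddle point of a simple nonconvex test function, and the number of iterations needed to leave a neighborhood of the saddle is measured for several stepsizes. The test function, the escape radius, the noise level, and the helper names (grad_f, escape_iters) are illustrative assumptions, not quantities from the paper.

```python
# Illustrative sketch only: noisy SGD escaping a saddle point.
# Test function f(x, y) = x^4/4 - x^2/2 + y^2/2 has a saddle at the
# origin and local minima at (+-1, 0); all constants are assumptions.
import numpy as np


def grad_f(w):
    x, y = w
    # Gradient of f(x, y) = x^4/4 - x^2/2 + y^2/2
    return np.array([x**3 - x, y])


def escape_iters(eta, sigma=0.1, radius=0.5, max_iter=10**6, rng=None):
    """Iterations for noisy SGD, started at the saddle (0, 0),
    to leave a ball of the given radius around the saddle."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(2)
    for k in range(1, max_iter + 1):
        noise = sigma * rng.standard_normal(2)  # omnidirectional noise
        w = w - eta * (grad_f(w) + noise)       # one SGD step
        if np.linalg.norm(w) > radius:
            return k
    return max_iter


rng = np.random.default_rng(0)
for eta in [0.1, 0.05, 0.025, 0.0125]:
    mean_k = np.mean([escape_iters(eta, rng=rng) for _ in range(50)])
    # Empirically, the mean escape time from the saddle grows roughly
    # in proportion to 1/eta (up to a logarithmic factor).
    print(f"stepsize {eta:>7}: mean escape iterations ~ {mean_k:.0f}")
```

Running the sketch with shrinking stepsizes should show escape times from the saddle growing roughly linearly in the inverse stepsize, in contrast to the exponentially long escape times the paper proves for local minimizers.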