Local SGD Converges Fast and Communicates Little

05/24/2018
by Sebastian U. Stich, et al.

Mini-batch stochastic gradient descent (SGD) is the state of the art in large-scale parallel machine learning, but its scalability is limited by a communication bottleneck. Recent work proposed local SGD, i.e. running SGD independently in parallel on different workers and averaging the iterates only once in a while. This scheme shows promising results in practice, but has eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients, that is, the scheme achieves linear speed-up in the number of workers and the mini-batch size. Moreover, the number of communication rounds can be reduced by up to a factor of T^{1/2} compared to mini-batch SGD, where T denotes the total number of steps.
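To make the scheme concrete, below is a minimal sketch of local SGD on a toy convex least-squares problem. The data, the step size, the number of workers K, and the synchronization interval H are illustrative assumptions and are not taken from the paper; only the overall structure (independent local SGD steps, periodic averaging of the iterates) follows the description above.

```python
import numpy as np

# Minimal local SGD sketch on a convex least-squares objective.
# All hyperparameters below are illustrative choices, not values from the paper.
rng = np.random.default_rng(0)
d, n_per_worker, K = 10, 200, 4    # dimension, samples per worker, number of workers
H, T, lr = 8, 400, 0.05            # sync interval, total steps, step size

# Each worker k holds its own data shard (A[k], b[k]).
x_true = rng.normal(size=d)
A = [rng.normal(size=(n_per_worker, d)) for _ in range(K)]
b = [A_k @ x_true + 0.1 * rng.normal(size=n_per_worker) for A_k in A]

def stochastic_grad(k, x):
    """Gradient of the squared loss on one uniformly sampled example of worker k."""
    i = rng.integers(n_per_worker)
    return (A[k][i] @ x - b[k][i]) * A[k][i]

x = np.zeros(d)                          # shared initial iterate
local = [x.copy() for _ in range(K)]     # per-worker copies

for t in range(1, T + 1):
    # Local step: every worker updates its own iterate independently.
    for k in range(K):
        local[k] -= lr * stochastic_grad(k, local[k])
    # Communication: average the local iterates only every H steps.
    if t % H == 0:
        x = np.mean(local, axis=0)
        local = [x.copy() for _ in range(K)]

x = np.mean(local, axis=0)
print("distance to optimum:", np.linalg.norm(x - x_true))
```

Compared to mini-batch SGD, which would average gradients at every step, this sketch communicates only once every H steps, which is the source of the reduced communication cost discussed in the abstract.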
