High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize

04/06/2022
by Ali Kavis, et al.

In this paper, we propose a new, simplified high-probability analysis of AdaGrad for smooth, non-convex problems. More specifically, we focus on a particular accelerated gradient (AGD) template (Lan, 2020), through which we recover the original AdaGrad and its variant with averaging, and prove a convergence rate of 𝒪(1/√T) with high probability without knowledge of the smoothness constant or the variance. We use a particular version of Freedman's concentration bound for martingale difference sequences (Kakade & Tewari, 2008), which enables us to achieve the best-known dependence of log(1/δ) on the probability margin δ. We present our analysis in a modular way and obtain a complementary 𝒪(1/T) convergence rate in the deterministic setting. To the best of our knowledge, this is the first high-probability result for AdaGrad with a truly adaptive scheme, i.e., completely oblivious to the knowledge of smoothness and of a uniform variance bound, which simultaneously achieves the best-known log(1/δ) dependence. We further prove a noise adaptation property of AdaGrad under additional noise assumptions.
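For readers unfamiliar with the algorithm discussed in the abstract, the sketch below illustrates the scalar (norm-based) AdaGrad stepsize that is commonly analyzed in this line of work: the stepsize is driven entirely by accumulated gradient norms, so no smoothness or variance constants are needed. This is a minimal illustrative sketch under assumed constants (eta, b0) and a plain SGD-style loop; it does not reproduce the paper's accelerated gradient (AGD) template or its averaging variant.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, T, eta=1.0, b0=1e-6, rng=None):
    """Scalar (norm-based) AdaGrad: a minimal sketch, not the paper's AGD template.

    grad_fn(x, rng) returns a (possibly stochastic) gradient estimate at x.
    eta and b0 are illustrative constants; the update itself requires no
    knowledge of smoothness or of a variance bound.
    """
    x = np.asarray(x0, dtype=float).copy()
    acc = b0 ** 2                      # running sum of squared gradient norms
    for _ in range(T):
        g = grad_fn(x, rng)
        acc += float(np.dot(g, g))     # accumulate ||g_t||^2
        x -= (eta / np.sqrt(acc)) * g  # stepsize shrinks as gradients accumulate
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy_grad = lambda x, r: x + 0.1 * r.standard_normal(x.shape)
    print(adagrad_norm(noisy_grad, x0=np.ones(5), T=1000, rng=rng))
```

The key design point reflected here is adaptivity: because the stepsize η/√(b0² + Σ‖g_s‖²) is built from observed gradients only, the same scheme covers both the stochastic and deterministic settings analyzed in the paper.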
