Convergence of Alternating Gradient Descent for Matrix Factorization
We consider alternating gradient descent (AGD) with fixed step size η > 0, applied to the asymmetric matrix factorization objective. We show that, for a rank-r matrix 𝐀 ∈ ℝ^{m×n}, T = O((σ_1(𝐀)/σ_r(𝐀))^2 log(1/ϵ)) iterations of alternating gradient descent suffice to reach an ϵ-optimal factorization ‖𝐀 − 𝐗_T 𝐘_T^⊺‖_F^2 ≤ ϵ‖𝐀‖_F^2 with high probability, starting from an atypical random initialization. The factors have rank d > r, so that 𝐗_T ∈ ℝ^{m×d} and 𝐘_T ∈ ℝ^{n×d}. Experiments suggest that our proposed initialization is not merely of theoretical benefit, but significantly improves the convergence of gradient descent in practice. Our proof is conceptually simple: a uniform PL inequality and a uniform Lipschitz smoothness constant are guaranteed for a sufficient number of iterations, starting from our random initialization. Our proof method should be useful for extending and simplifying convergence analyses for a broader class of nonconvex low-rank factorization problems.
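To make the setup concrete, here is a minimal NumPy sketch of fixed-step alternating gradient descent on f(𝐗, 𝐘) = ½‖𝐀 − 𝐗𝐘^⊺‖_F^2. It is illustrative only: the function name `agd_factorization`, the plain Gaussian initialization (the abstract does not specify the paper's atypical initialization), and the step-size and iteration-count choices are all assumptions, not the paper's exact algorithm.

```python
import numpy as np

def agd_factorization(A, d, eta=0.005, T=3000, seed=0):
    """Alternating gradient descent on f(X, Y) = 0.5 * ||A - X @ Y.T||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Placeholder Gaussian initialization; the paper's "atypical" random
    # initialization is not described in the abstract.
    X = rng.standard_normal((m, d)) / np.sqrt(m)
    Y = rng.standard_normal((n, d)) / np.sqrt(n)
    for _ in range(T):
        R = X @ Y.T - A          # residual with current factors
        X = X - eta * R @ Y      # grad_X f = (X Y^T - A) Y
        R = X @ Y.T - A          # recompute residual with the fresh X
        Y = Y - eta * R.T @ X    # grad_Y f = (X Y^T - A)^T X
    return X, Y

# Example: rank-3 target, overparameterized factors with d = 6 > r.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
X, Y = agd_factorization(A, d=6)
rel_err = np.linalg.norm(A - X @ Y.T) ** 2 / np.linalg.norm(A) ** 2
print(f"relative squared error: {rel_err:.2e}")
```

Note the alternating structure: the gradient step in 𝐘 uses the freshly updated 𝐗, rather than a simultaneous update of both factors.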