Secondary gradient descent in higher codimension
In this paper, we analyze discrete gradient descent and ϵ-noisy gradient descent on a special but important class of functions. We find that, when used to minimize a function L: R^n → R in this class, discrete gradient descent can exhibit strikingly different behavior from continuous gradient descent. On long time scales, discrete gradient descent and continuous gradient descent tend toward different global minima of L. Discrete gradient descent preferentially finds global minima at which the graph of L is shallowest, while gradient flow shows no such preference.
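The flavor of this phenomenon can be seen in a minimal one-dimensional sketch (not the paper's class of functions, and the piecewise loss below is a hypothetical stand-in): a loss with two global minima of equal value, one sharp and one shallow. Discrete gradient descent with step size larger than 2 divided by the curvature is unstable at the sharp minimum and escapes to the shallow one, while a very small step size, approximating gradient flow, converges to whichever minimum the trajectory starts near.

```python
def grad(x):
    # Gradient of a hypothetical toy loss with two global minima at value 0:
    # a sharp minimum at x = 1 (second derivative 16, for x > 0) and a
    # shallow minimum at x = -1 (second derivative 1, for x <= 0).
    return 16.0 * (x - 1.0) if x > 0 else (x + 1.0)

def gradient_descent(x, lr, steps=500):
    # Plain discrete gradient descent: x_{k+1} = x_k - lr * grad(x_k).
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

x0 = 1.3  # start inside the basin of the sharp minimum

# Small step size approximates gradient flow: the iterate stays at the
# sharp minimum, since lr * 16 < 2 makes that fixed point stable.
small_lr = gradient_descent(x0, lr=0.01)

# Larger step size: lr * 16 > 2, so the sharp minimum is unstable for the
# discrete dynamics, and the iterate escapes to the shallow minimum.
large_lr = gradient_descent(x0, lr=0.2)

print(round(small_lr, 3), round(large_lr, 3))  # → 1.0 -1.0
```

The design point of the sketch is that the stability threshold of discrete gradient descent depends on the curvature at a minimum, which is one mechanism by which a finite step size can select shallower minima than gradient flow does.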