Momentum is known to accelerate the convergence of gradient descent in s...
Local SGD is a communication-efficient variant of SGD for large-scale tr...
It is believed that Gradient Descent (GD) induces an implicit bias towar...
Saliency methods compute heat maps that highlight portions of an input t...
Normalization layers (e.g., Batch Normalization, Layer Normalization) we...
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differen...
The generalization mystery of overparametrized deep nets has motivated e...
Matrix factorization is a simple and natural test-bed to investigate the...
Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popula...
Recent works on implicit regularization have shown that gradient descent...
Batch Normalization (BN) has become a cornerstone of deep learning acros...
In a directed graph G=(V,E) with a capacity on every edge, a bottleneck ...
In this paper we study the fine-grained complexity of finding exact and ...