Motivated by the striking ability of transformers for in-context learnin...
In this paper, we explore the structure of the penultimate Gram matrix i...
Particle gradient descent, which uses particles to represent a probabili...
To understand the essential role of depth in neural networks, we investi...
How to recover a probability measure with sparse support from particular...
This paper underlines a subtle property of batch-normalization (BN):
Suc...
Viewing optimization methods as numerical integrators for ordinary
diffe...
We study the mixing properties for stochastic accelerated gradient desce...
Normalization techniques such as Batch Normalization have been applied v...
Gradient-based optimization methods are the most popular choice for find...
We analyze the variance of stochastic gradients along negative curvature...
Information spreads across social and technological networks, but often ...