Recent advances in machine translation (MT) have shown that Minimum Baye...
In this work, we provide a large-scale empirical study of the scaling pr...
The rapid scaling of language models is motivating research using low-bi...
Recent research has proposed a series of specialized optimization algori...
Very little is known about the training dynamics of adaptive gradient me...
In this work, we study the effect of varying the architecture and traini...
Natural language understanding and generation models follow one of the t...
In this work, we study the evolution of the loss Hessian across many cla...
We present an empirical study of scaling properties of encoder-decoder T...
For a certain scaling of the initialization of stochastic gradient desce...
We study the supervised learning problem under either of the following t...
We consider the problem of learning an unknown function f_ on the d-dime...
To understand the dynamics of optimization in deep neural networks, we d...
We study estimation of the covariance matrix under relative condition nu...
Topic models are Bayesian models that are frequently used to capture the...