Contrastive learning has gained significant attention as a method for se...
Large language models like GPT-4 exhibit emergent capabilities across ge...
Chain-of-thought (CoT) is a method that enables language models to handl...
In this paper, we propose MPC (Modular Prompted Chatbot), a new approach...
Recent research has shown that training low-rank neural networks can eff...
Feature normalization transforms such as Batch and Layer-Normalization h...
We present a framework for using transformer networks as universal compu...
In-context learning (ICL) is a type of prompting where a transformer mod...
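A minimal sketch of what such a prompt can look like, assuming the standard few-shot formulation of ICL; the task and examples below are illustrative and not drawn from this abstract:

    # Hypothetical few-shot prompt: the labeled demonstrations define the task
    # (sentiment classification) entirely in-context; no weights are updated.
    prompt = (
        "Review: The food was wonderful. Sentiment: positive\n"
        "Review: Service was painfully slow. Sentiment: negative\n"
        "Review: Great value for the price. Sentiment:"
    )
    # A pretrained language model conditioned on this prompt is expected to
    # complete the final line with "positive".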
Weight decay is one of the most widely used forms of regularization in d...
Fine-tuning pretrained language models (LMs) without making any architec...
Word translation without parallel corpora has become feasible, rivaling ...
It has been widely observed that large neural networks can be pruned to ...
Mixup is a data augmentation method that generates new data points by mi...
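As a rough illustration of that mixing step, here is a minimal sketch of the commonly used formulation; the Beta(alpha, alpha) mixing coefficient and the pairing by batch permutation are assumptions rather than details taken from this abstract:

    import numpy as np

    def mixup_batch(x, y, alpha=0.2, rng=np.random.default_rng(0)):
        """Return convex combinations of randomly paired inputs and (one-hot) labels."""
        lam = rng.beta(alpha, alpha)             # mixing coefficient
        perm = rng.permutation(len(x))           # random partner for each example
        x_mixed = lam * x + (1 - lam) * x[perm]  # mix the inputs
        y_mixed = lam * y + (1 - lam) * y[perm]  # mix the labels identically
        return x_mixed, y_mixed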
A recent work by Ramanujan et al. (2020) provides significant empirical ...
It is well known that modern deep neural networks are powerful enough to...
To mitigate communication overheads in distributed model training, sever...
Rapid growth in data sets and the scale of neural network architectures ...
A recent line of ground-breaking results for permutation-based SGD has c...
Distributed model training suffers from communication bottlenecks due to...
Due to its decentralized nature, Federated Learning (FL) lends itself to...
The strong lottery ticket hypothesis (LTH) postulates that one can appro...
Stochastic gradient descent without replacement sampling is widely used ...
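For concreteness, a minimal sketch of one epoch of without-replacement (random-reshuffling) SGD; grad_fn is a hypothetical per-example gradient callback, not something named in this abstract:

    import numpy as np

    def shuffled_sgd_epoch(w, xs, ys, grad_fn, lr=0.1, rng=np.random.default_rng(0)):
        # Visit every example exactly once, in a fresh random order each epoch.
        for i in rng.permutation(len(xs)):
            w = w - lr * grad_fn(w, xs[i], ys[i])
        return w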
Federated learning allows edge devices to collaboratively learn a shared...
To improve the resilience of distributed training to worst-case, or Byza...
Several recent works have aimed to explain why severely overparameterize...
Adversarial training is a technique for training robust machine learning...
Data augmentation (DA) is commonly used during model training, as it sig...
Machine learning (ML) techniques are enjoying rapidly increasing adoptio...
We present ErasureHead, a new approach for distributed gradient descent ...
State-of-the-art machine learning models frequently misclassify inputs t...
Distributed model training suffers from communication overheads due to f...
Distributed implementations of mini-batch stochastic gradient descent (S...
Gradient descent and its many variants, including mini-batch stochastic ...
Distributed model training is vulnerable to worst-case system failures a...
Distributed algorithms are often beset by the straggler effect, where th...
We establish novel generalization bounds for learning algorithms that co...
We present CYCLADES, a general framework for parallelizing stochastic op...
In Bipartite Correlation Clustering (BCC) we are given a complete bipart...
We consider the following multi-component sparse PCA problem: given a se...
We explain theoretically a curious empirical phenomenon: "Approximating ...