Unifying the Dropout Family Through Structured Shrinkage Priors
Dropout regularization of deep neural networks has been a mysterious yet effective tool for preventing overfitting. Explanations for its success range from the prevention of "co-adapted" weights to its interpretation as a form of cheap Bayesian inference. We propose a novel framework for understanding multiplicative noise in neural networks, considering continuous distributions as well as Bernoulli noise (i.e., dropout). We show that multiplicative noise induces structured shrinkage priors on a network's weights. We derive the equivalence through reparametrization properties of scale mixtures, not via any approximation. Given the equivalence, we then show that dropout's usual Monte Carlo training objective approximates marginal MAP estimation. We analyze this MAP objective under strong shrinkage, showing that the expanded parametrization (i.e., likelihood noise) is more stable than a hierarchical representation. Lastly, we derive analogous priors for ResNets, RNNs, and CNNs and reveal their equivalent implementation as noise.
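The link between noise on a layer's inputs and a random scale on its weights can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: it only verifies the algebraic identity (x ⊙ ξ)W = x(diag(ξ)W) for a single linear layer, so that each noise variable ξ_i acts as a shared scale on row i of the weight matrix. The layer sizes, the keep probability, the Gaussian noise scale, and the helper name `outputs_match` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes and noise settings, chosen only for illustration.
n, d_in, d_out = 4, 5, 3
keep_prob = 0.8      # assumed Bernoulli keep probability
gauss_sigma = 0.5    # assumed std. dev. for the continuous-noise case

x = rng.normal(size=(n, d_in))      # batch of inputs
W = rng.normal(size=(d_in, d_out))  # weights of one linear layer


def outputs_match(xi):
    """Compare noise applied to the inputs with noise applied to weight rows."""
    noisy_input_out = (x * xi) @ W             # noise multiplies each input unit
    scaled_weight_out = x @ (xi[:, None] * W)  # same noise scales row i of W
    return np.allclose(noisy_input_out, scaled_weight_out)


# Bernoulli noise (standard inverted dropout): one variable per input unit.
xi_bernoulli = rng.binomial(1, keep_prob, size=d_in) / keep_prob

# Continuous noise: Gaussian multiplicative noise centered at 1.
xi_gaussian = rng.normal(loc=1.0, scale=gauss_sigma, size=d_in)

bernoulli_ok = outputs_match(xi_bernoulli)
gaussian_ok = outputs_match(xi_gaussian)
print(bernoulli_ok, gaussian_ok)
```

Both checks rest on the same reparametrization: the noise can be moved from the activations onto the rows of the weight matrix without changing the layer's output. The sketch does not derive the resulting prior or the marginal MAP objective; it only illustrates why the noise view and the row-wise scale view describe the same computation.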