Training Mixture Models at Scale via Coresets
How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can compute a (1+ε)-approximation for the optimal model on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and on new complexity results for mixtures of Gaussians. As a by-product of our analysis, we prove that the pseudo-dimension of arbitrary mixtures of Gaussians is polynomial in the ambient dimension. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
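To make the coreset idea concrete, the following is a minimal illustrative sketch, not the paper's algorithm: it importance-samples a weighted subset using a simplified sensitivity proxy (squared distance to rough k-means centers mixed with a uniform term) and then fits a Gaussian mixture on the weighted subset with a small weighted EM routine. The function names `build_coreset` and `weighted_gmm_em`, the proxy, and all parameter choices are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp
from sklearn.cluster import MiniBatchKMeans


def build_coreset(X, m, k, rng):
    """Sample a weighted coreset of size m via importance sampling.

    Sensitivity proxy (simplified stand-in for the paper's bounds):
    squared distance to the nearest of k rough centers, mixed with a
    uniform term so every point has nonzero sampling probability.
    """
    n = len(X)
    km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1) ** 2
    s = d / d.sum() if d.sum() > 0 else np.full(n, 1.0 / n)
    p = 0.5 * s + 0.5 / n                      # sampling distribution
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                     # unbiased importance weights
    return X[idx], w


def weighted_gmm_em(X, w, k, n_iter=100, reg=1e-6, seed=0):
    """Plain EM for a Gaussian mixture, with per-point weights w."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]
    cov = np.array([np.cov(X.T) + reg * np.eye(dim) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities, scaled by the coreset weights.
        logp = np.stack([np.log(pi[j]) +
                         multivariate_normal.logpdf(X, mu[j], cov[j])
                         for j in range(k)], axis=1)
        r = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
        r *= w[:, None]
        # M-step: weighted parameter updates.
        nk = r.sum(axis=0)
        pi = nk / nk.sum()
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(dim)
    return pi, mu, cov


def avg_loglik(X, pi, mu, cov):
    """Average log-likelihood of X under the fitted mixture."""
    logp = np.stack([np.log(pi[j]) +
                     multivariate_normal.logpdf(X, mu[j], cov[j])
                     for j in range(len(pi))], axis=1)
    return logsumexp(logp, axis=1).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: three well-separated 2-D Gaussian clusters.
    centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
    X = np.vstack([c + rng.normal(size=(30_000, 2)) for c in centers])

    C, w = build_coreset(X, m=500, k=3, rng=rng)     # 500 points vs 90,000
    pi, mu, cov = weighted_gmm_em(C, w, k=3)
    print("avg log-likelihood on full data:", avg_loglik(X, pi, mu, cov))
```

In this sketch the mixture is trained only on the 500-point weighted coreset, yet its average log-likelihood is evaluated on the full data set; the weights compensate for the non-uniform sampling so the coreset objective approximates the full-data objective.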