Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal
Given a collection of n points in ℝ^d, the goal of the (k,z)-clustering problem is to find a subset of k "centers" that minimizes the sum of the z-th powers of the Euclidean distance of each point to the closest center. Special cases of the (k,z)-clustering problem include the k-median and k-means problems. Our main result is a unified two-stage importance sampling framework that constructs an ε-coreset for the (k,z)-clustering problem. Compared to the results for (k,z)-clustering in [Feldman and Langberg, STOC 2011], our framework saves an ε^2 · d factor in the coreset size. Compared to the results for (k,z)-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a poly(k) factor in the coreset size and avoids the exp(k/ε) term in the construction time. Specifically, our coreset for k-median (z=1) has size Õ(ε^{-4} k) which, when compared to the result in [Sohler and Woodruff, FOCS 2018], saves a k factor in the coreset size. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape fitting problems: subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of Ω(k · min{2^{z/20}, d}) for a 0.01-coreset for (k,z)-clustering, which has a linear dependence of size on k and an exponential dependence on z that matches our algorithmic results.
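The sketch below is a minimal illustration of the (k,z)-clustering objective and of generic sensitivity-style importance sampling with inverse-probability reweighting; it is not the paper's two-stage framework or its dimensionality reduction step. The helper names (kz_cost, importance_sampling_coreset) and the assumption that a rough set of centers (e.g., from a bicriteria approximation) is already available are ours, introduced only for illustration.

```python
# Illustrative sketch only: generic importance sampling for (k, z)-clustering,
# NOT the paper's two-stage construction. Assumes a rough solution is given.
import numpy as np

def kz_cost(points, centers, z, weights=None):
    """Sum of z-th powers of distances to the nearest center.
    `weights` lets the same cost be evaluated on a weighted coreset."""
    if weights is None:
        weights = np.ones(len(points))
    # distance of every point to its closest center: (n, k, d) -> (n,)
    dists = np.min(np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1)
    return float(np.sum(weights * dists ** z))

def importance_sampling_coreset(points, rough_centers, z, m, rng=None):
    """Sample m points with probability proportional to a crude per-point
    cost contribution (a stand-in for a sensitivity upper bound), then
    reweight by inverse probability so the coreset cost is unbiased."""
    rng = np.random.default_rng(rng)
    dists = np.min(np.linalg.norm(points[:, None, :] - rough_centers[None, :, :], axis=2), axis=1)
    contrib = dists ** z + 1e-12              # per-point contribution to the rough cost
    probs = contrib / contrib.sum()           # crude importance distribution
    idx = rng.choice(len(points), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])          # inverse-probability weights
    return points[idx], weights

# Example usage: compare the cost of a candidate solution on the full data
# and on the weighted coreset (hypothetical data; not from the paper).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    rough = X[rng.choice(len(X), size=10, replace=False)]   # rough centers
    S, w = importance_sampling_coreset(X, rough, z=1, m=500, rng=1)
    candidate = X[rng.choice(len(X), size=10, replace=False)]
    print(kz_cost(X, candidate, z=1), kz_cost(S, candidate, z=1, weights=w))
```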