Faster Projective Clustering Approximation of Big Data
In projective clustering we are given a set of n points in R^d and wish to cluster them to a set S of k linear subspaces in R^d according to some given distance function. An -coreset for this problem is a weighted (scaled) subset of the input points such that for every such possible S the sum of these distances is approximated up to a factor of (1+). We suggest to reduce the size of existing coresets by suggesting the first O(log(m)) approximation for the case of m lines clustering in O(ndm) time, compared to the existing (m) solution. We then project the points on these lines and prove that for a sufficiently large m we obtain a coreset for projective clustering. Our algorithm also generalize to handle outliers. Experimental results and open code are also provided.
READ FULL TEXT