On Efficient Low Distortion Ultrametric Embedding

08/15/2020
by   Vincent Cohen-Addad, et al.
0

A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, these methods on a data set of n points in Ω(log n) dimensions exhibit a quite prohibitive running time of Θ(n^2). In this paper, we provide a new algorithm which takes as input a set of points P in ℝ^d, and for every c≥ 1, runs in time n^1+ρ/c^2 (for some universal constant ρ>1) to output an ultrametric Δ such that for any two points u,v in P, we have Δ(u,v) is within a multiplicative factor of 5c to the distance between u and v in the "best" ultrametric representation of P. Here, the best ultrametric is the ultrametric Δ̃ that minimizes the maximum distance distortion with respect to the ℓ_2 distance, namely that minimizes u,v ∈ Pmax Δ̃(u,v)/u-v_2. We complement the above result by showing that under popular complexity theoretic assumptions, for every constant ε>0, no algorithm with running time n^2-ε can distinguish between inputs in ℓ_∞-metric that admit isometric embedding and those that incur a distortion of 3/2. Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset