CDF Transform-Shift: An effective way to deal with inhomogeneous density datasets

10/05/2018
by Ye Zhu, et al.

Many distance-based algorithms exhibit bias towards dense clusters in inhomogeneous datasets (i.e., those which contain clusters in both dense and sparse regions of the space). For example, density-based clustering algorithms tend to merge neighbouring dense clusters into a single group when a sparse cluster is also present, while distance-based anomaly detectors have difficulty detecting local anomalies that lie close to a dense cluster in datasets that also contain sparse clusters. In this paper, we propose the CDF Transform-Shift (CDF-TS) algorithm, which is based on a multi-dimensional Cumulative Distribution Function (CDF) transformation. It effectively converts a dataset with clusters of inhomogeneous density into one with clusters of homogeneous density, i.e., the data distribution is converted to one in which all locally low/high-density locations become globally low/high-density locations. Thus, after performing the proposed Transform-Shift, a single global density threshold can be used to separate the data into clusters and their surrounding noise points. Our empirical evaluations show that CDF-TS overcomes the shortcomings of existing density-based clustering and distance-based anomaly detection algorithms and significantly improves their performance.
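
To make the idea of a density-equalising CDF transformation more concrete, here is a minimal one-dimensional sketch. It is not the CDF-TS algorithm itself: the abstract describes a multi-dimensional transformation whose details are not given here, whereas this example applies a plain global empirical CDF (rank) transform to synthetic data. The function names `empirical_cdf_transform` and `mean_gap`, the cluster parameters, and the use of mean point spacing as a density proxy are all assumptions made for illustration. Note also that a global transform like this flattens every density difference, while the abstract describes a transformation that keeps locally low- and high-density locations distinguishable.

```python
import numpy as np


def empirical_cdf_transform(x):
    """Map each 1-D value to its empirical CDF value, rank / n.

    Illustrative helper (an assumption, not from the paper).
    Ties, if any, are broken arbitrarily by the sort order.
    """
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))  # rank of each point, 0 .. n-1
    return (ranks + 1) / len(x)


def mean_gap(values):
    """Mean spacing between consecutive sorted points (a crude inverse-density proxy)."""
    return float(np.diff(np.sort(values)).mean())


# Synthetic data (assumed for illustration): one dense and one sparse cluster.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=200)
sparse = rng.normal(10.0, 2.0, size=200)
data = np.concatenate([dense, sparse])

transformed = empirical_cdf_transform(data)
t_dense, t_sparse = transformed[:200], transformed[200:]

# Ratio of point spacing in the sparse cluster to that in the dense cluster.
gap_ratio_before = mean_gap(sparse) / mean_gap(dense)
gap_ratio_after = mean_gap(t_sparse) / mean_gap(t_dense)

print(f"spacing ratio (sparse / dense) before: {gap_ratio_before:.2f}")
print(f"spacing ratio (sparse / dense) after:  {gap_ratio_after:.2f}")
```

Before the transform, points in the sparse cluster are spaced much further apart than points in the dense cluster, so no single distance or density threshold suits both. After the transform the two clusters have the same spacing, which is the property that lets a single global threshold apply to the whole dataset.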
