Sparx: Distributed Outlier Detection at Scale
There is no shortage of outlier detection (OD) algorithms in the literature, yet a vast body of them are designed for a single machine. As more and more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billions of points and millions of features, we show that existing open-source solutions fail to scale up, either with a large number of points or with high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.