Representative Selection for Big Data via Sparse Graph and Geodesic Grassmann Manifold Distance
This paper addresses the problem of identifying a very small subset of data points that belong to a significantly larger massive dataset (i.e., Big Data). The small number of selected data points must adequately represent and faithfully characterize the massive Big Data. Such identification process is known as representative selection [19]. We propose a novel representative selection framework by generating an l1 norm sparse graph for a given Big-Data dataset. The Big Data is partitioned recursively into clusters using a spectral clustering algorithm on the generated sparse graph. We consider each cluster as one point in a Grassmann manifold, and measure the geodesic distance among these points. The distances are further analyzed using a min-max algorithm [1] to extract an optimal subset of clusters. Finally, by considering a sparse subgraph of each selected cluster, we detect a representative using principal component centrality [11]. We refer to the proposed representative selection framework as a Sparse Graph and Grassmann Manifold (SGGM) based approach. To validate the proposed SGGM framework, we apply it onto the problem of video summarization where only few video frames, known as key frames, are selected among a much longer video sequence. A comparison of the results obtained by the proposed algorithm with the ground truth, which is agreed by multiple human judges, and with some state-of-the-art methods clearly indicates the viability of the SGGM framework.
READ FULL TEXT