Estimation of Squared-Loss Mutual Information from Positive and Unlabeled Data
Capturing input-output dependency is an important task in statistical data analysis. Mutual information (MI) is a vital tool for this purpose, but it is known to be sensitive to outliers. To cope with this problem, a squared-loss variant of MI (SMI) was proposed, and a supervised estimator of it has been developed. In real-world classification problems, however, it is conceivable that only positive and unlabeled (PU) data are available. In this paper, we propose a novel estimator of SMI from PU data alone, and prove that it converges to the true SMI at the optimal rate. Based on the PU-SMI estimator, we further propose a dimension reduction method that can be executed without estimating the class-prior probabilities of unlabeled data. Such PU class-prior estimation is often required in PU classification algorithms, but it is unreliable, particularly in high-dimensional problems, and yields a biased classifier. Our dimension reduction method significantly boosts the accuracy of PU class-prior estimation, as demonstrated through experiments. We also develop a method of independence testing based on our PU-SMI estimator and experimentally show its superiority.
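To make the quantity being estimated concrete, the following is a minimal sketch that computes SMI for a small discrete joint distribution. It assumes the standard definition from the SMI literature, which the abstract itself does not state: SMI is half the Pearson divergence between p(x, y) and p(x)p(y). The function name `smi_discrete` is hypothetical, and this is not the PU-SMI estimator proposed in the paper, which works from samples rather than from a known joint distribution.

```python
def smi_discrete(joint):
    """SMI of a discrete joint distribution given as a nested list.

    Assumed definition (not stated in the abstract):
        SMI = 1/2 * sum_{x,y} p(x) p(y) * (p(x,y) / (p(x) p(y)) - 1)^2
    """
    # Normalize so the table sums to one.
    total = sum(sum(row) for row in joint)
    p = [[v / total for v in row] for row in joint]

    # Marginals: rows index x, columns index y.
    px = [sum(row) for row in p]
    py = [sum(col) for col in zip(*p)]

    smi = 0.0
    for i, row in enumerate(p):
        for j, pxy in enumerate(row):
            prod = px[i] * py[j]
            # Cells with zero marginal mass contribute nothing.
            if prod > 0.0:
                smi += prod * (pxy / prod - 1.0) ** 2
    return 0.5 * smi


independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(smi_discrete(independent))  # 0.0
print(smi_discrete(dependent))    # 0.5
```

As with MI, the value is zero exactly when the two variables are independent. The density ratio enters through a square rather than a logarithm, which is the property the abstract relies on for robustness to outliers.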