The Chi-Square Test of Distance Correlation
Distance correlation has gained much recent attention in the statistics and machine learning communities: the sample statistic is straightforward to compute, works for any metric or kernel choice, and asymptotically equals 0 if and only if the variables are independent. One major bottleneck is the testing process: the null distribution of distance correlation depends on the metric choice and the marginal distributions, which cannot be readily estimated. To compute a p-value, the standard approach is to estimate the null distribution via permutation, which generally requires O(rn^2) time for n samples and r permutations and is too costly for big-data applications. In this paper, we propose a chi-square distribution to approximate the null distribution of the unbiased distance correlation. We prove that the chi-square distribution either equals or well-approximates the null distribution, and always upper-tail dominates it. The resulting distance correlation chi-square test requires neither permutation nor parameter estimation, is simple and fast to implement, works with any strong negative type metric or characteristic kernel, is valid and universally consistent for independence testing, and enjoys finite-sample testing power similar to that of the standard permutation test. When testing one-dimensional data with the Euclidean distance, the unbiased distance correlation test runs in O(n log n) time, rendering it comparable in speed to the Pearson correlation t-test. The results are supported and demonstrated via simulations.
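To make the procedure concrete, the sketch below shows how such a test might be organized: compute the unbiased (U-centered) sample distance correlation and compare the scaled statistic against a chi-square reference distribution. This is a minimal illustration, not the paper's reference implementation; in particular, the use of a chi2(1) - 1 reference for n times the statistic is an assumption about the exact form of the approximation, and the paper should be consulted for the precise result and its conditions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import chi2


def _u_center(D):
    """U-center a pairwise distance matrix (unbiased estimator centering)."""
    n = D.shape[0]
    row = D.sum(axis=0)
    U = (D
         - row[None, :] / (n - 2)
         - row[:, None] / (n - 2)
         + D.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(U, 0.0)  # diagonal is excluded from the U-statistic
    return U


def unbiased_dcorr(x, y):
    """Unbiased sample distance correlation using Euclidean distances."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = x.shape[0]
    A = _u_center(cdist(x, x))
    B = _u_center(cdist(y, y))
    dcov_xy = (A * B).sum() / (n * (n - 3))
    dvar_x = (A * A).sum() / (n * (n - 3))
    dvar_y = (B * B).sum() / (n * (n - 3))
    return dcov_xy / np.sqrt(dvar_x * dvar_y)


def chi_square_dcorr_test(x, y):
    """Chi-square approximation to the permutation null (assumed form:
    n * dcorr is referred to a chi2(1) - 1 distribution)."""
    n = len(x)
    stat = unbiased_dcorr(x, y)
    pval = chi2.sf(n * stat + 1, df=1)  # P(chi2(1) - 1 > n * stat)
    return stat, pval


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = x ** 2 + rng.normal(scale=0.5, size=200)  # nonlinear dependence
    stat, pval = chi_square_dcorr_test(x, y)
    print(f"unbiased dcorr = {stat:.4f}, p-value = {pval:.4g}")
```

Note that this quadratic-time sketch computes the full n-by-n distance matrices; the O(n log n) complexity quoted above for one-dimensional Euclidean data relies on a faster algorithm for the unbiased statistic that is not reproduced here.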