Sequential algorithms for testing identity and closeness of distributions
What advantage do sequential procedures provide over batch algorithms for testing properties of unknown distributions? Focusing on the problem of testing whether two distributions 𝒟_1 and 𝒟_2 on {1,…, n} are equal or ϵ-far, we give several answers to this question. We show that for a small alphabet size n, there is a sequential algorithm that outperforms any batch algorithm by a factor of at least 4 in terms of sample complexity. For a general alphabet size n, we give a sequential algorithm that uses no more samples than its batch counterpart, and possibly fewer if the actual distance TV(𝒟_1, 𝒟_2) between 𝒟_1 and 𝒟_2 is larger than ϵ. As a corollary, letting ϵ go to 0, we obtain a sequential algorithm for testing closeness when no a priori bound on TV(𝒟_1, 𝒟_2) is given that has a sample complexity 𝒪̃(n^{2/3}/TV(𝒟_1, 𝒟_2)^{4/3}): this improves over the 𝒪̃((n/log n)/TV(𝒟_1, 𝒟_2)^2) tester of <cit.> and is optimal up to multiplicative constants. We also establish limitations of sequential algorithms for the problem of testing identity and closeness: they can improve the worst-case number of samples by at most a constant factor.
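To make the batch versus sequential distinction concrete, here is a minimal Python sketch of a sequential closeness test that draws one sample from each distribution per round and stops as soon as the data allow a decision. It is not the algorithm of the paper: the stopping rule is a deliberately simple one built from Hoeffding's inequality and a union bound over subsets and over time, so it does not attain the rates quoted above. The function names, the choice of confidence radius, and the example distributions are all assumptions made for illustration. What the sketch does show is the qualitative behaviour described in the abstract: when the true distance is well above ϵ, the test stops much earlier than when the distributions are equal.

```python
import math
import random


def tv_distance(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def confidence_radius(t, n, delta):
    """Anytime-valid radius r_t (an assumption of this sketch, not the paper's).

    Hoeffding plus a union bound over the 2^n subsets of the alphabet gives
    P(TV(empirical, true) > r) <= 2^n * exp(-2 t r^2). Splitting delta over
    the two distributions and over time with weights 1/(t(t+1)) makes the
    bound hold simultaneously for all t with probability >= 1 - delta.
    """
    delta_t = delta / (2 * t * (t + 1))
    return math.sqrt((n * math.log(2) + math.log(1 / delta_t)) / (2 * t))


def sequential_closeness_test(sample1, sample2, n, eps, delta=0.05,
                              max_samples=10**6):
    """Toy sequential test of 'equal' versus 'eps-far' in total variation.

    sample1 and sample2 are zero-argument callables returning a symbol in
    range(n). Returns (verdict, number of samples drawn per distribution).
    """
    counts1 = [0] * n
    counts2 = [0] * n
    t = 0
    for t in range(1, max_samples + 1):
        counts1[sample1()] += 1
        counts2[sample2()] += 1
        emp_tv = 0.5 * sum(abs(a - b) for a, b in zip(counts1, counts2)) / t
        r = confidence_radius(t, n, delta)
        if emp_tv > 2 * r:
            # The true distance is positive with high probability.
            return "far", t
        if emp_tv + 2 * r < eps:
            # The true distance is below eps with high probability.
            return "equal", t
    return "undecided", t


def make_sampler(probs, rng):
    """Hypothetical helper: sampler for a fixed probability vector."""
    symbols = range(len(probs))
    return lambda: rng.choices(symbols, weights=probs)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = [1 / 3, 1 / 3, 1 / 3]
    skewed = [5 / 6, 1 / 12, 1 / 12]  # TV distance 0.5 from uniform
    print(sequential_closeness_test(make_sampler(uniform, rng),
                                    make_sampler(uniform, rng), n=3, eps=0.2))
    print(sequential_closeness_test(make_sampler(uniform, rng),
                                    make_sampler(skewed, rng), n=3, eps=0.2))
```

A batch tester with the same guarantee would have to fix its sample size in advance from ϵ alone, whereas the loop above adapts its stopping time to the observed empirical distance.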