Near-Optimal Model Discrimination with Non-Disclosure

12/04/2020
by Dmitrii M. Ostrovskii, et al.

Let θ_0, θ_1 ∈ ℝ^d be the population risk minimizers associated with some loss ℓ: ℝ^d × 𝒵 → ℝ and two distributions ℙ_0, ℙ_1 on 𝒵. We pose the following question: given i.i.d. samples from ℙ_0 and ℙ_1, what sample sizes are sufficient and necessary to distinguish between the two hypotheses θ^* = θ_0 and θ^* = θ_1 for a given θ^* ∈ {θ_0, θ_1}? Taking the first steps towards answering this question in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity, showing it to be min{1/Δ^2, √(r)/Δ} up to a constant factor, where Δ is a measure of separation between ℙ_0 and ℙ_1 and r is the rank of the design covariance matrix. This bound is dimension-independent, and rank-independent for large enough separation. We then extend this result in two directions: (i) to the general parametric setup in the asymptotic regime; (ii) to generalized linear models in the small-sample regime n ≤ r and under weak moment assumptions. In both cases, we derive sample complexity bounds of a similar form, even under misspecification. Our testing procedures access θ^* only through a certain functional of the empirical risk. Moreover, the number of observations that allows us to reach statistical confidence in our tests does not allow us to "resolve" the two models, that is, to recover θ_0, θ_1 up to O(Δ) prediction accuracy. These two properties allow our framework to be applied in tasks where one would like to identify a prediction model, which can be proprietary, while guaranteeing that the model cannot actually be inferred by the identifying agent.
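To make the setup concrete, here is a minimal simulation sketch of model discrimination for a well-specified linear model with squared loss. It is not the paper's actual testing procedure: the decision rule below (accept whichever candidate has the lower empirical risk on the observed sample) and all numerical values are illustrative assumptions chosen for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 20, 200                      # dimension and sample size (arbitrary choices)
theta_0 = np.zeros(d)
theta_1 = np.ones(d) / np.sqrt(d)   # candidates at unit Euclidean separation

def sample(theta, n):
    """Draw (X, y) from a well-specified linear model y = <theta, x> + noise."""
    X = rng.standard_normal((n, d))
    y = X @ theta + rng.standard_normal(n)
    return X, y

def empirical_risk(theta, X, y):
    """Average squared loss of theta on the sample."""
    return np.mean((y - X @ theta) ** 2)

# Ground truth is theta_0; a sound test should accept hypothesis 0.
X, y = sample(theta_0, n)
accept = 0 if empirical_risk(theta_0, X, y) <= empirical_risk(theta_1, X, y) else 1
print(accept)  # 0 with high probability at this sample size
```

Note that this naive risk-comparison rule requires evaluating the loss at both candidate parameters; the paper's procedures instead access θ^* only through a functional of the empirical risk, which is what enables the non-disclosure guarantee.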
