Evaluating Probabilistic Inference in Deep Learning: Beyond Marginal Predictions
A fundamental challenge for any intelligent system is prediction: given inputs X_1, …, X_τ, can you predict the outcomes Y_1, …, Y_τ? The KL divergence 𝐝_KL provides a natural measure of prediction quality, but most deep learning research evaluates only the marginal prediction at each input X_t. In this technical report we propose a scoring rule 𝐝_KL^τ, parameterized by τ ∈ ℕ, that evaluates the joint prediction over τ inputs simultaneously. We show that the commonly used τ = 1 can be insufficient to drive good decisions in many settings of interest. We also show that, as τ grows, performing well according to 𝐝_KL^τ recovers universal guarantees for any possible decision. Finally, we provide problem-dependent guidance on the scale of τ at which our score provides sufficient guarantees for good performance.
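To make the idea concrete, the sketch below shows one plausible way to estimate a score of this kind for discrete outcomes, assuming 𝐝_KL^τ takes the form of an expected KL divergence between the environment's joint conditional distribution of (Y_1, …, Y_τ) given (X_1, …, X_τ) and an agent's joint predictive distribution. This is not the authors' code; the callables `env_joint_pmf` and `agent_joint_pmf` are hypothetical stand-ins for the true and predicted joint probability mass functions.

```python
# Minimal sketch of a joint-prediction KL score in the spirit of d_KL^tau,
# for discrete outcomes with a small number of classes and small tau.
import itertools
import numpy as np

def joint_kl_score(env_joint_pmf, agent_joint_pmf, input_batches, num_classes, tau):
    """Average KL(environment joint || agent joint) over batches of tau inputs.

    env_joint_pmf(x_batch, y_seq)   -> true joint probability of the outcome tuple y_seq
    agent_joint_pmf(x_batch, y_seq) -> agent's joint predictive probability of y_seq
    input_batches: iterable of batches, each containing tau inputs.
    """
    outcomes = list(itertools.product(range(num_classes), repeat=tau))
    scores = []
    for x_batch in input_batches:
        kl = 0.0
        for y_seq in outcomes:
            p = env_joint_pmf(x_batch, y_seq)
            q = agent_joint_pmf(x_batch, y_seq)
            if p > 0.0:
                kl += p * (np.log(p) - np.log(q))
        scores.append(kl)
    return float(np.mean(scores))
```

Note that exhaustive enumeration of the num_classes^τ outcome tuples is only feasible for small τ; for larger τ one would typically fall back on Monte Carlo estimates of the joint log-loss.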