Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon

07/16/2020
by Seth Pettie, et al.

In this paper we study the intrinsic tradeoff between the space complexity of a cardinality sketch and its estimation error in the Random Oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (𝖥𝗂𝗌𝗁) number ℋ/ℐ. It captures the tension between the limiting Shannon entropy (ℋ) of the sketch and its normalized Fisher information (ℐ), which characterizes (asymptotically) the variance of a statistically efficient estimator. We prove that many variants of the 𝖯𝖢𝖲𝖠 sketch of Flajolet and Martin have 𝖥𝗂𝗌𝗁 number H_0/I_0, where H_0 and I_0 are two precisely defined constants, and that all base-q generalizations of (Hyper)LogLog are strictly worse than H_0/I_0 but tend to H_0/I_0 in the limit as q→∞. All other known sketches have even worse 𝖥𝗂𝗌𝗁 numbers. We introduce a new sketch called 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋 that is based on a smoothed, compressed version of 𝖯𝖢𝖲𝖠 with a different estimation function. 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋 has 𝖥𝗂𝗌𝗁 number H_0/I_0 ≈ 1.98: it stores O(log^2 log U) + (H_0/I_0)b ≈ 1.98b bits and estimates the cardinality of a multiset of [U] with a standard error of (1+o(1))/√b. This space-error tradeoff improves on state-of-the-art sketches such as HyperLogLog, and even on compressed representations of HyperLogLog. 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋 can be used in a distributed environment, where substreams are sketched separately and their sketches composed later. We conjecture that the 𝖥𝗂𝗌𝗁 number H_0/I_0 is a universal lower bound for any such composable sketch.
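To make the object of study concrete, the following is a minimal Python sketch of the classic 𝖯𝖢𝖲𝖠 structure of Flajolet and Martin that the abstract builds on; it is not the paper's 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋 sketch. The class name, the bitmap count m, the bias-correction constant, and the use of blake2b as a stand-in for the Random Oracle are illustrative choices, not taken from the paper.

```python
# A minimal PCSA (Flajolet-Martin) cardinality sketch, for illustration only.
# Each of m bitmaps records which geometric bit positions have been hit;
# the estimator uses the average index of the lowest unset bit per bitmap.

import hashlib


class PCSA:
    PHI = 0.77351  # Flajolet-Martin bias-correction constant

    def __init__(self, m: int = 64):
        self.m = m              # number of bitmaps (sub-sketches)
        self.bitmaps = [0] * m  # each int is used as a bit array

    def _hash(self, item) -> int:
        # 64-bit hash standing in for the Random Oracle
        return int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big"
        )

    def add(self, item) -> None:
        h = self._hash(item)
        idx = h % self.m             # pick a bitmap
        w = h // self.m
        # index of the lowest set bit of w (geometrically distributed)
        r = (w & -w).bit_length() - 1 if w else 63
        self.bitmaps[idx] |= 1 << r

    def merge(self, other: "PCSA") -> None:
        # Composable: sketches of substreams combine by bitwise OR
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self) -> float:
        # R_j = index of the lowest zero bit in bitmap j
        total = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):
                r += 1
            total += r
        return (self.m / self.PHI) * 2 ** (total / self.m)


if __name__ == "__main__":
    s = PCSA(m=256)
    for i in range(100_000):
        s.add(i)
    print(round(s.estimate()))  # roughly 100000, with about 5% standard error
```

The merge method shows the composability property the closing conjecture concerns: sketches of disjoint substreams combine by bitwise OR into the same sketch the full stream would produce. The paper's contribution is to measure such sketches per bit of (compressed) storage rather than per register, which is what the 𝖥𝗂𝗌𝗁 number ℋ/ℐ quantifies.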
