Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon
In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error in the Random Oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (𝖥𝗂𝗌𝗁) number ℋ/ℐ. It captures the tension between the limiting Shannon entropy (ℋ) of the sketch and its normalized Fisher information (ℐ), which characterizes (asymptotically) the variance of a statistically efficient estimator. We prove that many variants of the 𝖯𝖢𝖲𝖠 sketch of Flajolet and Martin have 𝖥𝗂𝗌𝗁 number H_0/I_0, where H_0,I_0 are two precisely-defined constants, and that all base-q generalizations of (Hyper)LogLog are strictly worse than H_0/I_0, but tend to H_0/I_0 in the limit as q→∞. All other known sketches have even worse 𝖥𝗂𝗌𝗁-numbers. We introduce a new sketch called 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋 that is based on a smoothed, compressed version of 𝖯𝖢𝖲𝖠 with a different estimation function. 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋 has 𝖥𝗂𝗌𝗁 number H_0/I_0 ≈ 1.98. It stores O(log^2log U) + (H_0/I_0)b ≈ 1.98b bits, and estimates cardinalities of multisets of [U] with a standard error of (1+o(1))/√(b). 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋's space-error tradeoff improves on state-of-the-art sketches like HyperLogLog, or even compressed representations of it. 𝖥𝗂𝗌𝗁𝗆𝗈𝗇𝗀𝖾𝗋 can be used in a distributed environment (where substreams are sketched separately and composed later). We conjecture that the 𝖥𝗂𝗌𝗁-number H_0/I_0 is a universal lower bound for any such composable sketch.
READ FULL TEXT