HyperMinHash: MinHash in LogLog space
In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard n-space MinHash by building off of a HyperLogLog count-distinct sketch. For a multiplicative approximation error 1+ ϵ on a Jaccard index t , given a random oracle, HyperMinHash needs O(ϵ^-2( n + 1/ t ϵ)) space. Unlike comparable Jaccard index fingerprinting algorithms (such as b-bit MinHash, which uses less space), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. Our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 10% using 64KiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption. Note that we will operate in the unbounded data stream model and assume both a random oracle and shared randomness.
READ FULL TEXT