Covering the Relational Join
In this paper, we initiate a theoretical study of what we call the join covering problem. We are given a natural join query instance Q on n attributes and m relations (R_i)_i ∈ [m]. Let J_Q = _i=1^m R_i denote the join output of Q. In addition to Q, we are given a parameter Δ: 1≤Δ≤ n and our goal is to compute the smallest subset 𝒯_Q, Δ⊆ J_Q such that every tuple in J_Q is within Hamming distance Δ - 1 from some tuple in 𝒯_Q, Δ. The join covering problem captures both computing the natural join from database theory and constructing a covering code with covering radius Δ - 1 from coding theory, as special cases. We consider the combinatorial version of the join covering problem, where our goal is to determine the worst-case |𝒯_Q, Δ| in terms of the structure of Q and value of Δ. One obvious approach to upper bound |𝒯_Q, Δ| is to exploit a distance property (of Hamming distance) from coding theory and combine it with the worst-case bounds on output size of natural joins (AGM bound hereon) due to Atserias, Grohe and Marx [SIAM J. of Computing'13]. Somewhat surprisingly, this approach is not tight even for the case when the input relations have arity at most two. Instead, we show that using the polymatroid degree-based bound of Abo Khamis, Ngo and Suciu [PODS'17] in place of the AGM bound gives us a tight bound (up to constant factors) on the |𝒯_Q, Δ| for the arity two case. We prove lower bounds for |𝒯_Q, Δ| using well-known classes of error-correcting codes e.g, Reed-Solomon codes. We can extend our results for the arity two case to general arity with a polynomial gap between our upper and lower bounds.
READ FULL TEXT