PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming
Data cleaning can be naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered and corrupted to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis. We show empirically that short (< 50-line) PClean programs can be faster and more accurate than generic PPL inference on multiple data-cleaning benchmarks; perform comparably in terms of accuracy and runtime to state-of-the-art data-cleaning systems (unlike generic PPL inference given the same runtime); and scale to real-world datasets with millions of records.
READ FULL TEXT