Moving towards practical user-friendly synthesis: Scalable synthetic data methods for large confidential administrative databases using saturated count models
Over the past three decades, synthetic data methods for statistical disclosure control have developed continually; they have adapted to account for different data types, but mainly within the domain of survey data sets. Certain characteristics of administrative databases, sometimes just the sheer volume of records they comprise, present challenges from a synthesis perspective and thus require special attention. Through the fitting of saturated models, this paper presents a way in which administrative databases can not only be synthesised quickly, but in which risk and utility can also be formalised in a manner that is inherently infeasible with other techniques. The paper explores how the flexibility afforded by two-parameter count models (the negative binomial and Poisson-inverse Gaussian) can be utilised to protect the privacy of respondents, especially uniques, in synthetic data. Finally, an empirical example is carried out through the synthesis of a database that can be viewed as a good representative of the English School Census.
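As a rough illustration of the idea, the following minimal Python sketch synthesises a table of cell counts under a saturated model, in which each cell's fitted mean is simply its observed count, by drawing from a negative binomial distribution. It is a sketch under stated assumptions rather than the paper's implementation: the function names, the example table, the parameterisation (variance mu + alpha * mu^2, obtained as a gamma-Poisson mixture) and the choice to leave zero cells at zero are all illustrative. A larger alpha adds more noise to each cell, including cells with a count of one (uniques).

```python
import random

def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by counting unit-rate exponential arrivals in [0, lam]."""
    if lam <= 0:
        return 0
    k = 0
    t = rng.expovariate(1.0)
    while t < lam:
        k += 1
        t += rng.expovariate(1.0)
    return k

def negative_binomial_draw(mu, alpha, rng):
    """Draw a count with mean mu and variance mu + alpha * mu**2 (assumed parameterisation)."""
    if mu <= 0:
        return 0  # assumption: a cell with fitted mean zero stays zero
    # Gamma-Poisson mixture: the rate has mean mu and variance alpha * mu**2.
    lam = rng.gammavariate(1.0 / alpha, alpha * mu)
    return poisson_draw(lam, rng)

def synthesise_counts(observed, alpha, seed=None):
    """Return synthetic cell counts; under a saturated model the fitted mean is the observed count."""
    rng = random.Random(seed)
    return {cell: negative_binomial_draw(count, alpha, rng)
            for cell, count in observed.items()}

# Hypothetical contingency table keyed by (school type, year group).
observed = {("primary", "Y1"): 120, ("primary", "Y2"): 1, ("secondary", "Y7"): 0}
synthetic = synthesise_counts(observed, alpha=0.5, seed=1)
print(synthetic)
```

Because each cell is drawn independently from its own count, the cost grows with the number of cells, which is one way to read the abstract's claim about speed.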