High-Performance Statistical Computing in the Computing Environments of the 2020s
Technological advances in the past decade, hardware and software alike, have made access to high-performance computing (HPC) easier than ever. We review these advances from a statistical computing perspective. Cloud computing makes access to supercomputers affordable. Deep learning software libraries make programming statistical algorithms easy, and enable users to write code once and run it anywhere, from a laptop to a workstation with multiple graphics processing units (GPUs) or a supercomputer in a cloud. To encourage statisticians to take advantage of these developments, we review recent optimization algorithms that are useful for high-dimensional models and can harness the power of HPC. Code snippets are provided to demonstrate the ease of programming. We also provide an easy-to-use distributed matrix data structure suitable for HPC. Employing this data structure, we illustrate various statistical applications including large-scale nonnegative matrix factorization, positron emission tomography, multidimensional scaling, and ℓ_1-regularized Cox regression. Our examples easily scale up to an 8-GPU workstation and a 720-CPU-core cluster in a cloud. As a case in point, we analyze the onset of type-2 diabetes from the UK Biobank with 200,000 subjects and about 500,000 single nucleotide polymorphisms using the HPC ℓ_1-regularized Cox regression. Fitting a half-million-variate model takes less than 45 minutes and reconfirms known associations. To our knowledge, this is the first demonstration of the feasibility of joint genome-wide association analysis of survival outcomes at this scale.
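To give a flavor of the kind of algorithm the abstract refers to, the following is a minimal sketch of proximal gradient descent for an ℓ_1-penalized objective. It is an illustration under stated assumptions, not the authors' implementation: it swaps the Cox partial likelihood for a plain least-squares loss to stay short, it uses pure Python lists rather than a GPU tensor library, and the names `soft_threshold` and `ista_lasso` are hypothetical. The step size is assumed to be at most the reciprocal of the largest eigenvalue of X^T X.

```python
def soft_threshold(x, t):
    """Proximal map of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def ista_lasso(X, y, lam, step, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient.

    Assumption (not from the source): least-squares loss stands in for
    the Cox partial likelihood, and `step` <= 1 / lambda_max(X^T X).
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        # Residual r = X b - y
        r = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(n)]
        # Gradient of the smooth part: X^T r
        g = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        # Gradient step on the smooth part, then soft-thresholding
        b = [soft_threshold(b[j] - step * g[j], step * lam) for j in range(p)]
    return b


# Toy problem with an identity design matrix, where the minimizer is
# simply the soft-thresholded response vector.
X_toy = [[1.0, 0.0], [0.0, 1.0]]
y_toy = [3.0, 0.5]
b_hat = ista_lasso(X_toy, y_toy, lam=1.0, step=1.0)
```

On this toy problem the first coefficient is shrunk from 3 to 2 and the second is set exactly to zero, which is the sparsity-inducing behavior that makes ℓ_1 penalties attractive for models with very many variables. Each iteration consists only of matrix-vector products and an elementwise map, which is why updates of this form tend to parallelize well on GPUs and clusters.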