Context binning, model clustering and adaptivity for data compression of genetic data

01/13/2022
by Jarek Duda, et al.

Rapid growth of genetic databases means that improvements in their data compression can yield large savings, which requires better inexpensive statistical models. This article proposes automated optimization of, for example, Markov-like models, focusing on context binning and model clustering. While it is common to reduce a context by cutting its low bits, the proposed context binning optimizes this reduction as a lookup table: state = bin[context], where the state determines the probability distribution. In this way nearly all useful information can be extracted, even from very large contexts, into a small number of states. Model clustering applies k-means clustering in the space of general statistical models, making it possible to optimize a few models (as cluster centroids), one of which is then chosen, e.g., separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed. This article is a work in progress, to be expanded in the future.
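
To make the tabled reduction state = bin[context] concrete, here is a minimal Python sketch. It is not the optimization procedure from the article, which the abstract does not spell out. As a stand-in it uses a simple k-means-style loop: each context is assigned to the state whose pooled distribution encodes its counts in the fewest bits, and the state distributions are then re-estimated. The function names (`context_counts`, `bin_contexts`, and so on), the four-letter alphabet, the round-robin initialization, and the Laplace smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four symbols


def context_counts(seq, order):
    """Count next-symbol occurrences for each context of `order` previous symbols."""
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][ALPHABET.index(seq[i])] += 1
    return dict(counts)


def pooled_distribution(count_vectors):
    """Laplace-smoothed probability distribution from pooled count vectors."""
    total = [1] * len(ALPHABET)  # +1 per symbol keeps every probability nonzero
    for vec in count_vectors:
        total = [t + v for t, v in zip(total, vec)]
    norm = sum(total)
    return [t / norm for t in total]


def cost_bits(vec, dist):
    """Bits needed to encode the counted symbols in `vec` using `dist`."""
    return -sum(n * math.log2(p) for n, p in zip(vec, dist))


def bin_contexts(counts, num_states, iterations=10):
    """Hypothetical stand-in optimizer: alternate between re-estimating each
    state's distribution and reassigning every context to its cheapest state."""
    contexts = sorted(counts)
    # Assumed initialization: spread the contexts round-robin over the states.
    bin_of = {c: i % num_states for i, c in enumerate(contexts)}

    def estimate():
        return [
            pooled_distribution([counts[c] for c in contexts if bin_of[c] == s])
            for s in range(num_states)
        ]

    for _ in range(iterations):
        dists = estimate()
        bin_of = {
            c: min(range(num_states), key=lambda s: cost_bits(counts[c], dists[s]))
            for c in contexts
        }
    return bin_of, estimate()
```

After fitting, prediction is a pair of table lookups: `state = bin_of[context]`, then `dists[state]` gives the probability distribution passed to the entropy coder. Only `num_states` distributions need to be stored, however many distinct contexts occur. Contexts unseen during fitting are not handled in this sketch.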
