PtLnc-BXE: Prediction of plant lncRNAs using a Bagging-XGBoost-ensemble method with multiple features
Motivation: Long non-coding RNAs (lncRNAs) are a diverse class of RNA molecules longer than 200 nucleotides that do not encode proteins. Because lncRNAs are involved in a wide range of cellular and developmental processes, an increasing number of methods and tools for distinguishing lncRNAs from coding RNAs have been proposed. However, most existing methods are designed for lncRNAs in animal systems, and only a few focus on plant lncRNA identification. Plant lncRNAs have characteristics distinct from those of animal lncRNAs, so a computational method for accurate and rapid identification of plant lncRNAs is desirable.
Results: Herein, we present PtLnc-BXE, a plant lncRNA prediction approach that combines multiple sequence features in two steps to build an ensemble model. First, a diverse set of plant lncRNA features is collected, filtered by feature selection, and then used to represent RNA sequences. Then, the training dataset is sampled into several subsets using the bootstrapping technique, base learners are constructed on the subsets using XGBoost, and the base learners are combined into a single meta-learner using logistic regression. PtLnc-BXE outperformed other state-of-the-art plant lncRNA prediction methods, achieving higher AUC (> 95.9 reveal that the different species have a high overlap between their selected features for modeling. Therefore, it is possible to build cross-species prediction models for plant lncRNAs.
Availability: The scripts and data can be downloaded at https://github.com/xxxxx
Contact: example@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
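The second step described under Results (bootstrap subsets, one base learner per subset, a logistic-regression meta-learner on top) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: a single-threshold decision stump stands in for XGBoost so the example runs with the standard library alone, the two-feature toy data replaces real sequence features, and fitting the meta-learner on base-learner outputs over the full training set is an assumption. All function names here are hypothetical.

```python
import math
import random

def bootstrap(rows, rng):
    """Draw len(rows) examples with replacement (one bagging subset)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Stand-in base learner (NOT XGBoost): best single-feature threshold rule."""
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for thr in sorted({x[f] for x, _ in rows}):
            for sign in (1, -1):
                correct = sum(
                    1 for x, y in rows
                    if (1 if sign * (x[f] - thr) > 0 else 0) == y
                )
                if best is None or correct > best[0]:
                    best = (correct, f, thr, sign)
    _, f, thr, sign = best
    return lambda x: 1.0 if sign * (x[f] - thr) > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(inputs, labels, lr=0.5, epochs=300):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w = [0.0] * len(inputs[0])
    b = 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for z, y in zip(inputs, labels):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - y
            grad_b += err
            for j, zj in enumerate(z):
                grad_w[j] += err * zj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def fit_ensemble(rows, n_learners=5, seed=0):
    """Train one base learner per bootstrap subset, then stack their outputs."""
    rng = random.Random(seed)
    learners = [fit_stump(bootstrap(rows, rng)) for _ in range(n_learners)]
    meta_inputs = [[h(x) for h in learners] for x, _ in rows]
    w, b = fit_logistic(meta_inputs, [y for _, y in rows])

    def predict_proba(x):
        z = [h(x) for h in learners]
        return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)

    return predict_proba

def make_toy_data(n, seed):
    """Hypothetical data: label is 1 when the first feature exceeds 0.5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        rows.append((x, 1 if x[0] > 0.5 else 0))
    return rows

train = make_toy_data(200, seed=1)
test = make_toy_data(100, seed=2)
model = fit_ensemble(train)
accuracy = sum((model(x) > 0.5) == bool(y) for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In a realistic setting, each stump would be replaced by a gradient-boosted tree model such as `xgboost.XGBClassifier`, and the rows would carry the selected sequence features instead of toy values.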