BMD-SRA: A Boosting Model for Differentiating Sequence Read Archive sequences
Authors/Creators
- 1. Helmholtz Centre for Environmental Research
Description
The number of sequence files deposited in the Sequence Read Archive (NCBI-SRA) has been growing exponentially over the years, and with it the number of sequences annotated with an incorrect sequence type. The submitted sequences are then used for genomic, metagenomic, and taxonomic studies. This creates a need in the research community for a model that facilitates the collection of correctly annotated data. This study aimed to develop a boosting classification model called BMDSRA that classifies input sequences into four sequence types: 1) Metagenomes, 2) Amplicons, 3) Single-Amplified Genomes (SAGs), and 4) Isolated-Genomes.
To develop the Machine Learning (ML) algorithm, we gathered 3000 samples for each sequence type and used them for supervised ML. Metagenomes were collected from various manually curated metagenome databases (DBs) created by our team (Kasmanas et al., Nucleic Acids Research, 2020), with 750 samples taken from each. Amplicon samples were gathered from the Joint Genome Institute portal based on their library strategy. The SAG samples were collected by manually inspecting published research papers to confirm that they were sequenced from a single cell. The Isolated-Genomes were gathered from the SRA by searching for bacterial type-strain genomes from different taxonomies. BMDSRA reads a small portion of the sequence file using a sub-sampling approach (SRA Toolkit Development Team, https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) and extracts statistical features based on Shannon entropy, Tsallis entropy, and the Fourier z-curve. The extracted features were evaluated using the QPFS method (Soheili et al., Scientific Programming, 2020), and the reliability of the training data was tested with an outlier analysis.
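As a rough illustration of the entropy-based features mentioned above, the following minimal sketch computes Shannon and Tsallis entropy over the k-mer frequency distribution of a read. The function names, the k-mer size `k`, and the Tsallis index `q` are assumptions made for this example; the abstract does not state which values or which implementation BMDSRA uses, and the Fourier z-curve features are not covered here.

```python
import math
from collections import Counter


def kmer_distribution(sequence, k=1):
    """Return relative frequencies of overlapping k-mers in a DNA string.

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    sequence = sequence.upper()
    kmers = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    counts = Counter(km for km in kmers if set(km) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {km: n / total for km, n in counts.items()}


def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy: S_q = (1 - sum(p^q)) / (q - 1).

    For q == 1 the limit is the Shannon entropy in nats, returned directly.
    """
    probs = list(probs)
    if q == 1:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


# Hypothetical read used only for illustration.
read = "ACGTACGTTTGACCA"
dist = kmer_distribution(read, k=2)
features = {
    "shannon_k2": shannon_entropy(dist.values()),
    "tsallis_k2_q2": tsallis_entropy(dist.values(), q=2.0),
}
print(features)
```

In a setting like the one described, such values would be computed on the sub-sampled reads of each file and combined with other statistics into one feature vector per sample.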
From the 119 generated features, we chose the 38 with the highest importance for developing the model. The outlier analysis showed that the SAG and Amplicon data sets were the most reliable, with few outliers. The outliers from Metagenomes and Isolated-Genomes were subjected to further manual investigation. The model was created and evaluated using 5-fold cross-validation. The confusion matrix showed an overall accuracy of 92% (96% for SAGs, 95% for Amplicons, 92% for Metagenomes, and 85% for Isolated-Genomes). The false negatives from Isolated-Genomes classified as Metagenomes (7.6%) and SAGs (5.9%) are likely due to incorrect annotation in the SRA. The false negatives from Metagenomes classified as Isolated-Genomes (6.7%) are potentially due to the download process from our DBs.
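The reported overall accuracy can be checked against the per-class figures with a short calculation. The sketch below assumes that all 3000 samples per sequence type entered the evaluation, so that every class carries equal weight; the function name `overall_accuracy` is hypothetical and not part of BMDSRA.

```python
def overall_accuracy(per_class_accuracy, class_sizes):
    """Pool per-class accuracies into one overall accuracy, weighted by class size."""
    correct = sum(per_class_accuracy[name] * class_sizes[name]
                  for name in per_class_accuracy)
    total = sum(class_sizes[name] for name in per_class_accuracy)
    return correct / total


# Per-class accuracies as reported in the abstract.
reported = {
    "SAGs": 0.96,
    "Amplicons": 0.95,
    "Metagenomes": 0.92,
    "Isolated-Genomes": 0.85,
}

# Assumption: balanced classes of 3000 samples each.
class_sizes = {name: 3000 for name in reported}

overall = overall_accuracy(reported, class_sizes)
print(f"Overall accuracy: {overall:.1%}")
```

With balanced classes the pooled value is simply the mean of the four per-class accuracies, which matches the stated 92%.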
BMDSRA can help researchers verify that the sequences they submit to or collect from public repositories are correctly annotated. Our tool could also be used to select samples for meta-studies and to assess whether sequencing projects were performed well.
Files
| Name | Size |
|---|---|
| 01_Bole_BAGECO_2023_poster_V8.pdf (md5:d679862351b9331a3d9e9b04de415e0a) | 1.3 MB |
Additional details
Funding
- Deutsche Forschungsgemeinschaft
- NFDI4Microbiota 460129525
References
- Alneberg, J., Karlsson, C. M. G., Divne, A.-M., Bergin, C., Homa, F., Lindh, M. V., Hugerth, L. W., Ettema, T. J. G., Bertilsson, S., Andersson, A. F., and Pinhassi, J. (2018). Genomes from uncultivated prokaryotes: a comparison of metagenome-assembled and single-amplified genomes. Microbiome, 6(1):173.
- National Center for Biotechnology Information (U.S.). SRA Handbook. National Center for Biotechnology Information, Bethesda.
- Hosokawa, M., Endoh, T., Kamata, K., Arikawa, K., Nishikawa, Y., Kogawa, M., Saeki, T., Yoda, T., and Takeyama, H. (2022). Strain-level profiling of viable microbial community by selective single-cell genome sequencing. Sci Rep, 12(1):4443.
- Soheili, M., Moghadam, A.-M. E., and Dehghan, M. (2020). Statistical analysis of the performance of rank fusion methods applied to a homogeneous ensemble feature ranking. Scientific Programming, 2020:8860044.
- Torres, P. J., Edwards, R. A., and McNair, K. A. (2017). PARTIE: a partition engine to separate metagenomic and amplicon projects in the sequence read archive. Bioinformatics, 33(15):2389-2391.