Simulated NGS read datasets for bacterial pathogenic potential prediction
- 1. Robert Koch Institute
- 2. Free University of Berlin
Description
## Predicting pathogenic potentials from NGS reads: novel bacterial species
This repository contains simulated Illumina read datasets for bacterial pathogenic potential prediction and associated metadata extracted from the IMG Database (https://img.jgi.doe.gov/). The reads are 250 bp long and were simulated with Mason (https://www.seqan.de/apps/mason/) from genomes downloaded from NCBI. The training-validation-test split was done at the species level to ensure "novelty" of the validation and test species. The training sets contain 10 million reads per class, the validation sets 1.25 million reads per class, and the test sets 1.25 million paired reads per class. Additional, imbalanced training sets contain 2.5 million "nonpathogenic" and 17.5 million "pathogenic" reads, keeping the mean coverage constant for all species. The temporal benchmark test set contains reads from 3 additional pathogenic species in the Pantoea genus.
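To make the "constant mean coverage" constraint concrete, here is a minimal sketch of how a fixed read budget could be divided among genomes so that each ends up with the same mean coverage (reads × read length / genome length). This is an illustration under stated assumptions, not the procedure used to build these datasets: the function names, the genome names, and the genome lengths are all hypothetical; only the 250 bp read length comes from the description above.

```python
def reads_per_genome(genome_lengths, total_reads):
    """Split a read budget across genomes in proportion to genome length.

    Proportional allocation gives every genome the same mean coverage.
    Counts are rounded, so their sum can differ slightly from total_reads.
    """
    total_length = sum(genome_lengths.values())
    return {
        name: round(total_reads * length / total_length)
        for name, length in genome_lengths.items()
    }


def mean_coverage(n_reads, genome_length, read_length=250):
    """Mean coverage = sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length


# Hypothetical genome lengths in base pairs (not taken from the dataset).
lengths = {"genome_a": 2_000_000, "genome_b": 5_000_000, "genome_c": 3_000_000}
allocation = reads_per_genome(lengths, total_reads=1_000_000)

for name, n_reads in allocation.items():
    print(name, n_reads, mean_coverage(n_reads, lengths[name]))
```

Longer genomes receive proportionally more reads, which is why a per-class read total does not translate into an equal number of reads per genome.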
## Predicting pathogenic potentials from NGS reads: novel strains of known species
The BacPaCS datasets contain reads simulated from the dataset compiled by Barash et al. (https://doi.org/10.1093/bioinformatics/bty928). In this case, the training-validation-test split was done at the strain level, so different strains of the same species may be present in all three sets.
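The difference between a species-level and a strain-level split comes down to which label is kept disjoint across the three sets. The sketch below illustrates this with a generic group-aware split. It is not the code used to produce these datasets: the `split_by_group` helper, the 80/10/10 fractions, and the toy genome records are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_group(records, group_key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/val/test so no group spans two sets.

    Whole groups (e.g. all genomes of one species) are assigned to a
    single set; the fractions apply to the number of groups.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    names = sorted(groups)                # deterministic order before shuffling
    random.Random(seed).shuffle(names)

    n_train = int(len(names) * fractions[0])
    n_val = int(len(names) * fractions[1])
    parts = {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }
    return {
        split: [record for name in group_names for record in groups[name]]
        for split, group_names in parts.items()
    }


# Toy metadata: 10 hypothetical species with 3 strains each.
genomes = [
    {"species": f"species_{i}", "strain": f"species_{i}_strain_{j}"}
    for i in range(10)
    for j in range(3)
]

by_species = split_by_group(genomes, "species")  # validation/test species are unseen
by_strain = split_by_group(genomes, "strain")    # a species may recur across sets
```

Grouping by `species` corresponds to the "novel species" setting, while grouping by `strain` corresponds to the "novel strains of known species" setting described here.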
Files
(5.6 GB)
Checksum | Size |
---|---|
md5:30d8b2ad77930e2f62e0426ce5b85e79 | 144.9 kB |
md5:b5b0ef0af505bbe8eed15ee4539068e3 | 4.7 MB |
md5:e3fb632b8e3e68a8682f8dbb326004c2 | 3.1 MB |
md5:bce2cc1dc464a5b27eed18de3fddad85 | 35.9 MB |
md5:6e691a2c9023001dfc654ea214db6f56 | 35.9 MB |
md5:de6ba1ef521122ffe2ae39a176d134a7 | 805.6 MB |
md5:b7f9a761287e3dd1ca2414adebff2901 | 100.8 MB |
md5:e34306ddb3b693eff59448f6d8f7fab5 | 50.5 MB |
md5:d8e7d711da63f17a60bb7e543cf8e53a | 50.5 MB |
md5:78596e6a34ec266f1e86804f9e44f9c2 | 808.2 MB |
md5:1a4b8ef54d6ca6604cb5ef104c95fbbf | 202.8 MB |
md5:696379f88117ddd2f4512d078abde17f | 101.3 MB |
md5:0f390d097c9299bd9031c58c3c99b5be | 63.6 MB |
md5:b8ed47e64ec332e68ed3c3cc6251f546 | 63.6 MB |
md5:96ba33204e1fe75c3fdf8910502aee9a | 804.4 MB |
md5:ff3a46054bdb761eda395b40178207ef | 100.5 MB |
md5:b79795f8c27e01eeee04e0723f3c2a1c | 50.3 MB |
md5:f2af6eb49cffec82f8f426b6d7e478be | 50.3 MB |
md5:760306b929c00e9072da6e1f0d82df20 | 802.5 MB |
md5:ecd05be4a0da2dbeb1cf3a642d52421f | 1.4 GB |
md5:6b2e5c71634bd48a4c607e0979c09b2b | 100.8 MB |
md5:e8dc8d64b1e28f679cd2d5e69311967a | 4.0 MB |
md5:8c218df6eb9766a0c6c28b42b7edd60e | 4.0 MB |
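The MD5 checksums listed for each file can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the example file name in the usage note are hypothetical, since file names are not shown in the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file matches the listed checksum.

    Accepts the checksum with or without the leading "md5:" prefix.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `verify_download("downloaded_file", "md5:30d8b2ad77930e2f62e0426ce5b85e79")` would return `True` only if the local file (hypothetical name) is byte-identical to the 144.9 kB file in the listing. Reading in chunks keeps memory use low for the larger archives.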
Additional details
Related works
- Is supplement to journal article: 10.1093/bioinformatics/btz541 (DOI)
References
- Barash, E. et al. (2019), BacPaCS—Bacterial Pathogenicity Classification via Sparse-SVM. Bioinformatics, 35(12), 2001–2008
- Chen, I.-M. A. et al. (2019). IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic Acids Research, 47(D1), D666–D677.