Published January 31, 2019 | Version v1.0
Dataset Open

Simulated NGS read datasets for bacterial pathogenic potential prediction

  • 1. Robert Koch Institute
  • 2. Free University of Berlin

Description

## Predicting pathogenic potentials from NGS reads: novel bacterial species

This repository contains simulated Illumina read datasets for bacterial pathogenic potential prediction and associated metadata extracted from the IMG Database (https://img.jgi.doe.gov/). The reads are 250 bp long and were simulated with Mason (https://www.seqan.de/apps/mason/) from genomes downloaded from NCBI. The training-validation-test split was done on the species level to ensure "novelty" of validation and test species. The training sets contain 10 million reads per class, the validation sets 1.25 million reads per class, and the test sets 1.25 million paired reads per class. Additional imbalanced training sets contain 2.5 million "nonpathogenic" and 17.5 million "pathogenic" reads, keeping the mean coverage constant for all species. The temporal benchmark test set contains reads from 3 additional pathogenic species in the Pantoea genus.
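To illustrate what a species-level split means in practice, here is a minimal Python sketch. It is not the procedure used to build these datasets: the function name `split_by_species`, the 80/10/10 fractions, the seed, and the toy genome list are all assumptions made for illustration. The only property it is meant to show is that every genome of a given species lands in exactly one of the three sets.

```python
import random
from collections import defaultdict


def split_by_species(genomes, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole species to train/validation/test.

    genomes: list of (genome_id, species) pairs.
    fractions: hypothetical species fractions (not taken from the dataset).
    Returns a dict mapping split name to a list of genome ids.
    """
    # Group genome ids by species so a species is never divided.
    by_species = defaultdict(list)
    for genome_id, species in genomes:
        by_species[species].append(genome_id)

    # Shuffle species names reproducibly, then cut the list into three parts.
    species_names = sorted(by_species)
    random.Random(seed).shuffle(species_names)
    n_train = int(len(species_names) * fractions[0])
    n_val = int(len(species_names) * fractions[1])
    groups = {
        "train": species_names[:n_train],
        "validation": species_names[n_train:n_train + n_val],
        "test": species_names[n_train + n_val:],
    }
    return {
        name: [g for s in members for g in by_species[s]]
        for name, members in groups.items()
    }


# Toy input: 10 made-up species with 2 genomes each.
genomes = [(f"genome_{i}_{j}", f"species_{i}") for i in range(10) for j in range(2)]
splits = split_by_species(genomes)
```

Because the unit of assignment is the species, no validation or test species can also appear in the training set, which is what makes those species "novel" to a trained model.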

## Predicting pathogenic potentials from NGS reads: novel strains of known species

The BacPaCS datasets contain reads simulated from the dataset compiled by Barash et al. (https://doi.org/10.1093/bioinformatics/bty928). In this case, the training-validation-test split was done on the strain level, so different strains of the same species may be present in all three sets.
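The practical consequence of a strain-level split is that species can be shared between sets. The following Python sketch shows one way to check for such overlap. The helper `shared_species` and the strain labels are hypothetical and are not part of the dataset or its metadata format.

```python
def shared_species(splits):
    """Return the species that occur in every split.

    splits: dict mapping split name to a list of (strain_id, species) pairs.
    """
    species_sets = [
        {species for _, species in members} for members in splits.values()
    ]
    return set.intersection(*species_sets)


# Made-up strain-level split: strains differ, but one species recurs everywhere.
strain_level = {
    "train": [("strain_A", "Escherichia coli"), ("strain_B", "Bacillus subtilis")],
    "validation": [("strain_C", "Escherichia coli")],
    "test": [("strain_D", "Escherichia coli"), ("strain_E", "Bacillus subtilis")],
}
overlap = shared_species(strain_level)
```

For a species-level split the same check would return an empty set.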

Files

Files (5.6 GB)

| MD5 checksum | Size |
|---|---|
| md5:30d8b2ad77930e2f62e0426ce5b85e79 | 144.9 kB |
| md5:b5b0ef0af505bbe8eed15ee4539068e3 | 4.7 MB |
| md5:e3fb632b8e3e68a8682f8dbb326004c2 | 3.1 MB |
| md5:bce2cc1dc464a5b27eed18de3fddad85 | 35.9 MB |
| md5:6e691a2c9023001dfc654ea214db6f56 | 35.9 MB |
| md5:de6ba1ef521122ffe2ae39a176d134a7 | 805.6 MB |
| md5:b7f9a761287e3dd1ca2414adebff2901 | 100.8 MB |
| md5:e34306ddb3b693eff59448f6d8f7fab5 | 50.5 MB |
| md5:d8e7d711da63f17a60bb7e543cf8e53a | 50.5 MB |
| md5:78596e6a34ec266f1e86804f9e44f9c2 | 808.2 MB |
| md5:1a4b8ef54d6ca6604cb5ef104c95fbbf | 202.8 MB |
| md5:696379f88117ddd2f4512d078abde17f | 101.3 MB |
| md5:0f390d097c9299bd9031c58c3c99b5be | 63.6 MB |
| md5:b8ed47e64ec332e68ed3c3cc6251f546 | 63.6 MB |
| md5:96ba33204e1fe75c3fdf8910502aee9a | 804.4 MB |
| md5:ff3a46054bdb761eda395b40178207ef | 100.5 MB |
| md5:b79795f8c27e01eeee04e0723f3c2a1c | 50.3 MB |
| md5:f2af6eb49cffec82f8f426b6d7e478be | 50.3 MB |
| md5:760306b929c00e9072da6e1f0d82df20 | 802.5 MB |
| md5:ecd05be4a0da2dbeb1cf3a642d52421f | 1.4 GB |
| md5:6b2e5c71634bd48a4c607e0979c09b2b | 100.8 MB |
| md5:e8dc8d64b1e28f679cd2d5e69311967a | 4.0 MB |
| md5:8c218df6eb9766a0c6c28b42b7edd60e | 4.0 MB |
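The MD5 checksums listed for each file can be used to confirm that a download completed without corruption. Below is a minimal Python sketch using only the standard library. The helper names `md5_of_file` and `matches_listing` and the chunk size are assumptions for illustration, and the path passed in is whatever local name the downloaded file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing(local_path, "md5:30d8b2ad77930e2f62e0426ce5b85e79")` returns `True` only if the local file has exactly that digest. Reading in chunks matters here because several of the files are hundreds of megabytes or larger.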

Additional details

Related works

Is supplement to
Journal article: 10.1093/bioinformatics/btz541 (DOI)

References

  • Barash, E. et al. (2019). BacPaCS—Bacterial Pathogenicity Classification via Sparse-SVM. Bioinformatics, 35(12), 2001–2008.
  • Chen, I.-M. A. et al. (2019). IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic Acids Research, 47(D1), D666–D677.