Simulated NGS read datasets for novel human virus prediction
- 1. Hasso Plattner Institute
- 2. Free University of Berlin
Description
This repository contains simulated Illumina read datasets for novel human virus prediction and associated metadata extracted from the Virus Host Database (https://www.genome.jp/virushostdb/). The reads are 250bp long and were simulated with Mason (https://www.seqan.de/apps/mason/) from genomes downloaded from NCBI. The training-validation-test split was done on whole viral sequences to ensure "novelty" of validation and test viruses. The training sets contain 10 million reads per class, the validation sets 1.25 million reads per class, and the test sets 1.25 million paired reads per class. The positive class contains reads from human-infecting viruses. The negative class sets contain reads simulated from chordate-infecting ("cho"), metazoan-infecting ("met"), eukaryote-infecting ("euk") and all-nonhuman viruses. The stratified dataset ("strat") contains an equal number of reads from "cho", "met but not cho", "euk but not met" and "all but not euk".
Species-level datasets ("humspec", "allspec" and "chospec", with the corresponding fasta and *_species.rds files) are constructed analogously, but ensuring that all viruses of a given species were assigned to exactly one of the training, validation or test sets. This is a stricter setting modelling a "novel viral species" scenario while reflecting within-species phenotype diversity.
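The species-level constraint can be illustrated with a small group-wise split. This is only a sketch of the general idea, not the procedure used to build these datasets: the function name `split_by_species`, the input mapping, the split fractions and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_species(genome_to_species, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every genome of a species to exactly one of train/val/test.

    genome_to_species: dict mapping a genome ID to its species name.
    fractions: hypothetical share of species given to train and val;
               whatever remains goes to test.
    """
    # Group genomes by species so a species is never split across sets.
    by_species = defaultdict(list)
    for genome, species in genome_to_species.items():
        by_species[species].append(genome)

    # Shuffle species (not genomes) reproducibly.
    species_names = sorted(by_species)
    random.Random(seed).shuffle(species_names)

    n_train = int(len(species_names) * fractions[0])
    n_val = int(len(species_names) * fractions[1])
    parts = {
        "train": species_names[:n_train],
        "val": species_names[n_train:n_train + n_val],
        "test": species_names[n_train + n_val:],
    }
    # Expand each set of species back into its genomes.
    return {
        name: [g for s in chosen for g in by_species[s]]
        for name, chosen in parts.items()
    }
```

Because the shuffle operates on species rather than on individual genomes, no species can contribute genomes to more than one set, which is the property the "novel viral species" setting relies on.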
blast_hits.gz contains BLAST hits of human virome reads from Moustafa et al., 2017 (https://doi.org/10.1371/journal.ppat.1006292) blasted against our training database (see paper for details). The second column holds the matched label and the accession number of the matched reference. blast_labels_complete.gz contains extracted labels for all virome reads, including those without any matches. Note: one of the read headers (>3c8ac47039d32b11c8fe23f588e444e9) from Moustafa et al. is slightly corrupted with null characters. You can remove them with sed 's/\x0//g' or equivalent.
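For readers who prefer not to use sed, a minimal Python equivalent of the null-character cleanup might look like the sketch below. The function name `strip_null_bytes` and the file paths are hypothetical, and the input is assumed to be an uncompressed file.

```python
def strip_null_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every null byte (0x00)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Read in chunks so large read files do not need to fit in memory.
        for chunk in iter(lambda: src.read(chunk_size), b""):
            dst.write(chunk.replace(b"\x00", b""))
```

Working in binary mode avoids any decoding errors that the corrupted header could otherwise trigger.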
Files
(6.9 GB)
MD5 checksum | Size |
---|---|
md5:e6b4c304efca442a6f73f7dfe47d2387 | 509.5 MB |
md5:5551549d66d846c33d1de30e77f58052 | 279.7 MB |
md5:51ad807ebf5d17fce1abd7adf25be26f | 45.2 MB |
md5:956c25a0640436f31f31ebf9703367f9 | 45.2 MB |
md5:3da6c73b96dce6cef9638be3b9502ea0 | 724.3 MB |
md5:4d203398fc16169de8dd659cc1c6a745 | 90.6 MB |
md5:2bdbacfd8e9b580c1aa89b8533cb9327 | 39.2 MB |
md5:f29610e8e1fbaad69feb17b569f6e056 | 39.2 MB |
md5:87bd98bae66d87dd75a0606f46d9e83e | 677.3 MB |
md5:32e6ebd3eaa6b73f31d4228e227bf135 | 85.3 MB |
md5:861d93fb6a647f2b9a82a43baccaf464 | 46.4 MB |
md5:b0021ff1f483d7ac20adc58cbd4251f4 | 46.4 MB |
md5:be496655e3369507056527e0460bbb93 | 42.3 MB |
md5:1b7a7877c28236e4efa52ba2eaa06607 | 42.3 MB |
md5:b5be592ba56bceb295249719bb552540 | 43.2 MB |
md5:bd8654bd51722d123991c7d22d4a2aa2 | 43.2 MB |
md5:bac1f86746fac4cfe851d09cf59ab323 | 43.3 MB |
md5:09fef8a7f00f82191579b7fab667dc31 | 43.3 MB |
md5:83e7288b1d8edd8407ad6e49e68f6626 | 44.6 MB |
md5:39e9c26840aa4e554a0a645bac7e004f | 44.6 MB |
md5:0a472ba2a58677670c4d17ea404ce73e | 725.4 MB |
md5:91b78b8c971e1cf88da0291161b45932 | 679.7 MB |
md5:a82c115d3a091d1bc72d9a87db1f4564 | 707.2 MB |
md5:c74933750380f0d0c8e388c6caa54254 | 91.6 MB |
md5:49550fa742c3b8896bd153fa61303450 | 86.3 MB |
md5:e85d57352c4adf24b7daeec470341b32 | 89.1 MB |
md5:bd382f5dd92edfad5aea95c3f13e6578 | 38.8 MB |
md5:11839cf11f00ef2545239c5c688a912c | 38.8 MB |
md5:bc2d1002cbbddb52b3a8061ec20fbcf7 | 607.6 MB |
md5:259023a466ca04da1f494c3169eafaaa | 70.1 MB |
md5:e8e7eaea021d77d2964c8bdc6520a807 | 39.2 MB |
md5:597d7c290c2acff787810309b7e60393 | 39.2 MB |
md5:ba8468b101a92b725df1c54a22e38b27 | 606.0 MB |
md5:e8ea78f44db77fc3d55640c8e0a9b8b8 | 80.4 MB |
md5:6712dee8dc1a7a846938a0e8d223e474 | 520.3 kB |
md5:560a3a1411bd2ad80c4f13aa76c27d7e | 472.5 kB |
md5:79178e6444ea4ab27c7660e0a98c0a22 | 372.7 kB |
md5:116ac740422a661569310565cc281e68 | 295.3 kB |
md5:54ec3b4e5e1c8861adc52365f11ff52f | 454.4 kB |
md5:03879fe3f7f0c4daf274ffb4b0643d28 | 346.4 kB |
md5:bd78ab2258d1e782934d8998da56f10d | 258.6 kB |
md5:655eeb5c7de9edc819bb90e61366aa88 | 396.9 kB |
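
The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the local file path is whatever name you saved the download under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Hash in chunks so multi-hundred-megabyte files stay out of memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 with a listed value, with or without 'md5:'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `verify_md5(local_path, "md5:e6b4c304efca442a6f73f7dfe47d2387")` should return True only if `local_path` holds an intact copy of the file listed with that checksum.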