Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection
Creators
- 1. University of Liège
- 2. Flemish Research Institute for Agriculture, Fisheries and Food (ILVO)
- 3. University of California, Davis
- 4. University of Sütçü Imam
- 5. University of Bordeaux
- 6. Institute for Sustainable Plant Protection
- 7. Leibniz Institute DSMZ – German Collection of Microorganisms and Cell Cultur
- 8. Virology, Agroscope
- 9. National Institute of Biology
Description
In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows huge numbers of DNA and RNA fragments to be sequenced at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and of consensus on the standardization of data analysis.
Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). Each is composed of a “real” HTS dataset spiked with artificial viral reads. They allow researchers to adjust their pipelines and parameters so that the results approximate the actual viral composition of the semi-artificial datasets as closely as possible. Each semi-artificial dataset makes it possible to test one or several limitations that could prevent virus detection or correct virus identification from HTS data (i.e. low viral concentration, new viral species, incomplete genome).
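As a rough illustration of how a pipeline's output might be compared with the known viral composition of a semi-artificial dataset, the sketch below scores a list of detected viruses against a list of expected ones. The function name `score_detection`, the name normalization and the placeholder virus names are assumptions made for this example; they are not taken from the VIROMOCKchallenge repository, which documents the actual composition of each dataset.

```python
def score_detection(expected, detected):
    """Compare detected virus names with the expected composition.

    Names are compared case-insensitively after trimming whitespace.
    This is a simplification: real pipelines may report synonyms or
    accession numbers that need a proper mapping step first.
    """
    expected_set = {name.strip().lower() for name in expected}
    detected_set = {name.strip().lower() for name in detected}

    true_positives = expected_set & detected_set
    false_negatives = expected_set - detected_set   # viruses that were missed
    false_positives = detected_set - expected_set   # unexpected detections

    # Guard against division by zero when a list is empty.
    sensitivity = len(true_positives) / len(expected_set) if expected_set else 0.0
    precision = len(true_positives) / len(detected_set) if detected_set else 0.0

    return {
        "true_positives": sorted(true_positives),
        "false_negatives": sorted(false_negatives),
        "false_positives": sorted(false_positives),
        "sensitivity": sensitivity,
        "precision": precision,
    }


if __name__ == "__main__":
    # Placeholder names only; replace with the documented composition
    # of a dataset and the output of the pipeline under test.
    expected = ["Virus A", "Virus B", "Virus C"]
    detected = ["virus a", "Virus C", "Virus D"]
    print(score_detection(expected, detected))
```

Tracking missed and unexpected viruses separately, rather than a single score, makes it easier to see which limitation (for example low viral concentration) a given parameter change affects.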
Eight artificial datasets composed only of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates from the same viral species, present at different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset.
This entry is release v1.0.0 of a GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge), which provides a complete description of the composition of each dataset, the methods used to create them, and their goals.
Files
Name | Size | MD5 |
---|---|---|
VIROMOCKchallenge-v1.0.0.zip | 36.9 MB | 3e0243f916b103a6d65ba2db9afa9fa5 |
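The MD5 checksum listed for the archive can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the file path assumes the archive was saved under its original name in the current directory.

```python
import hashlib

# Checksum as listed for VIROMOCKchallenge-v1.0.0.zip in this record.
EXPECTED_MD5 = "3e0243f916b103a6d65ba2db9afa9fa5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Check whether the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    path = "VIROMOCKchallenge-v1.0.0.zip"
    print("checksum matches" if archive_is_intact(path) else "checksum mismatch")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a guarantee against deliberate tampering.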
Additional details
Related works
- Is supplement to
- Preprint: 10.5281/zenodo.4584718 (DOI)
- Is supplemented by
- Journal article: 10.24072/pci.genomics.100007 (DOI)
- Other: 10.5281/zenodo.4584967 (DOI)
- Dataset: 10.5281/zenodo.5572591 (DOI)
- Journal article: 10.24072/pcjournal.62 (DOI)