Published December 1, 2021 | Version v1.0.0
Other Open

Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection

  • 1. University of Liège
  • 2. Flemish Research Institute for Agriculture, Fisheries and Food (ILVO)
  • 3. University of California, Davis
  • 4. University of Sütçü Imam
  • 5. University of Bordeaux
  • 6. Institute for Sustainable Plant Protection
  • 7. Leibniz Institute DSMZ – German Collection of Microorganisms and Cell Cultures
  • 8. Virology, Agroscope
  • 9. National Institute of Biology

Description

In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge numbers of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of the data analysis.

Semi-artificial datasets have been created to help address this bottleneck (Datasets 1 to 10). Each is composed of a “real” HTS dataset spiked with artificial viral reads. They allow researchers to adjust their pipelines and parameters so that the results approximate the actual viral composition of the semi-artificial datasets as closely as possible. Each semi-artificial dataset makes it possible to test one or several limitations that could prevent virus detection or correct virus identification from HTS data (e.g. low viral concentration, new viral species, incomplete genome).
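
Because the viral composition of each semi-artificial dataset is known, a pipeline run can be scored against it. The following is a minimal Python sketch of such a comparison, not a method described in this record. The function name `score_detection` and the virus names in the example are hypothetical placeholders; the actual composition of each dataset is documented in the GitLab repository.

```python
def score_detection(expected, detected):
    """Compare the viruses reported by a pipeline with the known composition.

    `expected` and `detected` are iterables of virus names. Names are
    compared case-insensitively after trimming whitespace (an assumption;
    real pipelines may need taxonomy-aware matching instead).
    """
    expected_set = {name.strip().lower() for name in expected}
    detected_set = {name.strip().lower() for name in detected}

    found = expected_set & detected_set        # true positives
    missed = expected_set - detected_set       # false negatives
    unexpected = detected_set - expected_set   # false positives

    sensitivity = len(found) / len(expected_set) if expected_set else 0.0
    precision = len(found) / len(detected_set) if detected_set else 0.0

    return {
        "found": sorted(found),
        "missed": sorted(missed),
        "unexpected": sorted(unexpected),
        "sensitivity": sensitivity,
        "precision": precision,
    }


if __name__ == "__main__":
    # Hypothetical names, used only to illustrate the call.
    known_composition = ["Virus A", "Virus B"]
    pipeline_output = ["virus a", "Virus C"]
    print(score_detection(known_composition, pipeline_output))
```

Re-running such a comparison after each parameter change gives a simple way to see whether a pipeline is moving closer to the known composition.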

Eight artificial datasets composed only of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates from the same viral species, present at different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset.
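
One way to judge a haplotype reconstruction against a known mix is to check, for each true isolate, whether a sufficiently similar haplotype was reconstructed and how far its estimated frequency is from the true one. The Python sketch below illustrates the idea under strong simplifying assumptions: sequences are already aligned and of equal length, and matching is not one-to-one (two isolates may match the same reconstructed haplotype). The names `identity`, `match_haplotypes`, and `min_identity`, and the toy sequences, are hypothetical and are not taken from the datasets.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def match_haplotypes(truth, reconstructed, min_identity=0.99):
    """Match each true isolate to its most similar reconstructed haplotype.

    Both arguments map a name to a (sequence, frequency) tuple.
    An isolate counts as recovered when the best identity reaches
    `min_identity` (an arbitrary threshold chosen for illustration).
    """
    report = {}
    for name, (true_seq, true_freq) in truth.items():
        best_name, best_identity, best_freq = None, 0.0, None
        for rec_name, (rec_seq, rec_freq) in reconstructed.items():
            score = identity(true_seq, rec_seq)
            if best_name is None or score > best_identity:
                best_name, best_identity, best_freq = rec_name, score, rec_freq
        recovered = best_name is not None and best_identity >= min_identity
        report[name] = {
            "best_match": best_name,
            "identity": best_identity,
            "recovered": recovered,
            "frequency_error": abs(true_freq - best_freq) if recovered else None,
        }
    return report


if __name__ == "__main__":
    # Toy sequences and frequencies, for illustration only.
    truth = {"iso1": ("ACGTACGTAC", 0.7), "iso2": ("ACGTTCGTAC", 0.3)}
    reconstructed = {"h1": ("ACGTACGTAC", 0.65)}
    for isolate, result in match_haplotypes(truth, reconstructed, 1.0).items():
        print(isolate, result)
```

In practice, reconstructed haplotypes would first need to be aligned to the reference isolates, and a one-to-one assignment may be preferable when isolates are closely related.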

This entry is release v1.0.0 of a GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge), which provides a complete description of the composition of each dataset, the methods used to create them, and their goals.

Notes

This work reports the results of the Plant Health Bioinformatics Network (PHBN) Euphresco project (European Phytosanitary Research Coordination), funded by the Special Research Funds (FSR) of Liège University (byPOP project) and the Belgian Federal Government (FPS Health project RI 18/A-289 PHBN).

Files

VIROMOCKchallenge-v1.0.0.zip

Files (36.9 MB)

md5:3e0243f916b103a6d65ba2db9afa9fa5
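
The MD5 checksum listed above can be used to confirm that the archive was downloaded intact. A minimal Python sketch using the standard `hashlib` module follows; the helper names `md5_of_file` and `verify_archive` are illustrative, and the local path is assumed to be wherever the archive was saved.

```python
import hashlib

# Checksum as listed for VIROMOCKchallenge-v1.0.0.zip on this record.
EXPECTED_MD5 = "3e0243f916b103a6d65ba2db9afa9fa5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify.py VIROMOCKchallenge-v1.0.0.zip
    archive = sys.argv[1] if len(sys.argv) > 1 else "VIROMOCKchallenge-v1.0.0.zip"
    print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.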

Additional details

Related works

Is supplement to
Preprint: 10.5281/zenodo.4584718 (DOI)
Is supplemented by
Journal article: 10.24072/pci.genomics.100007 (DOI)
Other: 10.5281/zenodo.4584967 (DOI)
Dataset: 10.5281/zenodo.5572591 (DOI)
Journal article: 10.24072/pcjournal.62 (DOI)