Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection

1. University of Liège
2. Flemish Research Institute for Agriculture, Fisheries and Food (ILVO)
3. University of California, Davis
4. University of Sütçü Imam
5. University of Bordeaux
6. Institute for Sustainable Plant Protection
7. Leibniz Institute DSMZ – German Collection of Microorganisms and Cell Cultures
8. Virology, Agroscope
9. National Institute of Biology

In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows huge numbers of DNA and RNA fragments to be sequenced at a very low cost. In medicine, HTS tests for disease diagnostics have already entered routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of data analysis.

Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). Each is composed of a “real” HTS dataset spiked with artificial viral reads. They allow researchers to tune their pipelines and parameters so that the results approximate the actual viral composition of the semi-artificial datasets as closely as possible. Each semi-artificial dataset makes it possible to test one or several limitations that could prevent virus detection or correct virus identification from HTS data (i.e. low viral concentration, new viral species, incomplete genome).
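Because the viral composition of each semi-artificial dataset is known, a pipeline's output can be scored against it. The following is a minimal sketch of such a comparison, not part of the VIROMOCK repository: the function name `score_detection` and the virus names are hypothetical placeholders, and the real expected composition should be taken from the dataset descriptions in the repository.

```python
def score_detection(expected, detected):
    """Compare the viruses a pipeline reported with the known composition."""
    expected, detected = set(expected), set(detected)
    true_pos = expected & detected
    return {
        # Viruses present in the dataset that the pipeline failed to report.
        "missed": sorted(expected - detected),
        # Viruses reported by the pipeline that are not in the dataset.
        "unexpected": sorted(detected - expected),
        "sensitivity": len(true_pos) / len(expected) if expected else 0.0,
        "precision": len(true_pos) / len(detected) if detected else 0.0,
    }


# Hypothetical example: placeholder names, not the content of any real dataset.
report = score_detection(
    expected=["virus_A", "virus_B", "virus_C"],
    detected=["virus_A", "virus_C", "virus_X"],
)
print(report)
```

Re-running such a comparison after each parameter change gives a simple way to see whether a change reduces missed or unexpected detections.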

Eight artificial datasets composed only of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates of the same viral species, present at different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset.
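To illustrate what a mix of isolates at different frequencies means in practice, the sketch below draws error-free reads from toy sequences in proportion to chosen frequencies. It is an illustration under stated assumptions only: the function `simulate_mixture`, the toy sequences, and the frequencies are hypothetical, and this is not the method used to build Datasets 11 to 18, which is described in the repository.

```python
import random


def simulate_mixture(isolates, frequencies, n_reads, read_len, seed=0):
    """Draw error-free reads from several isolates at given frequencies.

    isolates:    dict mapping isolate name -> genome sequence (str)
    frequencies: dict mapping isolate name -> relative frequency (weight)
    Returns a list of (isolate_name, read_sequence) tuples.
    """
    for name, genome in isolates.items():
        if len(genome) < read_len:
            raise ValueError(f"{name} is shorter than the read length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(isolates)
    weights = [frequencies[name] for name in names]
    reads = []
    for _ in range(n_reads):
        # Pick the source isolate in proportion to its frequency.
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = isolates[name]
        # Pick a uniformly random start position for the read.
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((name, genome[start:start + read_len]))
    return reads


# Toy sequences and frequencies, chosen only for illustration.
toy_isolates = {
    "isolate_1": "ACGTACGTTAGCCGATAGCTAGGCTA",
    "isolate_2": "ACGTACGATAGCCGATCGCTAGGCTA",
}
toy_freqs = {"isolate_1": 0.9, "isolate_2": 0.1}
reads = simulate_mixture(toy_isolates, toy_freqs, n_reads=100, read_len=10)
```

A haplotype reconstruction tool given such reads would ideally recover both sequences, including the minor one.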

This entry is release v1.0.0 of a GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge) which provides a complete description of the composition of each dataset, the methods used to create them and their goals.

Notes

This work reports the results of the Plant Health Bioinformatics Network (PHBN) Euphresco project (European Phytosanitary Research Coordination), funded by Special Research Funds (FSR) of Liège University (byPOP project), and the Belgian Federal Government (FPS Health project RI 18/A-289 PHBN).

Files

Name	Size
VIROMOCKchallenge-v1.0.0.zip (md5:3e0243f916b103a6d65ba2db9afa9fa5)	36.9 MB
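The MD5 checksum listed for the archive can be used to confirm that a download is intact. The snippet below is a minimal sketch using Python's standard `hashlib` module; the helper name `md5_of_file` is hypothetical, and the commented usage assumes the archive was saved under its original name in the current directory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum as listed on this record for VIROMOCKchallenge-v1.0.0.zip.
EXPECTED_MD5 = "3e0243f916b103a6d65ba2db9afa9fa5"

# Usage (assumes the archive is in the current directory):
# if md5_of_file("VIROMOCKchallenge-v1.0.0.zip") != EXPECTED_MD5:
#     raise SystemExit("Checksum mismatch: download may be corrupted")
```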

Additional details

Is supplement to: Preprint: 10.5281/zenodo.4584718 (DOI)
Is supplemented by: Journal article: 10.24072/pci.genomics.100007 (DOI); Other: 10.5281/zenodo.4584967 (DOI); Dataset: 10.5281/zenodo.5572591 (DOI); Journal article: 10.24072/pcjournal.62 (DOI)

Resource type

Other

Publisher

Zenodo

Languages

English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: December 7, 2021
Modified: March 2, 2023
