Do Public Databases Need Higher Standards for Next-Generation Data Submissions?

Published May 16, 2023 | Version v2023-05-16

Poster Open

Genomics, an extension of genetics, is a powerful tool for studying the function and evolution of genes and genomes. Applied to the human genome, it can play a key role in understanding the origin of many human diseases, such as cancer.
However, obtaining meaningful insights into any medical condition or pathological state requires high-quality input data. Observations and conclusions based on incomplete or low-quality data are not only hard to reproduce and replicate, they are also highly questionable.
The vast majority of human Next-Generation Sequencing (NGS) datasets have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database.
This project began with the aim of re-analyzing a selected set of cancer-related NCBI-SRA datasets to evaluate our ability to both reproduce and replicate previously published results using a set of newly developed in-house algorithms.
To our surprise, we found that the overall quality of the selected datasets, and especially their genome coverage, was highly variable and often low, and that the datasets not uncommonly contained contaminating sequences.
In our view, these observations call into question the reproducibility and replicability of work based on these datasets.
We conclude that, to guarantee replicability and reproducibility in science, public databases such as the NCBI-SRA need to set higher standards for data submission.
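
As a rough illustration of what "genome coverage" means here, mean sequencing depth is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below is not the in-house method used in this project. The genome size (about 3.1 Gb), the 30x threshold, the function names, and the run labels are all assumptions chosen for illustration, and the base counts are made-up values, not results.

```python
# Minimal sketch: estimate mean depth of coverage and flag low-coverage runs.
# All constants and names below are illustrative assumptions, not the
# method or thresholds used in the poster.

HUMAN_GENOME_SIZE = 3_100_000_000  # assumption: approximate haploid human genome length in bp


def estimated_coverage(total_bases: int, genome_size: int = HUMAN_GENOME_SIZE) -> float:
    """Return mean depth of coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def flag_low_coverage(runs: dict, min_depth: float = 30.0,
                      genome_size: int = HUMAN_GENOME_SIZE) -> dict:
    """Return {run label: depth} for runs whose estimated depth is below min_depth.

    `runs` maps a run label to its total number of sequenced bases.
    `min_depth` is a hypothetical cutoff; a suitable value depends on the analysis.
    """
    flagged = {}
    for label, total_bases in runs.items():
        depth = estimated_coverage(total_bases, genome_size)
        if depth < min_depth:
            flagged[label] = depth
    return flagged


if __name__ == "__main__":
    # Hypothetical runs with made-up base counts.
    example_runs = {
        "run_A": 93_000_000_000,
        "run_B": 3_100_000_000,
    }
    for label, depth in flag_low_coverage(example_runs).items():
        print(f"{label}: estimated depth {depth:.1f}x is below the cutoff")
```

This estimate says nothing about how evenly reads are distributed or whether they are contaminated; it only shows the kind of minimum-depth check a submission standard could apply.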

Files

Name	Size
Genome_Coverage.pdf md5:671b72be2f2be8468a2e8e93ee07d2d9	493.5 kB