Published May 16, 2023
| Version v2023-05-16
Poster
Open
Do Public Databases Need Higher Standards for Next-Generation Data Submissions?
- 1. Department of Biochemistry and Biophysics, Texas A&M University
- 2. Department of Biology, Texas A&M University
Description
- Genomics, an extension of genetics, is a powerful tool for studying the function and evolution of genes and genomes. When applied to the human genome, it can play a key role in understanding the origin of many human diseases, such as cancer.
- However, obtaining meaningful insights into any medical condition or pathological state requires high-quality input data. Observations and conclusions based on incomplete or low-quality data are not only hard to replicate and reproduce but are also highly questionable.
- The vast majority of human Next-Generation Sequencing (NGS) datasets have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database.
- This project started with the aim of re-analyzing a selected set of cancer-related NCBI-SRA datasets in order to evaluate our ability to both reproduce and replicate previously published results, using a set of newly developed in-house algorithms.
- To our surprise, we found that the overall quality of these selected datasets was highly variable, that their genome coverage was often low (a back-of-the-envelope coverage estimate is sketched after this list), and that they not uncommonly contained contaminating sequences.
- In our view, these observations call into question the reproducibility and replicability of work based on these datasets.
- We conclude that, in order to guarantee replicability and reproducibility in science, public databases like the NCBI-SRA need to set higher standards for data submission.
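
A minimal sketch of the kind of coverage check implied above, using only the standard definition (average coverage = total sequenced bases / genome length). This is not the authors' in-house pipeline, which is not described here; the run numbers in the example are hypothetical.

```python
# Estimate average sequencing depth for an NGS run from its reported base
# count, using the standard definition: coverage = total bases / genome length.

HUMAN_GENOME_LENGTH = 3_100_000_000  # approximate haploid human genome size (bp)


def estimated_coverage(total_bases: int, genome_length: int = HUMAN_GENOME_LENGTH) -> float:
    """Return the expected average depth of coverage for a run."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


if __name__ == "__main__":
    # Hypothetical run: 30 million paired-end reads of 2 x 150 bp.
    reads, read_length = 30_000_000, 2 * 150
    depth = estimated_coverage(reads * read_length)
    # Prints roughly 2.9x, well below the ~30x commonly targeted for
    # human whole-genome sequencing.
    print(f"Estimated average coverage: {depth:.1f}x")
```

A quick check of this kind, applied to a run's metadata before (or after) submission, is enough to flag datasets whose coverage is too low to support the analyses they accompany.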
Files
Name | Size | Checksum
---|---|---
Genome_Coverage.pdf | 493.5 kB | md5:671b72be2f2be8468a2e8e93ee07d2d9