Published June 6, 2026 | Version 1
Preprint Open

The Case for Data Provenance and Authenticity in Genomics

  • 1. Syracuse University
  • 2. American Type Culture Collection

Description

Abstract

The exponential growth of publicly accessible genomic data over the last two decades has transformed life sciences, yet it has also exposed a critical vulnerability. Weakly enforced requirements for data provenance, structured metadata, and material authentication have degraded the potential of these resources for interoperability and reuse in digital biology. The lack of traceability and verification in genomic data poses escalating risks to scientific reproducibility, biosecurity, and the integrity of AI-driven biological research (AIxBio). Examples from cancer and microbial genomics, infectious disease surveillance, public sequence archives, and emerging AI-enabled biology demonstrate how gaps in data provenance and metadata quality undermine trust, drive irreproducible results, and create opportunities for data fabrication and misuse. The manuscript further emphasizes that reproducibility alone is insufficient when shared reference data are contaminated, mislabeled, incompletely described, or biologically outdated. In addition, the unique role of biological repositories and international culture collections is presented as bridging the physical-to-digital divide and enabling the creation of trusted “digital twins” for biological research. Finally, proactive preservation of the physical reference materials underpinning genomic data and the treatment of “metadata as infrastructure” are presented as key ingredients for the future success and sustainability of artificial intelligence and machine learning across the life sciences.

Files

The Case for Data Provenance in Genomics 05JUN2026.pdf

Files (348.3 kB)

Additional details

Additional titles

Subtitle
Building Trustworthy Foundations for Digital Biology