Published June 19, 2019 | Version v1
Journal article Open

Big Data Knowledge of Major Lineages of Life and Priorities for Genomic Research

  • 1. National Museum of Natural History, Smithsonian Institution, Washington, D.C., United States of America
  • 2. National Museum of Natural History, Smithsonian Institution, Washington, D.C., United States of America
  • 3. Botanic Garden and Botanical Museum Berlin, Berlin, Germany
  • 4. Natural History Museum Denmark, Copenhagen, Denmark

Description

Genomic science is revolutionizing and accelerating biodiversity research. For collections-based institutions to continue to lead and support biodiversity research, they must adapt to this new reality. Simultaneously, "big data" is accumulating so rapidly that we have unprecedented capacity to plan strategically to use genomics to advance basic and applied science on multiple fronts. For example, seven "big data" sources have the following numbers of records (2018 data): Global Biodiversity Information Facility (GBIF), ~1B; Biodiversity Heritage Library (BHL), ~3.6M; National Center for Biotechnology Information (NCBI), ~220M; Open Tree of Life (OToL), 1.9M; Barcode of Life Data System (BOLD), ~6.3M; Encyclopedia of Life (EOL), ~99K; Global Genome Biodiversity Network (GGBN), ~2M. Collectively, they offer more than 1.2B records on biodiversity. At the scale of species (~2M described, multiple millions undescribed), these data are still too sparse to permit comprehensive conclusions. At the scale of families (i.e., deeper clades of life), the situation is far more promising: about 9,911 families are known, and relatively few are discovered each year. This suggests that at the family rank (and above), our knowledge of life on Earth is reasonably complete. Approximately 160,000 valid and accepted genera exist, but certainly many new genera await discovery and description. Genomics is the fastest way to group species into more inclusive lineages such as genera and families, and is certainly faster than traditional alpha taxonomy. Synergistically, these "big data" answer four important questions at deeper clade levels: What is it? Where is it? What do we know about it? What do we know about its genome? Approximately 4,500 eukaryotic genomes have been sequenced. The converse of what we know is what we do not know, another meaning of "dark taxa."
We can use the distribution and density of big data at deeper clade levels (families, genera) to quantitatively analyze "dark taxa" and therefore to strategically optimize knowledge and preservation of biodiversity at a global scale. Technicalities of the quantitative prioritization scheme are debatable, but some initial, simple scoring systems can help to prioritize lineages for collection and genetic research so as to most efficiently illuminate regions in the tree of life that are neither preserved, imaged, geo-located, studied, nor known genomically. This analysis presents criteria and goals for collaborating to build a global genomic collection to maximize efficient acquisition of biodiversity genomic knowledge, and identifies the most valuable and highest priority taxa for genomic research.
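
The abstract does not specify a scoring system, so the following is only an illustrative sketch of what a "simple scoring system" of this kind might look like. It assumes one record count per lineage for each of the five knowledge dimensions named above, and scores a lineage by the number of dimensions with no records at all. The names `Lineage`, `darkness`, and `prioritize`, the family labels, and all counts are hypothetical placeholders, not data or methods from the article.

```python
from dataclasses import dataclass, field

# The five knowledge dimensions named in the abstract.
DIMENSIONS = ("preserved", "imaged", "geolocated", "studied", "genome")


@dataclass
class Lineage:
    name: str
    # dimension -> record count; placeholder numbers, not real data
    records: dict = field(default_factory=dict)


def darkness(lineage: Lineage) -> int:
    """Count dimensions with zero records (0 = well known, 5 = fully dark)."""
    return sum(1 for d in DIMENSIONS if lineage.records.get(d, 0) == 0)


def prioritize(lineages):
    """Return lineages with the darkest first; ties are broken by name."""
    return sorted(lineages, key=lambda l: (-darkness(l), l.name))


# Hypothetical example input.
lineages = [
    Lineage("Family A", {"preserved": 12, "imaged": 3, "geolocated": 40,
                         "studied": 7, "genome": 1}),
    Lineage("Family B", {"preserved": 2, "geolocated": 5}),
    Lineage("Family C"),
]

ranked = [(l.name, darkness(l)) for l in prioritize(lineages)]
print(ranked)
```

An unweighted count treats every dimension as equally important; a weighted sum (for example, giving extra weight to a missing genome) would be an equally plausible alternative, and the choice between them is exactly the kind of technicality the abstract describes as debatable.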

Files

BISS_article_37276.pdf

Files (71.5 kB)

Name Size
md5:a4a7fa1fe85d198418a866868bd62ae7 61.5 kB
md5:f6966af8e3b78c15b4bd57002148d59c 10.0 kB
