Published August 25, 2022 | Version v1
Dataset Open

Updating splits, lumps, and shuffles: Reconciling GenBank names with standardized avian taxonomies

Description

Abstract Biodiversity research has advanced by testing expectations of ecological and evolutionary hypotheses through the linking of large-scale genetic, distributional, and trait datasets. The rise of molecular systematics over the past 30 years has resulted in a wealth of DNA sequences from around the globe. Yet, advances in molecular systematics also have created taxonomic instability, as new estimates of evolutionary relationships and interpretations of species limits have required widespread scientific name changes. Taxonomic instability, colloquially "splits, lumps, and shuffles," presents logistical challenges to large-scale biodiversity research because (1) the same species or sets of populations may be listed under different names in different data sources, or (2) the same name may apply to different sets of populations representing different taxonomic concepts. Consequently, distributional and trait data are often difficult to link directly to primary DNA sequence data without extensive and time-consuming curation. Here, we present RANT: Reconciliation of Avian NCBI Taxonomy. RANT applies taxonomic reconciliation to standardize avian taxon names in use in NCBI GenBank, a primary source of genetic data, to a widely used and regularly updated avian taxonomy: eBird/Clements. Of 14,341 avian species/subspecies names in GenBank, 11,031 directly matched an eBird/Clements name; these link to more than 6 million nucleotide sequences. For the remaining unmatched avian names in GenBank, we used Avibase's system of taxonomic concepts, taxonomic descriptions in Cornell's Birds of the World, and DNA sequence metadata to identify corresponding eBird/Clements names. Reconciled names linked to more than 600,000 nucleotide sequences, ~9% of all avian sequences on GenBank. Nearly 10% of eBird/Clements names had nucleotide sequences listed under 2 or more GenBank names.
Our taxonomic reconciliation is a first step towards rigorous and open-source curation of avian GenBank sequences and is available at GitHub, where it can be updated to correspond to future annual eBird/Clements taxonomic updates.

Notes

D1:
"PetersVsClements2Final.txt" - This file indicates which species from the Peters taxonomy match the 2019 Clements/eBird taxonomy. The first column has a species name from the Peters taxonomy. In the second column, "Clements" indicates that the species name matches the Clements/eBird taxonomy, "No" means it does not match, and "Close" means that the names match when you disregard the last two letters.

"SibleyMonroeVsClements_Final.txt" - This file indicates which species from the Sibley Monroe taxonomy match the 2019 Clements/eBird taxonomy. The first column has a species ID number from the Sibley Monroe taxonomy. The second column has the species scientific name from the Sibley Monroe taxonomy. The third column has the common name from the Sibley Monroe taxonomy. In the fourth column, "Clements" indicates that the species name matches the Clements/eBird taxonomy, "No" means it does not match, and "Close" means that the names match when you disregard the last two letters.
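
The three labels used in the D1 files can be illustrated with a short Python sketch. This is not the authors' code: the function name `classify_name` and the toy names are hypothetical, and the sketch assumes that "disregard the last two letters" means trimming two characters from both names before comparing.

```python
def classify_name(name, clements_names):
    """Return "Clements", "Close", or "No" for one scientific name.

    Assumption: a "Close" match trims the last two letters from BOTH
    the query name and every reference name before comparing.
    """
    clements_names = set(clements_names)
    if name in clements_names:
        return "Clements"  # exact match
    # Stems of all reference names with the last two letters removed.
    stems = {other[:-2] for other in clements_names}
    if name[:-2] in stems:
        return "Close"  # names differ only in their last two letters
    return "No"


# Hypothetical placeholder names, not real taxa.
reference = {"Genus alphus", "Genus beta"}
for query in ("Genus alphus", "Genus alphum", "Genus gamma"):
    print(query, "->", classify_name(query, reference))
```

Names labeled "Close" would still need manual checking, since dropping two letters can in principle conflate distinct names.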

D2:
"taxonomy_result.unix.xml" - XML file with NCBI taxonomy with the names descending from "Aves" (downloaded May 3, 2020).

"GenBank.AvesSpecies.txt" - This text file has the GenBank species and subspecies names within "Aves". The first column has the GenBank taxon ID number. The second column has the scientific name corresponding to the taxon ID number, and the third column lists whether this name corresponds to a species or a subspecies.

"extractGBnames.pl" - Perl script that reads in "taxonomy_result.unix.xml" and outputs the taxon ID numbers, their corresponding scientific names, and their rank (e.g. "species").
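
For readers who do not use Perl, the extraction step can be approximated in Python. The sketch below is an illustration, not a port of "extractGBnames.pl". It assumes the XML follows the usual NCBI Taxonomy export layout (a `TaxaSet` root whose direct `Taxon` children carry `TaxId`, `ScientificName`, and `Rank`, with ancestors nested under `LineageEx`), and it assumes that only species and subspecies are kept. The function name and the sample IDs are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_names(xml_text, ranks=("species", "subspecies")):
    """Return (taxon ID, scientific name, rank) tuples for top-level taxa."""
    root = ET.fromstring(xml_text)
    rows = []
    # findall("Taxon") matches direct children only, so ancestor taxa
    # nested inside <LineageEx> are not reported.
    for taxon in root.findall("Taxon"):
        rank = taxon.findtext("Rank")
        if rank in ranks:
            rows.append((taxon.findtext("TaxId"),
                         taxon.findtext("ScientificName"),
                         rank))
    return rows


# Hypothetical miniature input; IDs and names are placeholders.
SAMPLE = """<TaxaSet>
  <Taxon>
    <TaxId>1001</TaxId>
    <ScientificName>Genus alphus</ScientificName>
    <Rank>species</Rank>
    <LineageEx>
      <Taxon>
        <TaxId>1000</TaxId>
        <ScientificName>Genus</ScientificName>
        <Rank>genus</Rank>
      </Taxon>
    </LineageEx>
  </Taxon>
  <Taxon>
    <TaxId>1002</TaxId>
    <ScientificName>Genus alphus minor</ScientificName>
    <Rank>subspecies</Rank>
  </Taxon>
</TaxaSet>"""

for taxid, name, rank in extract_names(SAMPLE):
    print(taxid, name, rank, sep="\t")
```

For a full-size export, streaming with `xml.etree.ElementTree.iterparse` would avoid loading the whole file into memory.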

D3:
"compare.pl" - Perl script that reads in the Clements/eBird 2019 taxonomy (the file "EbirdClements.txt" in D4) and the list of GenBank taxon names from Aves (the file "GenBank.AvesSpecies.txt" in D2). If the GenBank taxon name exactly matches a name in the Clements/eBird taxonomy, it outputs the GenBank taxon ID, GenBank name, GenBank rank, and all the information associated with the name in the Clements/eBird taxonomy. If the GenBank name does not match, it outputs only the GenBank taxon ID, GenBank name, and GenBank rank.
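
The exact-match comparison described for "compare.pl" amounts to a dictionary lookup keyed on scientific name. The Python sketch below shows that logic under stated assumptions: the function name `compare`, the in-memory data structures, and the toy records are hypothetical, and the real script's file parsing and output format are not reproduced.

```python
def compare(genbank_rows, clements_by_name):
    """Join GenBank names to Clements/eBird records by exact name match.

    genbank_rows: iterable of (taxon ID, scientific name, rank) tuples.
    clements_by_name: dict mapping scientific name to a tuple of the
        remaining Clements/eBird fields for that name.
    """
    output = []
    for taxid, name, rank in genbank_rows:
        record = clements_by_name.get(name)
        if record is not None:
            # Matched: GenBank fields followed by the Clements/eBird fields.
            output.append((taxid, name, rank, *record))
        else:
            # Unmatched: GenBank fields only.
            output.append((taxid, name, rank))
    return output


# Hypothetical placeholder data.
genbank = [("1001", "Genus alphus", "species"),
           ("1003", "Genus gamma", "species")]
clements = {"Genus alphus": ("code1", "species", "Alpha Bird")}

for row in compare(genbank, clements):
    print("\t".join(row))
```

Rows that come back with only three fields are the unmatched names that require reconciliation by other means.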

D4:
"EbirdClements.txt" - text file with the taxonomic names and associated metadata from the 2019 eBird/Clements dataset. The first column has the code associated with species names; the second column has the rank; the third column has the common name for the species; the fourth column has the scientific name; the fifth column has the range; the sixth column has the order name; and the last column has the family name (with the common family name next to it in parentheses).

D5:
"nucl_gb.accession2taxid.gz" - compressed file that has the GenBank accession numbers from the core nucleotide database and the taxon ID associated with each sequence. This was downloaded from NCBI on November 2, 2020.

D6:
"taxonomy_result.txt" - text file with a list of GenBank taxon ID numbers associated with species and subspecies within Aves.

"countgb.pl" - Perl script that reads in "taxonomy_result.txt" and "nucl_gb.accession2taxid" (from D5) and outputs the number of GenBank sequences associated with each avian species or subspecies ID number.
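
The counting step can be sketched in Python as a single streaming pass over the accession table. This is an illustration rather than the authors' script. It assumes the standard NCBI accession2taxid layout (tab-separated columns `accession`, `accession.version`, `taxid`, `gi`, with a header line). The function name and the sample rows are hypothetical.

```python
from collections import Counter


def count_sequences(lines, wanted_taxids):
    """Count accessions per taxon ID, restricted to the wanted IDs."""
    wanted = set(wanted_taxids)
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the header row.
        if len(fields) < 3 or fields[0] == "accession":
            continue
        taxid = fields[2]  # third column holds the taxon ID (assumed layout)
        if taxid in wanted:
            counts[taxid] += 1
    return counts


# Hypothetical sample rows in the assumed column layout.
sample_lines = [
    "accession\taccession.version\ttaxid\tgi\n",
    "AB000001\tAB000001.1\t1001\t1\n",
    "AB000002\tAB000002.1\t1001\t2\n",
    "AB000003\tAB000003.1\t9999\t3\n",
]
print(count_sequences(sample_lines, {"1001", "1002"}))

# For the real file, the lines can be streamed without decompressing to disk:
#   import gzip
#   with gzip.open("nucl_gb.accession2taxid.gz", "rt") as handle:
#       counts = count_sequences(handle, avian_taxids)
```

Holding only the set of avian taxon IDs and the counters in memory keeps the pass practical for a multi-gigabyte input.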

D7:
"SraResultInfo.csv" - CSV file that summarizes the data from each run in the NCBI SRA database associated with an Aves taxon. The 28th column has the GenBank taxon ID associated with the SRA run. This information was downloaded from NCBI on August 1, 2021.
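
Because the taxon ID sits in the 28th column (index 27 when counting from zero), a per-taxon tally of SRA runs might be computed as in the Python sketch below. The function name is hypothetical, and the presence of a single header row is an assumption about the file, not something stated in these notes.

```python
import csv
import io
from collections import Counter

TAXID_COLUMN = 28  # 1-based position given in the file description


def count_runs_per_taxon(handle, has_header=True):
    """Count SRA runs per taxon ID from a CSV file handle."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # assumption: the first row holds column names
    counts = Counter()
    for row in reader:
        # Convert the 1-based column number to a 0-based list index.
        if len(row) >= TAXID_COLUMN and row[TAXID_COLUMN - 1]:
            counts[row[TAXID_COLUMN - 1]] += 1
    return counts


def make_row(taxid):
    """Build a hypothetical 28-column CSV row ending in the taxon ID."""
    return ",".join(["x"] * 27 + [taxid])


header = ",".join(f"col{i}" for i in range(1, 29))
sample = io.StringIO("\n".join([header, make_row("1001"),
                                make_row("1001"), make_row("1002")]))
print(count_runs_per_taxon(sample))
```

With the real file, `open("SraResultInfo.csv", newline="")` would replace the in-memory sample.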

D8:
"genome_result.txt" - text file with a summary of the genome files in NCBI associated with taxa within Aves. This file was downloaded on September 5, 2021. The taxon name is next to the number of each entry.

"getnames.pl" - Perl script that reads in "genome_result.txt" and outputs a list of avian taxa with genome files and the number of genome files associated with each taxon.

D9:
"MacaulayLibrary_MediaSummary_April_2021.csv" – CSV file summarizing Macaulay Library audio recordings and GenBank nucleotide sequences associated with eBird/Clements 2019 names (downloaded April 2021)

D10:
"Xeno-canto_MediaSummary_October2020.csv" - CSV file summarizing Xeno-canto audio recordings and GenBank nucleotide sequences associated with eBird/Clements 2019 names (downloaded October 2020)

D11:
"GenBank_eBird/Clements2019_taxonomic_reconciliation_12Nov2021.csv" - CSV file reconciling GenBank TaxIDs with eBird/Clements 2019 taxonomy
 
D12:
"TaxonomicReconciliation_IUCNstatus.csv" – CSV file reconciling GenBank TaxIDs with eBird/Clements 2019 taxonomy, with respect to IUCN status

D13:
"Taxonomic_reconciliation_related_to_geographic_realm.csv" - CSV file with reconciliation status related to geographic realms

D14:
"TaxonomicReconciliation_Xeno-canto.csv" – CSV file reconciling the eBird/Clements 2019 taxonomy with Xeno-canto, which uses the IOC taxonomy

Funding provided by: Villum Fonden
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100008398
Award Number: 25925

Funding provided by: National Science Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000001
Award Number: DEB-1655683

Files

Files (2.1 GB)

Checksum Size
md5:7666b1d68e82b6be218e7ba6472aabfe 501.8 kB
md5:9bd32822c794f8057821763b7e5f0b8a 1.8 MB
md5:0f2e36aef8925b2c1d3f5e6c2f5aff1e 2.1 MB
md5:c23195877c7b97fd5929a0b41be470a5 1.5 MB
md5:546fe64cc8e484a42c5a260909dd0cea 1.3 MB
md5:7770acf74d0c5867bda91bbc25341d12 259.7 kB
md5:2d3e75c9109837850db27700b8087220 522.2 kB
md5:b46709ebd72bacb1de37841ae82fa228 621 Bytes
md5:65261b0b9070898770a49b434fdecc9c 566.3 kB
md5:bcf748620707e2065cf8066e154c544c 100.8 MB
md5:d6e9c13dc00ee317786ba8dec034a047 1.2 kB
md5:6814698405b0cd4f639fa49b205b6182 4.9 MB
md5:82633b1505ffd1ae470aeb89a52467ca 2.0 GB
md5:9db1e99a8d4ca6e37ed51f76dd74683a 582 Bytes
md5:45703cd42a190d174114e9442b59d322 123.3 kB
md5:30fa64a698b6ca57bb99ef6738844952 44.5 MB
md5:c39fc79cb45eb754e976e7ef6701e66b 65.5 kB
md5:a4b32deed583d8c880c8d6879c1a2377 380 Bytes
md5:b719b430f865c66f6f61768999f43558 501.7 kB
md5:6299064839be19ac9b65b215b4fa742f 4.8 kB

Additional details