There is a newer version of this record available.

Dataset Open Access

DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea

Ali Alishum

Contact person(s)
Ali Alishum
Data curator(s)
Greenfield Paul
Other(s)
Seersholm Frederik
Researcher(s)
Christophersen Claus

This version was updated primarily because we recently updated the RefSeq+RDP database, which now also includes mitochondrial and eukaryotic 16S rRNA sequences. I also decided to include the file formats required by the addSpecies command in DADA2. This command searches the database at 100% identity and can return either the best hit or multiple hits for your amplicon. I recommend it if your amplicons cover one or two regions of the 16S rRNA gene.
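To make the idea of a 100% identity search concrete, here is a minimal conceptual sketch in Python. It is not DADA2's implementation: the function names, the `(label, sequence)` reference layout, and the choice to return nothing when several species match in single-hit mode are all assumptions made for illustration. It also ignores reverse complements. Consult the DADA2 documentation for the actual behaviour of addSpecies.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical reference layout: (species label, reference sequence) pairs.
Reference = Tuple[str, str]


def exact_species_hits(query: str, references: Sequence[Reference]) -> List[str]:
    """Return the sorted, de-duplicated labels of references that contain
    the query as an exact (100% identity) substring."""
    q = query.upper()
    return sorted({label for label, seq in references if q in seq.upper()})


def assign_species(
    query: str,
    references: Sequence[Reference],
    allow_multiple: bool = False,
) -> Optional[str]:
    """Illustrative policy (an assumption, not DADA2's code):
    - no exact hit           -> None
    - exactly one species    -> that species
    - several species        -> joined with "/" if allow_multiple, else None
    """
    hits = exact_species_hits(query, references)
    if not hits:
        return None
    if len(hits) == 1:
        return hits[0]
    return "/".join(hits) if allow_multiple else None
```

A short amplicon is more likely to match several species exactly, which is why the multiple-hit mode can be informative for single-region data.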

These two combined bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the "assignTaxonomy" command within the DADA2 pipeline. The data were converted to suit the DADA2 format by Alishum Ali.

  1. RefSeq+RDP: This database contains 22433 bacterial, 1055 archaeal and 99 eukaryotic full-length 16S rRNA gene sequences. It was compiled by Paul Greenfield on 06/11/2020, predominantly from the NCBI RefSeq 16S rRNA database (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), and was supplemented with extra sequences from the RDP database (https://rdp.cme.msu.edu/misc/resources.jsp).
  2. Genome Taxonomy Database (GTDB): The new version of our DADA2-formatted GTDB reference sequences now contains 21965 bacterial and 1126 archaeal full-length 16S rRNA gene sequences. If you wonder why fewer species have 16S rRNA sequences than are present in GTDB, that is because some metagenome-assembled genomes (MAGs) lack the 16S rRNA gene, so no sequence can be extracted from them. The database was downloaded from https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/ on 19/07/2020. Please read the release notes and file descriptions.

The formatting for DADA2 was done using a simple awk/bash script. The script takes as input a fasta file and a tab-delimited taxonomy file (slightly edited to remove special characters) and outputs a fasta file with all 7 taxonomy ranks separated by ";", as required for DADA2 compatibility. Additionally, we have concatenated the unique sequence ID, whether an NCBI/RDP or GTDB ID, to the species entry (with the "." replaced by "_"). We see this as an important QC step to highlight the issues/confidence associated with short-read taxonomy assignment at the finer rank levels.
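The conversion described above can be sketched in Python. This is not the authors' awk/bash script, only a minimal illustration under stated assumptions: the taxonomy file is assumed to hold the sequence ID in column 1 followed by the 7 ranks, the ID is appended to the species name in parentheses (the real separator is not stated in this record), and a trailing ";" is added to mirror the header layout shown in the DADA2 assignTaxonomy documentation. All function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def parse_taxonomy(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map sequence ID -> 7 ranks. Assumed layout: ID<TAB>rank1<TAB>...<TAB>rank7."""
    taxonomy: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 8:
            continue  # skip blank or malformed rows
        taxonomy[fields[0]] = fields[1:]
    return taxonomy


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence); the ID is the first word of the header."""
    seq_id: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            words = line[1:].split()
            seq_id, chunks = (words[0] if words else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def dada2_header(seq_id: str, ranks: List[str]) -> str:
    """Build a ';'-separated header with the ID attached to the species rank."""
    safe_id = seq_id.replace(".", "_")  # "." replaced by "_", as in the text
    # Parentheses around the ID are an assumption made for this sketch.
    labelled = ranks[:6] + [f"{ranks[6]}({safe_id})"]
    return ">" + ";".join(labelled) + ";"


def format_for_dada2(
    fasta_lines: Iterable[str], taxonomy_lines: Iterable[str]
) -> Iterator[str]:
    """Yield output fasta lines; sequences without a taxonomy row are skipped."""
    taxonomy = parse_taxonomy(taxonomy_lines)
    for seq_id, seq in parse_fasta(fasta_lines):
        if seq_id in taxonomy:
            yield dada2_header(seq_id, taxonomy[seq_id])
            yield seq
```

Keeping the ID on the species entry means that every species-level assignment can be traced back to the exact reference record that produced it.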

Also, this update includes two other files that you can use with the assignTaxonomy and addSpecies commands in DADA2.

The bash script can be provided on request.
Files (72.6 MB)

Edited_GTDBr95_taxonomyFile.txt (28.3 MB), md5:b87abc2ac6480c6882369a7c4cb2f03f
Edited_RefSeqRDPv16_taxonomyFile.txt (6.4 MB), md5:6b42fed0e2cb0a52ebde8fe5253b7a02
GTDB_bac120_arc122_ssu_r95_fullTaxo.fa.gz (6.7 MB), md5:e878604fcab569bd34e2bb1780d8a712
GTDB_bac120_arc122_ssu_r95_Genus.fa.gz (6.4 MB), md5:9dc6787bc498dbbf13d65b061a8c9390
GTDB_bac120_arc122_ssu_r95_Species.fa.gz (6.4 MB), md5:60e03e50fd99a08f36621c3e7fcd2c7d
RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz (6.5 MB), md5:42ffdc134b3751f29fa24e42969f0572
RefSeq_16S_6-11-20_RDPv16_Genus.fa.gz (5.9 MB), md5:53aac0449c41db387d78a3c17b06ad07
RefSeq_16S_6-11-20_RDPv16_Species.fa.gz (6.0 MB), md5:71b1aa865bd2cc63112f3ffcdb78a816

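The md5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the assumption that all files sit in one local directory are illustrative only.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

# Checksums copied from the file listing of this record.
EXPECTED_MD5: Dict[str, str] = {
    "Edited_GTDBr95_taxonomyFile.txt": "b87abc2ac6480c6882369a7c4cb2f03f",
    "Edited_RefSeqRDPv16_taxonomyFile.txt": "6b42fed0e2cb0a52ebde8fe5253b7a02",
    "GTDB_bac120_arc122_ssu_r95_fullTaxo.fa.gz": "e878604fcab569bd34e2bb1780d8a712",
    "GTDB_bac120_arc122_ssu_r95_Genus.fa.gz": "9dc6787bc498dbbf13d65b061a8c9390",
    "GTDB_bac120_arc122_ssu_r95_Species.fa.gz": "60e03e50fd99a08f36621c3e7fcd2c7d",
    "RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz": "42ffdc134b3751f29fa24e42969f0572",
    "RefSeq_16S_6-11-20_RDPv16_Genus.fa.gz": "53aac0449c41db387d78a3c17b06ad07",
    "RefSeq_16S_6-11-20_RDPv16_Species.fa.gz": "71b1aa865bd2cc63112f3ffcdb78a816",
}


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(
    directory: Union[str, Path], expected: Dict[str, str] = EXPECTED_MD5
) -> Dict[str, bool]:
    """Return {file name: True if present and checksum matches, else False}."""
    results: Dict[str, bool] = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results
```

Calling `verify_downloads("path/to/downloads")` (a placeholder path) returns a per-file report; any `False` entry suggests a missing or corrupted file that should be downloaded again.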
  • Parks, D. H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology.

  • Cole, J. R., Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell, Y. Sun, C. T. Brown, A. Porras-Alfaro, C. R. Kuske, and J. M. Tiedje. 2014. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucl. Acids Res. 42(Database issue):D633-D642; doi: 10.1093/nar/gkt1244 [PMID: 24288368]

  • NCBI 16S RefSeq Nucleotide sequence records: https://www.ncbi.nlm.nih.gov/nuccore?term=33175%5BBioProject%5D+OR+33317%5BBioProject%5D
