This version keeps pace with the improvements in GTDB release 202, which added about 50% more 16S rRNA gene sequences. For statistics on the update, see https://gtdb.ecogenomic.org/stats/r202.
There has been no change to the RDP-RefSeq reference database.
If you are concerned about contamination in 16S rRNA gene sequences extracted from MAGs, I suggest contacting the GTDB curators directly; that issue is outside my role, which is limited to formatting these resources for use with DADA2. Another concern that was raised is the orientation of the database sequences. To work around it, set tryRC = TRUE in the assignTaxonomy command in DADA2, which also tries your ASVs in the reverse-complement orientation.
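To illustrate why orientation matters, here is a minimal Python sketch of a reverse-complement check. It is not how DADA2 classifies sequences; it uses plain substring matching on toy sequences only to show that a read in the opposite orientation matches the reference once it is reverse-complemented. The function names and the toy sequences are hypothetical.

```python
# Complement table covering the standard IUPAC nucleotide codes (upper and lower case).
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orientation(asv, reference):
    """Report which orientation of `asv` occurs in `reference`, if any.

    Toy stand-in for orientation checking: exact substring search only.
    """
    if asv in reference:
        return "forward"
    if reverse_complement(asv) in reference:
        return "reverse"
    return None


# Toy data (not real 16S sequences).
reference = "ACGTTGCAAGGCTA"
asv_forward = "TTGCAAGG"
asv_reversed = reverse_complement(asv_forward)

print(find_orientation(asv_forward, reference))
print(find_orientation(asv_reversed, reference))
```

An ASV that only matches in the "reverse" orientation is the case that tryRC = TRUE is meant to handle.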
This version was updated primarily because we recently updated the RefSeq+RDP database and also included mitochondrial and eukaryotic 16S rRNA sequences. In addition, I decided to include the files in the format required by the addSpecies command in DADA2. This command searches the database at 100% identity and can report either the best hit or multiple hits for each amplicon. I recommend it if your amplicons cover one or two regions of the 16S rRNA gene.
These two combined bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the "assignTaxonomy" command in the DADA2 pipeline. The data were converted to the DADA2 format by Alishum Ali.
The conversion to DADA2 format was done with simple awk and bash scripts. The script takes a FASTA file and a tab-delimited taxonomy file (lightly edited to remove special characters) as input and outputs a FASTA file whose headers carry all 7 taxonomy ranks separated by ";", as required for DADA2 compatibility. Additionally, we have appended the unique sequence ID, whether NCBI/RDP or GTDB, to the species entry (with each "." replaced by "_"). We see this as an important QC step that highlights the issues and confidence associated with short-read taxonomy assignment at the finer rank levels.
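The formatting step described above can be sketched in Python as follows. This is an illustration only, not the original awk scripts. It assumes a taxonomy file with the sequence ID in the first column followed by seven tab-separated rank columns, and it assumes the ID is attached to the species entry in parentheses; both the column layout and the parentheses are assumptions, and the record shown is a made-up toy entry.

```python
import io

N_RANKS = 7  # kingdom/domain through species


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from a FASTA file handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def read_taxonomy(handle):
    """Map sequence ID -> list of 7 ranks (assumed layout: ID<TAB>rank1<TAB>...<TAB>rank7)."""
    taxonomy = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 1 + N_RANKS:
            continue  # skip rows that do not have exactly 7 ranks
        taxonomy[fields[0]] = fields[1:]
    return taxonomy


def dada2_header(seq_id, ranks):
    """Build a header with 7 ';'-separated ranks and the ID attached to the species."""
    safe_id = seq_id.replace(".", "_")  # "." replaced by "_", as described in the text
    labelled = ranks[:-1] + [f"{ranks[-1]}({safe_id})"]  # parentheses are an assumption
    return ">" + ";".join(labelled)


def format_for_dada2(fasta_handle, taxonomy_handle):
    """Return FASTA text with DADA2-style headers for sequences that have taxonomy."""
    taxonomy = read_taxonomy(taxonomy_handle)
    records = []
    for seq_id, seq in read_fasta(fasta_handle):
        if seq_id in taxonomy:
            records.append(dada2_header(seq_id, taxonomy[seq_id]) + "\n" + seq)
    return "\n".join(records) + "\n"


# Toy input (hypothetical ID and sequence).
toy_fasta = io.StringIO(">TOY_0001.1 example record\nACGTA\nCGTAC\n")
toy_taxonomy = io.StringIO(
    "TOY_0001.1\tBacteria\tProteobacteria\tGammaproteobacteria\t"
    "Enterobacterales\tEnterobacteriaceae\tEscherichia\tEscherichia coli\n"
)
formatted = format_for_dada2(toy_fasta, toy_taxonomy)
print(formatted, end="")
```

Sequences without a taxonomy row are dropped in this sketch; whether the original scripts do the same is not stated.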
Also, this update includes two other files that you can use with the assignTaxonomy and addSpecies commands in DADA2.
Parks, D. H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology.
Cole, J. R., Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell, Y. Sun, C. T. Brown, A. Porras-Alfaro, C. R. Kuske, and J. M. Tiedje (2014). "Ribosomal Database Project: data and tools for high throughput rRNA analysis." Nucl. Acids Res. 42(Database issue):D633-D642; doi: 10.1093/nar/gkt1244 [PMID: 24288368]
NCBI 16S RefSeq Nucleotide sequence records: https://www.ncbi.nlm.nih.gov/nuccore?term=33175%5BBioProject%5D+OR+33317%5BBioProject%5D