DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea

Ali, Alishum

doi:10.5281/zenodo.13984843

Published October 24, 2024 | Version 4.5

Dataset Open

Ali, Alishum (Contact person)¹

1. Trend Laboratory, Curtin University of Technology

Contributors

1. Trend Laboratory, Curtin University of Technology
2. Commonwealth Scientific and Industrial Research Organisation (CSIRO)
3. WA Human Microbiome Collaboration Centre (WAHMCC), Trend Laboratory, Curtin University

This version keeps pace with the improvements to, and the increase in, the 16S rRNA (SSU) gene sequences added in GTDB release 220. Statistics for the release are available at https://gtdb.ecogenomic.org/stats/r220.

There has been no change to the RDP-RefSeq reference database; please use the previous versions of this record for it.

If you have concerns about contamination in 16S rRNA gene sequences extracted from MAGs, I suggest contacting the GTDB curators directly. That issue is outside my role; these resources are formatted for DADA2 usage only.

Another concern that was raised is the orientation of the database sequences. To work around this, use the tryRC = TRUE argument of the assignTaxonomy command in DADA2, which also tries the reverse complement of each of your ASVs.
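To make the orientation issue concrete, here is a minimal Python sketch of what a reverse complement is. It only illustrates the idea behind tryRC = TRUE and is not DADA2 code; the function name and the example sequence are hypothetical.

```python
# Illustrative only: this is not how DADA2 implements tryRC.
# Characters outside A/C/G/T (for example N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


asv = "ACGGTTAC"  # hypothetical ASV fragment
print(reverse_complement(asv))  # prints GTAACCGT
```

An ASV sequenced from the opposite strand will only match a reference in its reverse-complemented form, which is why checking both orientations helps when the database orientation is mixed or unknown.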

The bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the assignTaxonomy command in the DADA2 pipeline. The data were converted to suit the DADA2 format by Alishum Ali.

Genome Taxonomy Database (GTDB): The new version of our DADA2-formatted GTDB reference sequences contains 58102 bacterial and 3672 archaeal full-length 16S rRNA gene sequences. There are fewer 16S rRNA sequences than species in GTDB because some metagenome-assembled genomes (MAGs) lack the 16S gene, so no sequence can be extracted from them. The database was downloaded from https://data.ace.uq.edu.au/public/gtdb/data/releases/ on 24/10/2024. Please read the release notes and file descriptions.

The conversion to DADA2 format was done with simple awk/bash scripts. The script takes as input a FASTA file and a tab-delimited taxonomy file (slightly edited to remove special characters) and outputs a FASTA file whose headers carry all 7 taxonomy ranks separated by ";", as required for DADA2 compatibility. Additionally, we have concatenated the unique GTDB sequence ID to the species entry (with the "." replaced by "_"). We see this as an important QC step that highlights the issues and limited confidence associated with short-read taxonomy assignment at the finer ranks.
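The header conversion described above can be sketched in Python as follows. This is not the author's awk script. It assumes GTDB-style rank prefixes (d__, p__, ..., s__) that are stripped, an ID appended to the species in parentheses, and a trailing ";". The exact header layout of the released files may differ, and the function name and example ID are hypothetical.

```python
# A sketch under stated assumptions, not the script used for this dataset.
RANK_PREFIXES = ("d__", "p__", "c__", "o__", "f__", "g__", "s__")


def dada2_header(seq_id: str, gtdb_taxonomy: str) -> str:
    """Build a DADA2-style FASTA header from a sequence ID and a 7-rank taxonomy."""
    ranks = gtdb_taxonomy.strip().split(";")
    if len(ranks) != len(RANK_PREFIXES):
        raise ValueError(f"expected 7 ranks, got {len(ranks)}: {gtdb_taxonomy!r}")
    names = []
    for prefix, rank in zip(RANK_PREFIXES, ranks):
        # Assumption: drop the GTDB rank prefix if it is present.
        names.append(rank[len(prefix):] if rank.startswith(prefix) else rank)
    # Append the sequence ID to the species entry, replacing "." with "_".
    # Assumption: the ID is wrapped in parentheses.
    names[-1] = f"{names[-1]}({seq_id.replace('.', '_')})"
    return ">" + ";".join(names) + ";"


example_taxonomy = (
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;"
    "f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli"
)
print(dada2_header("RS_GCF_000000000.1", example_taxonomy))  # hypothetical ID
```

Carrying the sequence ID in the species field means that two references with the same species name stay distinguishable after assignment, which is the QC purpose mentioned above.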

This update also includes two other files that can be used with the assignTaxonomy and addSpecies commands in DADA2.

Notes

The Bash script can be provided on request.

Files

Files (53.4 MB)

Name	MD5	Size
GTDB_bac120_arc53_ssu_r220_fullTaxo.fa.gz	7b9bd354ee5410dcdc9181a59b036317	18.5 MB
GTDB_bac120_arc53_ssu_r220_genus.fa.gz	d26830c27a30de9fb1d22845127a02c9	17.5 MB
GTDB_bac120_arc53_ssu_r220_species.fa.gz	ca4c44f0cbea6df22e638d88937e748d	17.4 MB
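
After downloading, the files can be checked against the MD5 checksums listed above. The following is a minimal Python sketch using only the standard library. The helper names are hypothetical, and it assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "GTDB_bac120_arc53_ssu_r220_fullTaxo.fa.gz": "7b9bd354ee5410dcdc9181a59b036317",
    "GTDB_bac120_arc53_ssu_r220_genus.fa.gz": "d26830c27a30de9fb1d22845127a02c9",
    "GTDB_bac120_arc53_ssu_r220_species.fa.gz": "ca4c44f0cbea6df22e638d88937e748d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each file name to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.exists() else None
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.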

Additional details

Created: 2023-12-19

Parks, D. H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology.
Cole, J. R., Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell, Y. Sun, C. T. Brown, A. Porras-Alfaro, C. R. Kuske, and J. M. Tiedje. 2014. Ribosomal Database Project: data and tools for high throughput rRNA analysis Nucl. Acids Res. 42(Database issue):D633-D642; doi: 10.1093/nar/gkt1244 [PMID: 24288368]
NCBI 16S RefSeq Nucleotide sequence records: https://www.ncbi.nlm.nih.gov/nuccore?term=33175%5BBioProject%5D+OR+33317%5BBioProject%5D
