3188334
doi
10.5281/zenodo.3188334
oai:zenodo.org:3188334
user-rmg
Ali Alishum
Trend Laboratory, Curtin University of Technology
Seersholm Frederik
Trend Laboratory, Curtin University of Technology
Greenfield Paul
Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Christophersen Claus
WA Human Microbiome Collaboration Centre (WAHMCC), Trend Laboratory, Curtin University
DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea
Ali Alishum
Trend Laboratory, Curtin University of Technology
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
DADA2 format
16S rRNA
Bacterial
Archaeal
<p>These two combined bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the "assignTaxonomy" command in the DADA2 pipeline.</p>
<ol>
<li>RefSeq+RDP: This database contains 14676 bacterial and 660 archaeal full-length 16S rRNA gene sequences. It was compiled on 14/05/2018, predominantly from the NCBI RefSeq 16S rRNA database (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), and was supplemented with additional sequences from the RDP database (https://rdp.cme.msu.edu/misc/resources.jsp).</li>
<li>Genome Taxonomy Database (GTDB): Our DADA2-formatted GTDB reference sequence set contains 20486 bacterial and 1073 archaeal full-length 16S rRNA gene sequences. The database was downloaded from <a href="https://t.co/bIjprJsYUh">http://gtdb.ecogenomic.org/downloads</a> on 20/11/2018.</li>
</ol>
<p>The databases were converted to DADA2 format using a locally written Python 2.7 script. The script takes as input a taxonomy .txt file and a FASTA file, as provided by the creators of the source databases, and matches the two files on a unique sequence identifier present in both. It then outputs a FASTA file with all 7 taxonomic ranks separated by ";", as required for DADA2 compatibility. Additionally, we have concatenated the unique sequence ID (NCBI, RDP, or GTDB) to the species entry. We see this as an important QC step that highlights the issues and confidence associated with short-read taxonomy assignment at the finer rank levels.</p>
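The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' script (which is Python 2.7 and available only on request). The sketch targets Python 3 and rests on several assumptions that the record does not confirm: the taxonomy file holds one "ID<TAB>rank1;...;rank7" entry per line, the FASTA header starts with the same ID, and the sequence ID is appended to the species entry in parentheses. The function names parse_taxonomy, parse_fasta, and to_dada2 are hypothetical.

```python
def parse_taxonomy(lines):
    """Map sequence ID -> list of 7 ranks.

    Assumed input layout (hypothetical): 'ID<TAB>rank1;rank2;...;rank7'.
    """
    taxonomy = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq_id, lineage = line.split("\t", 1)
        ranks = [rank.strip() for rank in lineage.strip(";").split(";")]
        if len(ranks) != 7:
            continue  # skip entries that do not carry all 7 ranks
        taxonomy[seq_id] = ranks
    return taxonomy


def parse_fasta(lines):
    """Yield (sequence ID, sequence) pairs; the ID is the first header token."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id = line[1:].split()[0]
            chunks = []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def to_dada2(tax_lines, fasta_lines):
    """Return output FASTA lines with 7 ';'-separated ranks as the header."""
    taxonomy = parse_taxonomy(tax_lines)
    out = []
    for seq_id, seq in parse_fasta(fasta_lines):
        ranks = taxonomy.get(seq_id)
        if ranks is None:
            continue  # sequence has no matching taxonomy entry
        # Assumed convention: append the sequence ID to the species entry.
        species = "{}({})".format(ranks[6], seq_id)
        out.append(">" + ";".join(ranks[:6] + [species]))
        out.append(seq)
    return out
```

Called with the lines of a taxonomy file and a FASTA file, to_dada2 returns the header and sequence lines of the reformatted FASTA. Sequences without a matching taxonomy entry are dropped silently here; a production script would more likely log them.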
The RefSeq+RDP database was updated to fix a bug in which a quotation mark was wrongly placed in front of some of the species names. A file listing all the affected species names has been uploaded for review. The bug should not have affected any taxonomic assignments, but it might have caused problems when reading the file into R.
The Python script can be provided on request.
Zenodo
2019-01-16
info:eu-repo/semantics/other
2541238
user-rmg
Version 2
1655452957.500523
4010165
md5:3a1e9c128c937e5f0c67a86a4d64868f
https://zenodo.org/records/3188334/files/RefSeq-RDP16S_v3_May2018.fa.gz
7114483
md5:307c9d79fb7e167b696fad16f698eb57
https://zenodo.org/records/3188334/files/GTDB_bac-arc_ssu_r86.fa.gz
13051
md5:13cf96c338c6f56fbeab06f8cdf7e423
https://zenodo.org/records/3188334/files/Version2AffectedSeqs.txt
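A downloaded file can be checked against the MD5 checksums listed above. The following Python sketch uses only the standard library. The pairing of checksum to file name is read from the order of the entries in this record, and the names md5_of_file, EXPECTED_MD5, and verify are hypothetical. It assumes the files sit in the current directory under their original names.

```python
import hashlib
import os

# Checksums as listed in this record, keyed by file name.
EXPECTED_MD5 = {
    "RefSeq-RDP16S_v3_May2018.fa.gz": "3a1e9c128c937e5f0c67a86a4d64868f",
    "GTDB_bac-arc_ssu_r86.fa.gz": "307c9d79fb7e167b696fad16f698eb57",
    "Version2AffectedSeqs.txt": "13cf96c338c6f56fbeab06f8cdf7e423",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if os.path.exists(name):
            print(name, "OK" if verify(name, expected) else "MISMATCH")
        else:
            print(name, "not found")
```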
public
10.5281/zenodo.2541238
isVersionOf
doi