Dataset Open Access

DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea

Ali Alishum


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="999" ind1="C" ind2="5">
    <subfield code="x">Parks, D. H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology.</subfield>
  </datafield>
  <datafield tag="999" ind1="C" ind2="5">
    <subfield code="x">Cole, J. R., Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell, Y. Sun, C. T. Brown, A. Porras-Alfaro, C. R. Kuske, and J. M. Tiedje. 2014. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucl. Acids Res. 42(Database issue):D633-D642; doi: 10.1093/nar/gkt1244 [PMID: 24288368]</subfield>
  </datafield>
  <datafield tag="999" ind1="C" ind2="5">
    <subfield code="x">NCBI 16S RefSeq Nucleotide sequence records: https://www.ncbi.nlm.nih.gov/nuccore?term=33175%5BBioProject%5D+OR+33317%5BBioProject%5D</subfield>
  </datafield>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">aig</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">DADA2 format</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">16S rRNA</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Bacterial</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Archaeal</subfield>
  </datafield>
  <controlfield tag="005">20191101071105.0</controlfield>
  <datafield tag="500" ind1=" " ind2=" ">
    <subfield code="a">The RefSeq+RDP database was updated to fix a bug that placed a stray quotation mark in front of some of the species names. A file listing all the affected species names has been uploaded for review. This should not affect any assignments, but it might have caused problems when reading the file into R.

The GTDB database was updated because a new release with taxonomy changes has been made available. The core GTDB team advises everyone using the GTDB to move to release 89. I have also formatted all the 16S rRNA sequences in GTDB r89 that passed QC. If anyone needs them, I can share them outside of this record, because I do not want to confuse anyone. You can also download the unformatted file "ssu_r89.tsv" from the GTDB website linked in the description.

The Python script can be provided on request.</subfield>
  </datafield>
  <controlfield tag="001">3266798</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Trend Laboratory, Curtin University of Technology</subfield>
    <subfield code="0">(orcid)0000-0003-4498-2870</subfield>
    <subfield code="4">prc</subfield>
    <subfield code="a">Ali Alishum</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Trend Laboratory, Curtin University of Technology</subfield>
    <subfield code="0">(orcid)0000-0003-2217-3247</subfield>
    <subfield code="4">oth</subfield>
    <subfield code="a">Seersholm Frederik</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Commonwealth Scientific and Industrial Research Organisation (CSIRO)</subfield>
    <subfield code="0">(orcid)0000-0003-4028-9243</subfield>
    <subfield code="4">cur</subfield>
    <subfield code="a">Greenfield Paul</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">WA Human Microbiome Collaboration Centre (WAHMCC), Trend Laboratory, Curtin University</subfield>
    <subfield code="0">(orcid)0000-0003-1591-5871</subfield>
    <subfield code="4">res</subfield>
    <subfield code="a">Christophersen Claus</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">7522880</subfield>
    <subfield code="z">md5:f9a75fd361d70b03483d64583fa7aaa2</subfield>
    <subfield code="u">https://zenodo.org/record/3266798/files/GTDB_bac120_arc122_ssu_r89.fa.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">4010165</subfield>
    <subfield code="z">md5:3a1e9c128c937e5f0c67a86a4d64868f</subfield>
    <subfield code="u">https://zenodo.org/record/3266798/files/RefSeq-RDP16S_v3_May2018.fa.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">13051</subfield>
    <subfield code="z">md5:13cf96c338c6f56fbeab06f8cdf7e423</subfield>
    <subfield code="u">https://zenodo.org/record/3266798/files/Version2AffectedSeqs.txt</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2019-07-03</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">user-zenodo</subfield>
    <subfield code="o">oai:zenodo.org:3266798</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Trend Laboratory, Curtin University of Technology</subfield>
    <subfield code="0">(orcid)0000-0003-4498-2870</subfield>
    <subfield code="a">Ali Alishum</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">DADA2 formatted 16S rRNA gene sequences for both bacteria &amp; archaea</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-zenodo</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">http://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;These two combined bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the &amp;quot;assignTaxonomy&amp;quot; command in the DADA2&amp;nbsp;pipeline.&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;RefSeq+RDP: This database contains 14676 bacterial &amp;amp; 660 archaeal full 16S rRNA gene sequences. It was compiled on 14/05/2018, predominantly from the NCBI RefSeq 16S rRNA database (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), and was supplemented with extra sequences from the RDP database (https://rdp.cme.msu.edu/misc/resources.jsp).&lt;/li&gt;
	&lt;li&gt;Genome Taxonomy Database (GTDB): The new version of our DADA2-formatted GTDB reference sequences now contains 17460 bacterial and 873 archaeal full 16S rRNA gene sequences. The reduction in the number of species (bac = 23,458 species &amp;amp; arc = 1,248 species) was, as far as I understand, due to a new approach taken by GTDB, in which species were clustered according to their genome nucleotide identity and a representative species annotation was given to all members of the same cluster. If you wonder why there are fewer species with 16S rRNA, that is because some metagenome-assembled genomes (MAGs) lack the 16S gene, so no sequence can be extracted from them. I believe the r89 release notes mention higher numbers because GTDB is not limited to the 16S gene for species identification; as the notes mention, it uses other single-copy genes for that purpose. The database was downloaded from (&lt;a href="https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/"&gt;https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/&lt;/a&gt;) on 03/07/2019. Please read the release notes and file descriptions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The databases were converted to DADA2 format using locally written Python 2.7 and bash scripts. The script takes as input a taxonomy .tsv file and a FASTA file as provided by the creators of the core databases, and matches the two files on a unique sequence identifier present in both. It then outputs a FASTA file with all 7 taxonomy ranks separated by &amp;quot;;&amp;quot; as required for DADA2 compatibility. Additionally, we have concatenated the unique sequence ID, be it an NCBI/RDP or GTDB ID, to the species entry. We see this as an important QC step to highlight the issues/confidence associated with short-read taxonomy assignment at the finer rank levels.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.2541238</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3266798</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
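
The description in the record above outlines the formatting step: a taxonomy .tsv file and a FASTA file are matched on a shared sequence identifier, and each sequence is written out with its 7 ranks joined by ";". The author's script is available only on request, so the following is not that script. It is a minimal Python 3 sketch of the same idea under stated assumptions: the TSV is assumed to hold the sequence ID in column 1 and a ";"-separated lineage in column 2, the FASTA ID is assumed to be the first token of each header, and the "species(ID)" form used to attach the ID to the species entry is a hypothetical convention. The function names are illustrative only.

```python
import csv
import io


def read_taxonomy(handle):
    """Return {sequence_id: [rank1, ..., rank7]} from a tab-separated file.

    Assumed layout: column 1 = sequence ID, column 2 = ';'-separated lineage.
    """
    taxonomy = {}
    # QUOTE_NONE keeps stray quotation marks as ordinary characters.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ranks = [rank.strip() for rank in row[1].split(";") if rank.strip()]
        if len(ranks) == 7:  # keep only complete 7-rank lineages
            taxonomy[row[0]] = ranks
    return taxonomy


def read_fasta(handle):
    """Yield (sequence_id, sequence); the ID is the first header token."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            tokens = line[1:].split()
            seq_id, chunks = (tokens[0] if tokens else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def to_dada2(taxonomy, records):
    """Yield FASTA entries whose header is the ';'-joined 7-rank lineage."""
    for seq_id, sequence in records:
        ranks = taxonomy.get(seq_id)
        if ranks is None:
            continue  # sequence has no matching taxonomy row
        # Hypothetical convention: attach the source ID to the species rank.
        lineage = ranks[:-1] + ["{}({})".format(ranks[-1], seq_id)]
        yield ">{}\n{}\n".format(";".join(lineage), sequence)


# Tiny in-memory example; the IDs and sequences are made up.
tsv = io.StringIO(
    "seq1\tBacteria;Firmicutes;Bacilli;Lactobacillales;"
    "Streptococcaceae;Streptococcus;Streptococcus mutans\n"
)
fasta = io.StringIO(">seq1 example record\nACGT\nTTGA\n>seq2\nGGGG\n")
entries = list(to_dada2(read_taxonomy(tsv), read_fasta(fasta)))
print("".join(entries), end="")
```

In this sketch, sequences without a taxonomy row (seq2 here) and lineages with fewer or more than 7 ranks are skipped silently; a real conversion would probably log them so they can be reviewed.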