{
  "DOI": "10.5281/zenodo.3188334",
  "abstract": "These two combined bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the \"assignTaxonomy\" command in the DADA2 pipeline.\n\nRefSeq+RDP: This database contains 14676 bacterial and 660 archaeal full-length 16S rRNA gene sequences. It was compiled on 14/05/2018, predominantly from the NCBI RefSeq 16S rRNA database (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), and supplemented with additional sequences from the RDP database (https://rdp.cme.msu.edu/misc/resources.jsp).\n\nGenome Taxonomy Database (GTDB): Our DADA2-formatted GTDB reference sequence set contains 20486 bacterial and 1073 archaeal full-length 16S rRNA gene sequences. The database was downloaded from http://gtdb.ecogenomic.org/downloads on 20/11/2018.\n\nThe databases were converted to DADA2 format using a locally written Python 2.7 script. The script takes as input a taxonomy .txt file and a FASTA file, as provided by the creators of the core databases, and matches the two files on a unique sequence identifier present in both. It then outputs a FASTA file with all 7 taxonomy ranks separated by \";\", as required for DADA2 compatibility. Additionally, we have concatenated the unique sequence ID (NCBI, RDP, or GTDB) to the species entry. We see this as an important QC step that highlights the issues and confidence associated with short-read taxonomy assignment at the finer rank levels.",
  "author": [
    {
      "family": "Ali Alishum"
    }
  ],
  "id": "3188334",
  "issued": {
    "date-parts": [
      [
        "2019",
        "01",
        "16"
      ]
    ]
  },
  "language": "eng",
  "publisher": "Zenodo",
  "title": "DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea",
  "type": "dataset",
  "version": "Version 2"
}