{
  "DOI": "10.5281/zenodo.3266798",
  "abstract": "These two combined bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the \"assignTaxonomy\" command in the DADA2 pipeline.\n\nRefSeq+RDP: This database contains 14676 bacterial and 660 archaeal full 16S rRNA gene sequences. It was compiled on 14/05/2018, predominantly from the NCBI RefSeq 16S rRNA database (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), and was supplemented with extra sequences from the RDP database (https://rdp.cme.msu.edu/misc/resources.jsp).\n\nGenome Taxonomy Database (GTDB): The new version of our DADA2-formatted GTDB reference sequences now contains 17460 bacterial and 873 archaeal full 16S rRNA gene sequences. The reduction in the number of species (bacteria = 23,458 species; archaea = 1,248 species) was, as far as I understand, due to a new approach taken by GTDB, in which species were clustered according to their genome nucleotide identity and a representative species annotation was given to all members of the same cluster. There are fewer species with 16S rRNA sequences because some metagenome-assembled genomes (MAGs) lack the 16S gene, so no sequence can be extracted from them. I believe the r89 release notes report higher numbers because species identification there is not limited to the 16S gene; as the notes mention, other single-copy genes are used for that purpose. The database was downloaded from https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/ on 03/07/2019. Please read the release notes and file descriptions.\n\nThe databases were converted to DADA2 format using locally written Python 2.7 and bash scripts. 
The script takes as input a taxonomy .tsv file and a FASTA file, as provided by the creators of the core databases, and matches the two files on a unique sequence identifier present in both. It then outputs a FASTA file with all 7 taxonomy ranks separated by \";\", as required for DADA2 compatibility. Additionally, we have concatenated the unique sequence ID (NCBI, RDP, or GTDB) to the species entry. We see this as an important QC step that highlights the issues and confidence associated with short-read taxonomy assignment at the finer rank levels.",
  "author": [
    {
      "family": "Ali Alishum"
    }
  ],
  "id": "3266798",
  "issued": {
    "date-parts": [
      [
        "2019",
        "07",
        "03"
      ]
    ]
  },
  "language": "eng",
  "publisher": "Zenodo",
  "title": "DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea",
  "type": "dataset",
  "version": "Version 2"
}