Published November 23, 2017 | Version v1
Dataset Open

Database of 16S sequences from SILVA (r114), filtered, curated and annotated to be used easily by programs of taxonomic assignments

Description

The database used for the taxonomic assignment of reads generally comes from the SILVA database (http://www.arb-silva.de/). The logic behind this database is to use the information in order of quality, from the best to the worst. This is why the curated database was split into two parts: the [C] sequences, which are Complete in terms of taxonomy, and the [I] and [E] sequences, which are Incomplete and Environmental sequences, respectively.

Each sequence included in the database must follow a specific format that summarizes all the needed information (example below):
>[I]AACY020336309;Archaea(superkingdom);Euryarchaeota(phylum);Thermoplasmata(class);Thermoplasmatales(order);Marine_Group_II(no_rank);;marine_metagenome

This sequence is an incomplete one ([I]), with a specific accession number from NCBI, SILVA, or another database (AACY020336309). All taxonomic data are then separated by ';' characters, one field for each considered level (superkingdom, phylum, class, order, family, and genus). The species name comes last and is separated from the rest of the descriptive line by two ';' characters. Finally, the descriptive line must not contain special characters such as spaces. If one or several levels are unknown, this is indicated by 'no_rank'.
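
The descriptive line described above can be split into its parts with a few lines of Python. The following is only a minimal sketch, not part of the database tooling: the names `Header` and `parse_header` are hypothetical, and it assumes that the rank of each level is given in the last pair of parentheses of its field (so that a name such as Soil_Crenarchaeotic_Group(SCG) keeps its own parentheses).

```python
import re
from typing import List, NamedTuple, Tuple

# First field: optional '>', the category tag [C], [I] or [E], then the accession.
_PREFIX = re.compile(r"^>?\[(?P<category>[CIE])\](?P<accession>[^;]+)$")
# Taxonomic field: "Name(rank)"; the rank is read from the LAST parentheses (assumption).
_LEVEL = re.compile(r"^(?P<name>.+)\((?P<rank>[^()]+)\)$")


class Header(NamedTuple):  # hypothetical container, for illustration only
    category: str
    accession: str
    lineage: List[Tuple[str, str]]  # (name, rank) pairs, in the order they appear
    species: str


def parse_header(line: str) -> Header:
    """Parse one descriptive line of the database into its components."""
    # The species name follows the double ';' separator.
    left, sep, species = line.strip().partition(";;")
    if not sep:
        raise ValueError("missing ';;' before the species name")
    first, *fields = left.split(";")
    prefix = _PREFIX.match(first)
    if prefix is None:
        raise ValueError(f"unexpected start of descriptive line: {first!r}")
    lineage = []
    for field in fields:
        level = _LEVEL.match(field)
        if level is None:
            raise ValueError(f"unexpected taxonomic field: {field!r}")
        lineage.append((level["name"], level["rank"]))
    return Header(prefix["category"], prefix["accession"], lineage, species)


header = parse_header(
    ">[I]AACY020336309;Archaea(superkingdom);Euryarchaeota(phylum);"
    "Thermoplasmata(class);Thermoplasmatales(order);"
    "Marine_Group_II(no_rank);;marine_metagenome"
)
print(header.category, header.accession, header.species)
```

Reading the rank from the last parentheses is a design choice made for this sketch; it is not documented behaviour of any program that consumes the database.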

Another example here for [C] sequences:
>[C]AAAK03000010;Bacteria(superkingdom);Firmicutes(phylum);Bacilli(class);Lactobacillales(order);Enterococcaceae(family);Enterococcus(genus);;Enterococcus_faecium_DO
This sequence is a complete one ([C]), with a specific accession number from NCBI, SILVA, or another database (AAAK03000010). All taxonomic data are then separated by ';' characters, one field for each considered level (superkingdom, phylum, class, order, family, and genus). The species name comes last and is separated from the rest of the descriptive line by two ';' characters. Complete sequences must have six levels of information (superkingdom, phylum, class, order, family, and genus). If this is not the case, the sequence is considered Incomplete ([I]) (between three and five levels) or Environmental ([E]) (with only the superkingdom and the phylum levels).
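
The rule for choosing between [C], [I] and [E] can be sketched as a small Python function. This is an illustration under stated assumptions, not the procedure actually used to build the database: the function name `classify` is hypothetical, 'no_rank' fields are assumed not to count as a level, and cases the description does not cover (for example a single known level) raise an error.

```python
from typing import Iterable

# The six levels considered by the database, as listed in the description.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus")


def classify(ranks: Iterable[str]) -> str:
    """Return 'C', 'I' or 'E' from the ranks present in a descriptive line.

    Assumption: only the six named ranks are counted; 'no_rank' is ignored.
    """
    known = {rank for rank in ranks if rank in RANKS}
    if len(known) == 6:
        return "C"  # Complete: all six levels are present
    if 3 <= len(known) <= 5:
        return "I"  # Incomplete: between three and five levels
    if known == {"superkingdom", "phylum"}:
        return "E"  # Environmental: only superkingdom and phylum
    raise ValueError(f"combination of levels not covered by the description: {sorted(known)}")


print(classify(["superkingdom", "phylum", "class", "order", "no_rank"]))
```

Under these assumptions, the ranks in each of the three example descriptive lines in this description map to the category tag that the line carries.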

Another example here for [E] sequences:
>[E]U59968;Archaea(superkingdom);Thaumarchaeota(phylum);Soil_Crenarchaeotic_Group(SCG)(no_rank);;uncultured_crenarchaeote
This sequence is an environmental one ([E]), with a specific accession number from NCBI, SILVA, or another database (U59968). All taxonomic data are then separated by ';' characters, one field for each considered level (superkingdom, phylum, class, order, family, and genus). The species name comes last and is separated from the rest of the descriptive line by two ';' characters. Complete sequences must have six levels of information (superkingdom, phylum, class, order, family, and genus). If this is not the case, the sequence is considered Incomplete ([I]) (between three and five levels) or Environmental ([E]) (with only the superkingdom and the phylum levels).

More details on the steps used to clean and build this new database are available on request (sebastien.terrat@inra.fr).

Files

Files (952.8 MB)

md5:3f3d82b63808daf6744791640643d67a (553.7 MB)
md5:0b1d9c05e36d081d1917ff64b17d9af0 (399.1 MB)