Database of 16S sequences from SILVA (r114), filtered, curated and annotated to be used easily by programs of taxonomic assignments

TERRAT, Sébastien; COTTIN, Aurélien; DEQUIEDT, Samuel; KARIMI, Battle; CHEMIDLIN PREVOST-BOURE, Nicolas; MARON, Pierre-Alain; RANJARD, Lionel

doi:10.5281/zenodo.1065438

Published November 23, 2017 | Version v1

Dataset Open

1. INRA - UMR 1347 Agroecology

The database used for the taxonomic assignment of reads generally comes from the SILVA database (http://www.arb-silva.de/). The logic behind this database is to use the information in order of quality, from the best-annotated sequences to the worst. This is why the curated database was split into two parts: the [C] sequences, which are Complete in terms of taxonomy, and the [I] and [E] sequences, which are Incomplete and Environmental sequences.

Each sequence included in the database must have a descriptive line in a specific format that summarizes all the needed information (example below):
>[I]AACY020336309;Archaea(superkingdom);Euryarchaeota(phylum);Thermoplasmata(class);Thermoplasmatales(order);Marine_Group_II(no_rank);;marine_metagenome

This sequence is an incomplete one ([I]), identified by an accession number from NCBI, SILVA, or another database (AACY020336309). The taxonomic data follows, with each considered level (superkingdom, phylum, class, order, family, and genus) separated by ';' characters. The species name comes last and is separated from the rest of the descriptive line by two ';' characters. The descriptive line must not contain special characters such as spaces. If one or several levels are unknown, this is indicated by 'no_rank'.
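The format described above can be split into its parts with a few lines of Python. This is a minimal sketch rather than a tool supplied with the database: the names `Header` and `parse_header` are hypothetical, and it assumes that the rank is always the last parenthesized group of a field, which is what lets names such as `Soil_Crenarchaeotic_Group(SCG)` keep their own parentheses.

```python
import re
from typing import List, NamedTuple, Tuple

# First field: optional '>', a [C]/[I]/[E] tag, then the accession number.
TAG_RE = re.compile(r"^>?\[([CIE])\]([^;]+)$")
# Taxonomic field: a name followed by its rank in the last pair of parentheses.
LEVEL_RE = re.compile(r"^(.*)\(([^()]+)\)$")


class Header(NamedTuple):
    category: str                    # 'C', 'I' or 'E'
    accession: str                   # e.g. 'AACY020336309'
    lineage: List[Tuple[str, str]]   # (name, rank) pairs, in file order
    species: str                     # text after the ';;' separator


def parse_header(line: str) -> Header:
    """Split one descriptive line into category, accession, lineage, species."""
    taxonomy, _, species = line.strip().partition(";;")
    first, *levels = taxonomy.split(";")
    tag = TAG_RE.match(first)
    if tag is None:
        raise ValueError(f"unexpected start of descriptive line: {first!r}")
    lineage = []
    for level in levels:
        match = LEVEL_RE.match(level)
        if match is None:
            raise ValueError(f"unexpected taxonomic field: {level!r}")
        lineage.append((match.group(1), match.group(2)))
    return Header(tag.group(1), tag.group(2), lineage, species)
```

Calling `parse_header` on the example line above yields the category `I`, the accession `AACY020336309`, five (name, rank) pairs ending with `('Marine_Group_II', 'no_rank')`, and the species `marine_metagenome`.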

Another example, for [C] sequences:
>[C]AAAK03000010;Bacteria(superkingdom);Firmicutes(phylum);Bacilli(class);Lactobacillales(order);Enterococcaceae(family);Enterococcus(genus);;Enterococcus_faecium_DO
This sequence is a complete one ([C]), identified by an accession number from NCBI, SILVA, or another database (AAAK03000010). The taxonomic data follows, with each considered level (superkingdom, phylum, class, order, family, and genus) separated by ';' characters. The species comes last and is separated from the rest of the descriptive line by two ';' characters. Complete sequences must have six levels of information (superkingdom, phylum, class, order, family, and genus). Otherwise, the sequence is considered Incomplete ([I]) (between three and five levels) or Environmental ([E]) (only the superkingdom and phylum levels).

Another example, for [E] sequences:
>[E]U59968;Archaea(superkingdom);Thaumarchaeota(phylum);Soil_Crenarchaeotic_Group(SCG)(no_rank);;uncultured_crenarchaeote
This sequence is an environmental one ([E]), identified by an accession number from NCBI, SILVA, or another database (U59968). The taxonomic data follows, with each considered level (superkingdom, phylum, class, order, family, and genus) separated by ';' characters. The species comes last and is separated from the rest of the descriptive line by two ';' characters. Complete sequences must have six levels of information (superkingdom, phylum, class, order, family, and genus). Otherwise, the sequence is considered Incomplete ([I]) (between three and five levels) or Environmental ([E]) (only the superkingdom and phylum levels).
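The classification rule can be sketched as a count of named ranks in the descriptive line. The function names below are hypothetical, and two points are assumptions rather than statements from the authors: 'no_rank' fields are not counted as levels (a reading that is consistent with the three example lines), and a line with fewer than three ranked levels is treated as Environmental.

```python
import re

# The six levels a Complete sequence must carry, as listed in the description.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus")
# The rank is taken from the last parenthesized group of a field.
RANK_RE = re.compile(r"\(([^()]+)\)$")


def ranked_levels(header: str) -> list:
    """Return the named ranks found before the ';;' separator."""
    taxonomy = header.strip().partition(";;")[0]
    fields = taxonomy.split(";")[1:]  # drop the '[X]accession' field
    found = []
    for field in fields:
        match = RANK_RE.search(field)
        # Assumption: 'no_rank' (or any unlisted rank) is not a counted level.
        if match and match.group(1) in RANKS:
            found.append(match.group(1))
    return found


def expected_category(header: str) -> str:
    """Apply the rule: 6 levels -> C, 3 to 5 -> I, otherwise -> E."""
    count = len(ranked_levels(header))
    if count == len(RANKS):
        return "C"
    if count >= 3:
        return "I"
    return "E"  # assumption: fewer than three ranked levels means Environmental
```

Comparing `expected_category(line)` with the tag actually written in each descriptive line is one way to check that a file follows the rule.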

More details on the steps used to clean and build this database are available on request (sebastien.terrat@inra.fr).

Files (952.8 MB)

Name	MD5	Size
C_SILVA_DB_R114.fasta	3f3d82b63808daf6744791640643d67a	553.7 MB
EI_SILVA_DB_R114.fasta	0b1d9c05e36d081d1917ff64b17d9af0	399.1 MB
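
The MD5 checksums listed with the files can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5sum` and `verify_download` are hypothetical, and the sketch assumes the files keep their original names on disk.

```python
import hashlib
import os

# Checksums as published in the file list of this record.
EXPECTED_MD5 = {
    "C_SILVA_DB_R114.fasta": "3f3d82b63808daf6744791640643d67a",
    "EI_SILVA_DB_R114.fasta": "0b1d9c05e36d081d1917ff64b17d9af0",
}


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str) -> bool:
    """Return True if the file's MD5 matches the published value."""
    name = os.path.basename(path)
    if name not in EXPECTED_MD5:
        raise KeyError(f"no published checksum for {name!r}")
    return md5sum(path) == EXPECTED_MD5[name]
```

For example, `verify_download("C_SILVA_DB_R114.fasta")` returns `True` only when the local copy matches the published checksum. Reading in chunks avoids loading a file of several hundred megabytes into memory at once.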
