INNUENDO whole genome and core genome MLST schemas and datasets for Salmonella enterica
Creators
- Mirko Rossi1
- Mickael Santos Da Silva2
- Bruno Filipe Ribeiro-Gonçalves2
- Diogo Nuno Silva2
- Miguel Paulo Machado2
- Mónica Oleastro3
- Vítor Borges3
- Joana Isidro3
- Luis Viera3
- Jani Halkilahti4
- Anniina Jaakkonen5
- Federica Palma6
- Saara Salmenlinna4
- Marjaana Hakkinen5
- Javier Garaizar7
- Joseba Bikandi7
- Friederike Hilbert8
- João André Carriço2
- 1. University of Helsinki
- 2. University of Lisbon
- 3. Instituto Nacional de Saúde Dr. Ricardo Jorge
- 4. Terveyden ja hyvinvoinnin laitos
- 5. Elintarviketurvallisuusvirasto
- 6. ANSES
- 7. University of the Basque Country
- 8. University of Veterinary Medicine, Vienna
Description
Dataset
As a reference dataset, 4,307 publicly available draft or complete genome assemblies of Salmonella enterica, together with the available metadata, were downloaded from public repositories (i.e. EnteroBase, the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EMBL-EBI); accessed April 2017). The collection includes 1,465 S. Enteritidis, 2,442 S. Typhimurium, and 400 genomes of other serovars frequently isolated in Europe. The dataset also includes 153 S. Typhimurium variant 4,[5],12:i:- genomes collected from different Italian regions between 2012 and 2014 during a surveillance study, and 129 S. Enteritidis genomes belonging to the INNUENDO sequence dataset (PRJEB27020), for a total of 4,589 genomes. The 282 additional genomes were assembled using INNUca v3.1.
File 'Metadata/Senterica_metadata.txt' contains metadata for each strain, including source classification, host taxa, year and country of isolation, serotype, classical PubMLST seven-gene ST classification, and source/method of the assembly.
The directory 'Genomes' contains all 4,589 assemblies of the strains listed in 'Metadata/Senterica_metadata.txt'. Please note that genomes marked as 'Enterobase' were downloaded from the EnteroBase website, http://enterobase.warwick.ac.uk.
Schema creation and validation
The wgMLST schema from EnteroBase was downloaded and curated using chewBBACA AutoAlleleCDSCuration to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci was assessed using chewBBACA Schema Evaluation, and three groups of loci were removed: loci with a single allele, loci with high length variability (i.e. more than one allele outside the mode +/- 0.05 size), and loci present in less than 0.5% of the Salmonella genomes in EnteroBase at the date of the analysis (April 2017). The wgMLST schema was further curated by excluding all loci detected as “Repeated Loci” and loci annotated as “non-informative paralogous hit (NIPH/NIPHEM)” or “Allele Larger/Smaller than length mode (ALM/ASM)” by the chewBBACA Allele Calling engine in more than 1% of a dataset composed of 4,589 Salmonella genomes.
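The single-allele and length-variability criteria above can be illustrated with a minimal Python sketch. This is not the chewBBACA Schema Evaluation code: the function names are hypothetical, and the sketch assumes that "mode +/- 0.05 size" means allele lengths within 5% of the modal allele length of the locus.

```python
from collections import Counter


def length_mode(lengths):
    """Most frequent allele length; ties are broken by the smaller length."""
    counts = Counter(lengths)
    top = max(counts.values())
    return min(length for length, n in counts.items() if n == top)


def keep_locus(allele_lengths, tolerance=0.05, max_outliers=1):
    """Return True if a locus passes both illustrative filters.

    allele_lengths: lengths (in bp) of all alleles of one locus.
    A locus is dropped when it has a single allele, or when more than
    `max_outliers` alleles fall outside mode * (1 +/- tolerance).
    """
    if len(allele_lengths) < 2:
        return False  # single-allele (or empty) locus
    mode = length_mode(allele_lengths)
    low, high = mode * (1 - tolerance), mode * (1 + tolerance)
    outliers = sum(1 for n in allele_lengths if n < low or n > high)
    return outliers <= max_outliers
```

For example, `keep_locus([300, 300, 400])` keeps the locus because only one allele lies outside 285 to 315 bp, whereas `keep_locus([300, 300, 300, 400, 150])` rejects it.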
File 'Schemas/Senterica_wgMLST_ 8558_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 8,558 loci.
File 'Schemas/Senterica_cgMLST_ 3255_listGenes.txt' contains the list of genes from the wgMLST schema that defines the cgMLST schema. The cgMLST schema consists of 3,255 loci, defined as the loci present in at least 99% of the 4,589 Salmonella genomes. Genomes have no more than 2% of missing loci.
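The presence threshold behind the cgMLST definition can be sketched as follows. This is an illustration only, not the procedure used to build the schema: the function names and the in-memory layout (one list of integer allele calls per genome, with 0 meaning a missing locus) are assumptions made for the example.

```python
def select_core_loci(profiles, loci, min_presence=0.99):
    """Return the loci called in at least `min_presence` of the genomes.

    profiles: dict mapping genome name -> list of allele calls,
              aligned with `loci`; 0 marks a missing locus (assumption).
    """
    n_genomes = len(profiles)
    if n_genomes == 0:
        return []
    core = []
    for i, locus in enumerate(loci):
        present = sum(1 for calls in profiles.values() if calls[i] != 0)
        if present / n_genomes >= min_presence:
            core.append(locus)
    return core


def missing_fraction(calls, loci, core):
    """Fraction of core loci that are missing (0) in one genome."""
    core_set = set(core)
    core_calls = [c for locus, c in zip(loci, calls) if locus in core_set]
    if not core_calls:
        return 0.0
    return sum(1 for c in core_calls if c == 0) / len(core_calls)
```

With the thresholds stated above, `min_presence=0.99` selects the core loci, and a genome would be flagged if `missing_fraction(...)` exceeded 0.02.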
File 'Allele_Profles/Senterica_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci follow the annotation of chewBBACA Allele Calling software.
File 'Allele_Profles/Senterica_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci are indicated with a zero.
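The two profile files differ in how missing loci are encoded, so a small helper can bring wgMLST-style calls to the zero convention before comparing profiles. The sketch below is hedged: it assumes that non-numeric labels (such as the NIPH, NIPHEM, ALM and ASM annotations mentioned above) denote a missing call and that newly inferred alleles carry an "INF-" prefix. Both are assumptions about the file content and should be checked against the actual data. The function names are hypothetical.

```python
def normalise_call(call):
    """Convert one allele call to an integer, using 0 for a missing locus.

    Assumption: an 'INF-' prefix marks a newly inferred allele whose
    number follows the prefix; any other non-numeric label is missing.
    """
    call = call.strip()
    if call.startswith("INF-"):
        call = call[len("INF-"):]
    return int(call) if call.isdigit() else 0


def allelic_distance(profile_a, profile_b):
    """Count loci with different alleles, skipping loci missing in either."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a != 0 and b != 0 and a != b
    )
```

Ignoring loci that are missing in either genome is one common choice for pairwise comparison. Other treatments of missing data are possible and may suit a given analysis better.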
Additional citations
The schemas are prepared for use with chewBBACA. When using the schemas in this repository, please also cite:
Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166
The Salmonella enterica schema is derived from the EnteroBase Salmonella wgMLST schema. When using the schema in this repository, please also cite:
Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14 (4):e1007261. https://doi.org/10.1371/journal.pgen.1007261
Files
Name | Size | Checksum |
---|---|---|
Salmonella_enterica.zip | 6.1 GB | md5:6f1e56c44e473b6ba942bfe596312955 |
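
Because the archive is large, it may be worth checking the download against the published MD5 checksum before unpacking. A minimal sketch using only the Python standard library is shown here. The local path passed to the function is an assumption, and the function name is hypothetical.

```python
import hashlib

# Checksum published with the record for Salmonella_enterica.zip
EXPECTED_MD5 = "6f1e56c44e473b6ba942bfe596312955"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage, assuming the archive was saved in the current directory: `md5_of_file("Salmonella_enterica.zip") == EXPECTED_MD5` should evaluate to `True` for an intact download.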