Dataset Open Access

INNUENDO whole genome and core genome MLST schemas and datasets for Escherichia coli

Mirko Rossi; Mickael Santos Da Silva; Bruno Filipe Ribeiro-Gonçalves; Diogo Nuno Silva; Miguel Paulo Machado; Mónica Oleastro; Vítor Borges; Joana Isidro; Luis Viera; Jani Halkilahti; Anniina Jaakkonen; Federica Palma; Saara Salmenlinna; Marjaana Hakkinen; Javier Garaizar; Joseba Bikandi; Friederike Hilbert; João André Carriço


The reference dataset comprises 2,218 public draft or complete genome assemblies of Escherichia coli, with their available metadata, downloaded from EnteroBase in April 2017. Genomes were selected on the basis of the ribosomal ST (rST) classification available in EnteroBase: within each rST, genomes were randomly selected and downloaded, so that the number of samples for each rST in the final dataset is proportional to the number available in EnteroBase in April 2017. The dataset also includes 119 Shiga toxin-producing E. coli genomes from the INNUENDO Sequence Dataset (PRJEB27020), assembled with INNUca v3.1, for a total of 2,337 genomes.
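The rST-proportional random selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the script used by the authors: the function name `sample_proportional`, the rounding rule and the fixed seed are all hypothetical, and the record does not say how rSTs with very few genomes were handled.

```python
import random
from collections import defaultdict


def sample_proportional(genome_to_rst, total, seed=0):
    """Randomly pick about `total` genomes, keeping each rST's share.

    genome_to_rst: dict mapping a genome identifier to its rST label.
    Rounding the per-rST quota is an assumption of this sketch, so the
    final size can differ slightly from `total`, and rare rSTs may
    receive a quota of zero.
    """
    rng = random.Random(seed)

    # Group genome identifiers by rST.
    by_rst = defaultdict(list)
    for genome, rst in genome_to_rst.items():
        by_rst[rst].append(genome)

    n_all = len(genome_to_rst)
    selected = []
    for rst, genomes in sorted(by_rst.items()):
        # Quota proportional to this rST's share of the source collection.
        quota = round(total * len(genomes) / n_all)
        # Sort first so the draw depends only on the seed, not on dict order.
        selected.extend(rng.sample(sorted(genomes), min(quota, len(genomes))))
    return selected
```

With 80 genomes of one rST and 20 of another, asking for 10 genomes returns 8 from the first group and 2 from the second.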

File 'Metadata/Ecoli_metadata.txt' contains metadata for each strain, including source classification, host taxa, country and year of isolation, serotype, pathotype, classical pubMLST 7-gene ST classification, assembly source/method and EnteroBase barcode.

The directory 'Genomes' contains the 119 INNUca v3.1 assemblies of the strains listed in 'Metadata/Ecoli_metadata.txt'. EnteroBase assemblies can be downloaded from EnteroBase using the 'barcode'.

Schema creation and validation

The wgMLST schema from EnteroBase was downloaded and curated using chewBBACA AutoAlleleCDSCuration to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci was assessed using chewBBACA Schema Evaluation. Loci with a single allele, loci with high length variability (i.e. more than 1 allele outside the mode +/- 0.05 size) and loci present in less than 0.5% of the Escherichia genomes in EnteroBase at the date of the analysis (April 2017) were removed. The wgMLST schema was further curated by excluding all loci detected as “Repeated Loci” and all loci annotated as “non-informative paralogous hit (NIPH/NIPHEM)” or “Allele Larger/Smaller than length mode (ALM/ASM)” by the chewBBACA Allele Calling engine in more than 1% of a dataset composed of 2,337 Escherichia coli genomes.
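As an illustration of the length-variability criterion, the sketch below flags a locus when more than one allele falls outside the modal length +/- 5%. It assumes that "mode +/- 0.05 size" means 5% of the modal allele length; the function name and the tie-breaking rule for the mode are hypothetical, and this is not the chewBBACA Schema Evaluation code itself.

```python
from collections import Counter


def has_high_length_variability(allele_lengths, tolerance=0.05, max_outliers=1):
    """Return True if more than `max_outliers` alleles lie outside mode +/- tolerance.

    allele_lengths: list of allele lengths (in bp) for one locus.
    """
    counts = Counter(allele_lengths)
    # Modal length; ties are resolved towards the shorter length (an assumption).
    mode = min(counts, key=lambda length: (-counts[length], length))

    low = mode * (1 - tolerance)
    high = mode * (1 + tolerance)
    outliers = sum(1 for length in allele_lengths if length < low or length > high)
    return outliers > max_outliers
```

For a locus with ten 300 bp alleles, one 303 bp allele and one 400 bp allele, only the 400 bp allele is outside the 285-315 bp window, so the locus would be kept under this reading of the rule.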

File 'Schema/Ecoli_wgMLST_7601_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 7,601 loci.

File 'Schema/Ecoli_cgMLST_2360_listGenes.txt' contains the list of genes from the wgMLST schema that define the cgMLST schema. The cgMLST schema consists of 2,360 loci, defined as the loci present in at least 99% of the 2,337 Escherichia coli genomes. Each genome has no more than 2% of missing loci.
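The 99% presence criterion can be expressed in a few lines. The sketch below is a minimal illustration, assuming allele calls are already integers with 0 marking a missing locus; the function names and the in-memory data layout are hypothetical and do not reproduce the authors' pipeline.

```python
def core_loci(profiles, loci, min_presence=0.99):
    """Return the loci called in at least `min_presence` of the genomes.

    profiles: dict mapping genome id to a list of integer allele calls,
              aligned with `loci`; 0 means the locus is missing.
    """
    n_genomes = len(profiles)
    core = []
    for idx, locus in enumerate(loci):
        present = sum(1 for calls in profiles.values() if calls[idx] != 0)
        if present / n_genomes >= min_presence:
            core.append(locus)
    return core


def missing_fraction(calls):
    """Fraction of loci with a missing call (0) in one genome's profile."""
    return calls.count(0) / len(calls)
```

`missing_fraction` can be used to check the per-genome limit on missing loci, for example by testing `missing_fraction(calls) <= 0.02` on each cgMLST profile.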

File 'Allele_Profles/Ecoli_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profiles of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci follow the annotation of the chewBBACA Allele Calling software.
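When working with chewBBACA-annotated profiles, a common first step is to convert each call to an integer, treating every non-numeric annotation as missing. The sketch below assumes the labels used by chewBBACA Allele Calling (for example LNF, NIPH, NIPHEM, ALM, ASM, PLOT3, PLOT5) and the "INF-" prefix for newly inferred alleles; the exact labels present in this file should be checked before use, and the function name `mask_call` is hypothetical.

```python
def mask_call(call):
    """Convert one chewBBACA allele call to an integer, 0 meaning missing.

    "INF-12" (newly inferred allele 12) becomes 12; any non-numeric
    annotation such as "LNF" or "NIPH" becomes 0.
    """
    call = call.strip()
    if call.startswith("INF-"):
        call = call[len("INF-"):]
    return int(call) if call.isdigit() else 0


def mask_profile(calls):
    """Apply mask_call to every call of one genome's profile."""
    return [mask_call(call) for call in calls]
```

Applied to `["7", "INF-12", "LNF", "NIPHEM"]`, `mask_profile` returns `[7, 12, 0, 0]`.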

File 'Allele_Profles/Ecoli_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profiles of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci are indicated with a zero.
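Because missing loci are coded as zero, they need to be skipped when comparing two profiles. A minimal sketch of a pairwise allelic distance that ignores loci missing in either genome is given below; the function name is hypothetical, and ignoring missing loci is one common convention rather than a method prescribed by this repository.

```python
def allelic_distance(profile_a, profile_b):
    """Count loci with different alleles, skipping loci missing (0) in either profile.

    Both profiles are lists of integer allele calls over the same loci,
    in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a != 0 and b != 0 and a != b
    )
```

For the profiles `[1, 2, 0, 4]` and `[1, 3, 5, 0]`, only the second locus is called in both genomes with different alleles, so the distance is 1.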

Additional citations

The schemas are prepared for use with chewBBACA. When using the schemas in this repository, please also cite:

Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J (2018) chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. Microb Genom 4(3). doi:10.1099/mgen.0.000166

The Escherichia coli schema is derived from the EnteroBase E. coli wgMLST schema. When using the schema in this repository, please also cite:

Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14 (4):e1007261.

The raw sequence data of the isolates' genomes produced within the INNUENDO project were submitted to the European Nucleotide Archive (ENA) and are publicly available under the project accession number PRJEB27020. When using the schemas, the assemblies or the allele profiles, please include the project number in your publication.

The research from the INNUENDO project has received funding from the European Food Safety Authority (EFSA), grant agreement GP/EFSA/AFSCO/2015/01/CT2 (New approaches in identifying and characterizing microbial and chemical hazards), and from the Government of the Basque Country. The conclusions, findings, and opinions expressed in this repository reflect only the view of the INNUENDO consortium members and not the official position of EFSA nor of the Government of the Basque Country. EFSA and the Government of the Basque Country are not responsible for any use that may be made of the information included in this repository.

The INNUENDO consortium thanks the Austrian Agency for Health and Food Safety Limited for participating in the project by providing strains. The consortium thanks all the researchers and authorities worldwide who contribute by submitting the raw sequences of the bacterial strains to public repositories. The project was made possible by the support of CSC - Tieteen tietotekniikan keskus Oy and of INCD (funded by FCT and FEDER under the project 22153-01/SAICT/2016), which provided access to cloud computing resources.
Files (313.4 MB)

