Published September 28, 2022 | Version v1.0.1
Dataset Open

Genome assemblies and respective cgMLST profiles of a diverse dataset comprising 1,874 Listeria monocytogenes isolates

  • 1. Genomics and Bioinformatics Unit, Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal
  • 2. Department Biological Safety, German Federal Institute for Risk Assessment, Berlin, Germany

Description

Dataset

This dataset comprises the genome assemblies and respective 1,748-loci core-genome Multilocus Sequence Typing (cgMLST) profiles [Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,874 Listeria monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was selected to cover a wide genetic diversity, assessed in terms of Sequence Type (ST). In total, 204 different STs are represented in this dataset. The five most frequent STs (ST121, ST6, ST9, ST1 and ST155) together correspond to 37.9% of the dataset.
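The ST summary above (number of distinct STs, most frequent STs and their combined share) can be recomputed from the ST column of the metadata file. The following is a minimal sketch, assuming the ST labels have already been read into a Python list; the function name `top_st_share` and the toy values are illustrative and are not taken from the dataset.

```python
from collections import Counter

def top_st_share(sts, n=5):
    """Return the n most frequent STs and their combined share (%) of all isolates."""
    if not sts:
        raise ValueError("no ST labels provided")
    counts = Counter(sts)
    top = counts.most_common(n)
    share = 100.0 * sum(count for _, count in top) / len(sts)
    return [st for st, _ in top], share

# Toy input for illustration only (NOT the real ST distribution).
toy_sts = ["ST121"] * 4 + ["ST6"] * 3 + ["ST9"] * 2 + ["ST1"]
top, share = top_st_share(toy_sts, n=2)
print(top, f"{share:.1f}%")
print("distinct STs:", len(set(toy_sts)))
```

Applied to the real metadata with `n=5`, the same calculation should give the top 5 STs and the 37.9% figure reported above.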

File “Lm_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST.

The directory “assemblies/” contains the genome assembly (.fasta format) of each isolate listed in the metadata file.

The file “profiles/Lm_profile.tsv” is a tab-separated file with the 1,748-loci cgMLST profile of each isolate listed in the metadata file. These profiles were determined as explained below.
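As an illustration of how such a profile table can be used, the sketch below reads a tab-separated profile and counts the allelic differences between two isolates. It rests on several assumptions that are not stated in this record: that the first column holds the isolate identifier and the remaining columns hold one locus each, that non-numeric entries (for example `LNF`) mean the locus has no usable call, and that an `INF-` prefix marks a newly inferred allele whose number can be used directly. The function names and the toy table are hypothetical.

```python
import csv
import io

def parse_call(value):
    """Return the allele number, or None when the locus has no usable call."""
    value = value.strip()
    if value.startswith("INF-"):  # assumption: prefix marks a newly inferred allele
        value = value[4:]
    return int(value) if value.isdigit() else None

def read_profiles(handle):
    """Read a tab-separated profile table into (loci, {isolate: [allele or None]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    loci = header[1:]  # assumption: the first column is the isolate identifier
    profiles = {row[0]: [parse_call(v) for v in row[1:]] for row in reader if row}
    return loci, profiles

def allelic_distance(a, b):
    """Count loci where both isolates have a call and the alleles differ."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)

# Toy table for illustration only (NOT real data).
toy = "FILE\tlocus1\tlocus2\tlocus3\nisoA\t1\t2\tLNF\nisoB\t1\tINF-7\t4\n"
loci, profiles = read_profiles(io.StringIO(toy))
distance = allelic_distance(profiles["isoA"], profiles["isoB"])
print(loci, distance)
```

For the real file, the handle would come from `open("profiles/Lm_profile.tsv", newline="")`. Ignoring loci without a call in either isolate is one common convention; other handling of missing data is possible.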

 

Dataset selection and curation

With the objective of creating a diverse dataset of L. monocytogenes genome assemblies, we collected information about the genetic diversity (STs) of the isolates available in the BIGSdb-Lm database at the beginning of this analysis (November 2021) and reported in previous works. Based on this information, we selected an initial dataset comprising 1,957 samples associated with three previous studies (Moura et al. 2016; Maury et al. 2017; Painset et al. 2019). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6.

Read quality control, trimming and assembly were performed with the Aquamis v1.3.9 pipeline (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the metadata file).

In total, 1,874 isolates passed the dataset curation step and were included in the final dataset. cgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 1,748-loci Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022) and downloaded on June 23rd, 2022.
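The inclusion rule described in this section can be written as a small predicate. This is a sketch of one reading of the text, namely that recovery applied only to assemblies whose sole failed check was “NumContamSNVs”. The function name, its arguments and the check name `"OtherCheck"` are hypothetical and do not correspond to actual Aquamis output fields; only the “NumContamSNVs” label and the 98% threshold come from the description.

```python
def keep_in_final_dataset(failed_checks, pct_reads_correct_species, threshold=98.0):
    """Decide whether an assembly enters the final dataset.

    failed_checks: names of the QC checks the assembly failed (empty if QC passed).
    pct_reads_correct_species: percentage of reads assigned to the expected species.
    """
    failed = set(failed_checks)
    if not failed:
        return True  # passed QC outright
    # Assumed reading: recover only when NumContamSNVs is the sole failure
    # and the share of reads from the correct species is above the threshold.
    return failed == {"NumContamSNVs"} and pct_reads_correct_species > threshold

# Illustrative calls with made-up values.
print(keep_in_final_dataset([], 95.0))
print(keep_in_final_dataset(["NumContamSNVs"], 99.2))
print(keep_in_final_dataset(["NumContamSNVs", "OtherCheck"], 99.9))
```

The manual inspection step mentioned above is not captured by this predicate.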

 

Acknowledgements

We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.

Files

Lm_public.zip

Files (1.7 GB)

Name Size
md5:0b14013b0820b9474e6d2494ec6aa4e0 183.6 kB
md5:0ba940bde06376e7cb5e4e2d33607e9c 1.7 GB
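The MD5 checksums listed for the files can be used to check a download for corruption. Below is a minimal sketch using Python's standard `hashlib` module. The helper names are illustrative, and because this listing does not state which file name belongs to which checksum, both the path and the expected digest are passed in by the caller.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:0b14...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(downloaded_path, "md5:0ba940bde06376e7cb5e4e2d33607e9c")` returns `True` only if the downloaded file has that digest, where `downloaded_path` is wherever the file was saved locally.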

Additional details

Funding

One Health EJP – Promoting One Health in Europe through joint actions on foodborne zoonoses, antimicrobial resistance and emerging microbiological hazards. 773830
European Commission