Published September 28, 2022 | Version v1.0.1
Dataset Open

Genome assemblies and respective wg/cgMLST profiles of a diverse dataset comprising 3,076 Campylobacter jejuni isolates

  • 1. Genomics and Bioinformatics Unit, Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal
  • 2. Department Biological Safety, German Federal Institute for Risk Assessment, Berlin, Germany



This dataset comprises the genome assemblies and respective 2,794-loci whole-genome (wg) Multilocus Sequence Typing (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 3,076 Campylobacter jejuni samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of Sequence Type [ST]). In total, 476 different STs are represented in this dataset. ST21, ST50, ST48, ST45 and ST257 are the most represented ones and together correspond to 29.1% of the dataset.

The file “Cj_metadata.xlsx” contains metadata for each isolate, including ENA/SRA accession number, BioProject and in silico MLST ST.

The directory “assemblies/” contains the genome assembly (.fasta format) of each isolate listed in the metadata file.

The file “profiles/Cj_profiles_wgMLST.tsv” is a tab-separated file with the 2,794-loci wgMLST profiles of each isolate listed in the metadata file. The files “profiles/Cj_profiles_cgMLST_95.tsv”, “profiles/Cj_profiles_cgMLST_98.tsv” and “profiles/Cj_profiles_cgMLST_100.tsv” contain the 1,012-loci, 987-loci and 29-loci cgMLST profiles of each isolate listed in the metadata file, respectively. These profiles were determined as explained below.
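As a minimal sketch of how one of these profile files could be read with the Python standard library: the layout assumed here (one header row, isolate identifier in the first column, one locus per remaining column) is an assumption and is not stated in this description, and the function name load_profiles and the demo values are purely illustrative.

```python
import csv
import io


def load_profiles(handle):
    """Read a tab-separated allelic profile matrix.

    Assumption (not confirmed by the dataset description): the first row is a
    header, the first column holds the isolate identifier, and every other
    column corresponds to one locus.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    loci = header[1:]
    profiles = {}
    for row in reader:
        if not row:  # skip empty lines
            continue
        # Map each locus name to the allele call of this isolate
        profiles[row[0]] = dict(zip(loci, row[1:]))
    return loci, profiles


# Made-up two-isolate matrix standing in for profiles/Cj_profiles_wgMLST.tsv
demo = io.StringIO("FILE\tlocus_A\tlocus_B\nisolate_1\t1\t4\nisolate_2\t2\t0\n")
loci, profiles = load_profiles(demo)
print(len(profiles), "isolates x", len(loci), "loci")
```

With the real file, the same function could be called on a handle opened with open("profiles/Cj_profiles_wgMLST.tsv", newline=""). If the layout assumption holds, it should report 3,076 isolates and 2,794 loci.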


Dataset selection and curation

With the objective of creating a diverse dataset of C. jejuni genome assemblies, we collected information about the genetic diversity (Sequence Type [ST]) of the isolates available in the PubMLST database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 3,539 samples. The majority of them are associated with the INNUENDO project (Llarena et al. 2018). The remaining ones are associated with five BioProjects (PRJEB31119, PRJEB38253, PRJEB40238, PRJEB4165 and PRJNA350537).

Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with AQUAMIS v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the metadata file). In total, 3,076 isolates passed this curation step and were included in the final dataset.

wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 2,794-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 2,794-loci wgMLST profiles of the 3,076 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 1,012-loci, 987-loci and 29-loci allelic matrices, respectively).
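The effect of the site-inclusion threshold (keep loci called in at least a given fraction of the samples) can be illustrated with a small sketch. This is not ReporTree's code: the function name filter_loci, the toy matrix and the use of "0" to mark a missing allele call are all assumptions made for illustration.

```python
def filter_loci(loci, profiles, site_inclusion, missing="0"):
    """Return the loci called in at least `site_inclusion` of the samples.

    `profiles` maps each sample name to a dict of locus -> allele call.
    Assumption: a call equal to `missing` ("0" here) means the locus was not
    called in that sample.
    """
    n_samples = len(profiles)
    kept = []
    for locus in loci:
        called = sum(1 for calls in profiles.values() if calls[locus] != missing)
        if called >= site_inclusion * n_samples:
            kept.append(locus)
    return kept


# Made-up matrix: locus_A called in 4/4 samples, locus_B in 3/4, locus_C in 2/4
loci = ["locus_A", "locus_B", "locus_C"]
profiles = {
    "s1": {"locus_A": "1", "locus_B": "2", "locus_C": "5"},
    "s2": {"locus_A": "1", "locus_B": "3", "locus_C": "0"},
    "s3": {"locus_A": "2", "locus_B": "0", "locus_C": "0"},
    "s4": {"locus_A": "4", "locus_B": "2", "locus_C": "7"},
}

strict = filter_loci(loci, profiles, 1.0)    # only loci called in every sample
relaxed = filter_loci(loci, profiles, 0.75)  # loci called in at least 3 of 4 samples
print(strict, relaxed)
```

Raising the threshold can only shrink the set of retained loci, which is consistent with the decreasing schema sizes reported above for 0.95, 0.98 and 1.0.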



We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.


Files (1.6 GB)


Additional details


One Health EJP – Promoting One Health in Europe through joint actions on foodborne zoonoses, antimicrobial resistance and emerging microbiological hazards. 773830
European Commission