Published March 12, 2021 | Version 1.0
Dataset Open

Campylobacter dataset with simulated inter and intra genus contaminations

  • 1. Federal Institute for Risk Assessment (BfR)

Description

This dataset is intended for testing the detection of inter- and intra-genus contamination in Campylobacter. Its design follows the concepts presented in https://doi.org/10.1186/s13059-019-1914-x: Illumina reads were simulated from complete Campylobacter genomes and artificially mixed at different concentrations and genomic distances. The mixed reads were then assembled using shovill.
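To illustrate what mixing reads "at different concentrations" means, here is a minimal sketch. It is not the script used to build this dataset; the function name `mix_reads`, its parameters, and the treatment of each list item as one read (or read pair) are assumptions made for illustration.

```python
import random


def mix_reads(target_reads, contaminant_reads, contaminant_fraction,
              total_reads, seed=0):
    """Sample a mixture in which about `contaminant_fraction` of the
    reads come from the contaminant and the rest from the target.

    Hypothetical helper: each list item stands for one read or read pair.
    """
    if not 0.0 <= contaminant_fraction <= 1.0:
        raise ValueError("contaminant_fraction must be between 0 and 1")

    n_contaminant = round(total_reads * contaminant_fraction)
    n_target = total_reads - n_contaminant
    if n_target > len(target_reads) or n_contaminant > len(contaminant_reads):
        raise ValueError("not enough reads to sample from")

    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixed = (rng.sample(target_reads, n_target)
             + rng.sample(contaminant_reads, n_contaminant))
    rng.shuffle(mixed)  # avoid all contaminant reads sitting at the end
    return mixed
```

For example, `mix_reads(target, contaminant, 0.1, 50)` would return 50 reads, 5 of them drawn from the contaminant.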

This dataset contains the following files:

  • simulated_reads.tar: The simulated reads for 248 Campylobacter samples
  • assemblies.tar.gz: Shovill assembly of all read data
  • genome_info_Ca.tsv: Description of the original complete Campylobacter genomes
  • metadata_Ca.tsv: Mixing information for all provided samples
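Because the read archive is large, it can be useful to inspect an archive's contents before extracting it. The sketch below uses Python's standard `tarfile` module; the helper name `list_archive_files` is hypothetical, and nothing is assumed about the internal layout of the archives.

```python
import tarfile


def list_archive_files(archive_path):
    """Return the names of all regular files in a tar archive.

    Mode "r:*" lets tarfile detect the compression itself, so the same
    call works for a plain .tar and for a .tar.gz file.
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        return [member.name for member in archive.getmembers()
                if member.isfile()]
```

Calling `list_archive_files("assemblies.tar.gz")` would list the assembly files without unpacking them. Note that listing an uncompressed multi-gigabyte archive still requires scanning it once.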

 

Details:

We downloaded all complete Campylobacter genomes from NCBI RefSeq. Next, we computed the MLST sequence type (ST) using mlst (https://github.com/tseemann/mlst) and excluded all samples without an ST, resulting in a final dataset of 218 samples. We then determined the genetic similarity between these samples by computing pairwise MLST allele distances (AD) using https://github.com/tseemann/cgmlst-dists.

For each sample, we attempted to find a close, an intermediate and a distant matching sample, following the definitions proposed in https://doi.org/10.1186/s13059-019-1914-x: close (same ST, 0 AD), intermediate (2-6 AD), distant (7 AD). We selected two C. coli and six C. jejuni samples with at least one close, one intermediate and one distant matching sample. For each species we selected genomes with maximal overall genomic diversity.

We simulated reads from the selected genomes using ART_Illumina v2.5.8 (see the FDA dataset referenced below for details). Next, we combined reads from the eight samples and their respective matching samples using the script select_reads.pl (http://github.com/apightling/contamination) to create simulated contaminated datasets. Additionally, we created inter-genus contaminated datasets by mixing reads of the eight Campylobacter spp. samples with reads from any of the other three genera (Listeria, Salmonella, Escherichia) of the analogous FDA dataset (https://doi.org/10.6084/m9.figshare.c.4282706.v1).
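The close, intermediate and distant categories can be sketched as a small classifier over MLST allele profiles. This is an illustration only, not the cgmlst-dists implementation; the function names are hypothetical, and a seven-locus MLST scheme with profiles given as equal-length tuples of allele numbers is assumed.

```python
def allele_distance(profile_a, profile_b):
    """Count the loci at which two MLST allele profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def classify_match(profile_a, profile_b):
    """Map an allele distance (AD) to the categories used in this dataset.

    close: 0 AD (same ST); intermediate: 2-6 AD; distant: 7 AD.
    A distance of 1 falls into none of the categories, so None is returned.
    """
    distance = allele_distance(profile_a, profile_b)
    if distance == 0:
        return "close"
    if 2 <= distance <= 6:
        return "intermediate"
    if distance == 7:
        return "distant"
    return None


# Made-up allele numbers, for illustration only.
reference = (2, 1, 1, 3, 2, 1, 5)
candidate = (2, 1, 4, 3, 7, 1, 5)
category = classify_match(reference, candidate)  # two loci differ
```

Here the two profiles differ at two loci, so the pair counts as an intermediate match.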

Files

Files (16.3 GB)

  • 208.0 MB, md5:be16c528db0f4e71d2a1ce854ace4e9f
  • 16.1 GB, md5:8b035340605f8150ee16b77b66df1da3
  • 2.0 kB, md5:182f574a0b6792028c4d2a8d13f490ed
  • 11.1 kB, md5:290d066f60665b84546d0b19a8a7ba26
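After downloading, a file can be checked against its listed MD5 checksum. The following is a minimal sketch using Python's standard `hashlib` module; the helper names are hypothetical, and the path and expected checksum must be supplied by the user for the file in question.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks so
    that multi-gigabyte archives do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum, with or without the 'md5:' prefix."""
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected.lower()
```

A call such as `matches_checksum(downloaded_path, "md5:...")` returns True only if the downloaded file is byte-identical to the published one.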