Campylobacter dataset with simulated inter and intra genus contaminations

Deneke, Carlus

doi:10.5281/zenodo.4601406

Published March 12, 2021 | Version 1.0

Dataset Open

Deneke, Carlus¹

1. Federal Institute for Risk Assessment (BfR)

This dataset is intended for the detection of inter- and intra-genus contamination in Campylobacter. Its design follows the concepts presented in https://doi.org/10.1186/s13059-019-1914-x: Illumina reads were simulated from complete Campylobacter genomes and artificially mixed at different concentrations and genomic distances. The mixed reads were then assembled using shovill.

This dataset contains the following files:

campylobacter_simulated_reads.tar: The simulated reads for 248 Campylobacter samples
assemblies.tar.gz: Shovill assembly of all read data
genome_info_Ca.tsv: Description of the original complete Campylobacter genomes
metadata_Ca.tsv: Mixing information for all provided samples
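
The two TSV files can be inspected before any analysis. The following is a minimal sketch, assuming that each file is tab-separated and that its first line is a header row; the column names are not listed in this description, so the code only reports whatever it finds. The function name summarize_tsv is hypothetical.

```python
import csv
from pathlib import Path


def summarize_tsv(path):
    """Return (column_names, row_count) for a tab-separated file.

    Assumption: the first line is a header row. The actual column
    names of genome_info_Ca.tsv and metadata_Ca.tsv are not given in
    the dataset description, so nothing is hard-coded here.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        # Count non-empty data rows only.
        row_count = sum(1 for row in reader if row)
    return header, row_count


if __name__ == "__main__":
    for name in ("genome_info_Ca.tsv", "metadata_Ca.tsv"):
        if Path(name).exists():
            columns, rows = summarize_tsv(name)
            print(f"{name}: {rows} rows, columns = {columns}")
        else:
            print(f"{name}: not found in the current directory")
```

Using the standard csv module keeps the sketch free of dependencies; pandas.read_csv with sep="\t" would work equally well if pandas is already in use.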

Details:

We downloaded all complete Campylobacter genomes from NCBI RefSeq. Next, we computed the MLST sequence type (ST) using mlst (https://github.com/tseemann/mlst) and excluded all samples without an ST, resulting in a final dataset of 218 samples. We then determined the genetic similarity between these samples by computing pairwise MLST allele distances (AD) using cgmlst-dists (https://github.com/tseemann/cgmlst-dists). For each sample, we attempted to find a close, an intermediate and a distant matching sample, following the definition proposed in https://doi.org/10.1186/s13059-019-1914-x: close (same ST, 0 AD), intermediate (2-6 AD), distant (7 AD). We selected two C. coli and six C. jejuni samples that each had at least one close, one intermediate and one distant matching sample. For each species, we selected genomes with maximal overall genomic diversity and simulated reads from the selected genomes using ART_Illumina v2.5.8 (see the FDA dataset referenced below for details). Next, to create simulated contaminated datasets, we combined reads from the eight samples and their respective matching samples using the script select_reads.pl (http://github.com/apightling/contamination). Additionally, we created inter-genus contaminants by mixing reads of the eight Campylobacter spp. samples with reads from any of the other three genera (Listeria, Salmonella, Escherichia) of the analogous FDA dataset (https://doi.org/10.6084/m9.figshare.c.4282706.v1).
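
The distance categories used above can be illustrated with a short sketch. It is not the code used to build the dataset (that step relied on mlst and cgmlst-dists); it only shows how an allele distance between two seven-locus MLST profiles maps onto the close, intermediate and distant classes as stated here. The function names and the example allele numbers are hypothetical, and missing alleles are not handled.

```python
def allele_distance(profile_a, profile_b):
    """Count the loci at which two MLST allele profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def classify_distance(ad):
    """Map an allele distance (AD) to the categories in the description.

    close: 0 AD, intermediate: 2-6 AD, distant: 7 AD.
    An AD of 1 is not covered by the stated definition, so None is
    returned for it (and for any other value outside these ranges).
    """
    if ad == 0:
        return "close"
    if 2 <= ad <= 6:
        return "intermediate"
    if ad == 7:
        return "distant"
    return None


if __name__ == "__main__":
    # Made-up allele numbers for a seven-locus scheme, for illustration only.
    target = (2, 1, 1, 3, 2, 1, 5)
    candidates = {
        "same_profile": (2, 1, 1, 3, 2, 1, 5),
        "three_loci_differ": (2, 1, 4, 3, 7, 1, 9),
        "all_loci_differ": (8, 9, 10, 11, 12, 13, 14),
    }
    for name, profile in candidates.items():
        ad = allele_distance(target, profile)
        print(name, ad, classify_distance(ad))
```

Returning None for an AD of 1 keeps the sketch faithful to the definition as written instead of silently assigning that case to a neighbouring class.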

Files (16.3 GB)

Name	MD5	Size
assemblies.tar.gz	be16c528db0f4e71d2a1ce854ace4e9f	208.0 MB
campylobacter_simulated_reads.tar	8b035340605f8150ee16b77b66df1da3	16.1 GB
genome_info_Ca.tsv	182f574a0b6792028c4d2a8d13f490ed	2.0 kB
metadata_Ca.tsv	290d066f60665b84546d0b19a8a7ba26	11.1 kB
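
The MD5 checksums listed for the files can be used to confirm that a download is complete. The following is a minimal sketch, assuming the files were saved under their listed names in a single directory; the function names md5_of_file and verify_downloads are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "assemblies.tar.gz": "be16c528db0f4e71d2a1ce854ace4e9f",
    "campylobacter_simulated_reads.tar": "8b035340605f8150ee16b77b66df1da3",
    "genome_info_Ca.tsv": "182f574a0b6792028c4d2a8d13f490ed",
    "metadata_Ca.tsv": "290d066f60665b84546d0b19a8a7ba26",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {file name: True/False/None}; None means the file is absent."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[status]
        print(f"{name}: {label}")
```

Reading in chunks matters here because the read archive is about 16 GB and should not be loaded into memory at once.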
