Published August 21, 2024 | Version v1
Dataset Open

Creating a 7,000 strains genotype-phenotype dataset of E. coli and antimicrobial resistance phenotypes

Contributors

Project manager:

Description

This Zenodo repository contains the data and code needed to reproduce the study at https://doi.org/10.57844/arcadia-d2cf-ebe5. The input fastq files, which are available on SRA, and the intermediary files generated during the variant calling process are not included. The associated GitHub repository describes the code, pipelines, and analysis in more detail.

 

Work summary

In this work, we established a framework for compiling large genotype-phenotype datasets and produced a large-scale dataset of more than 7,000 E. coli strains and antimicrobial resistance phenotypes.

We leveraged the genetic information and antimicrobial resistance (AMR) phenotype data available for the bacterium Escherichia coli to construct our dataset and took advantage of the existing knowledge about genetic variations and AMR phenotypes to validate our approach and dataset. We performed variant calling and compiled a genotype-phenotype dataset for more than 7,000 E. coli strains. Briefly, variant calling consists of identifying all genetic variations and their associated genotypes in a population compared to a reference genome. This is performed by aligning sequencing reads for each strain of the population against a reference genome, then identifying polymorphic regions in the population, and finally characterizing variants and their genotypes at each of these polymorphic regions.
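The idea behind variant calling can be illustrated with a toy sketch. It is not the pipeline used in this study (which works from raw sequencing reads through a Snakemake workflow). It assumes each strain has already been reduced to a sequence aligned to the reference, with no insertions or deletions, and the strain names and sequences are made up for illustration.

```python
def call_variants(reference, aligned_strains):
    """Return {position: {strain: allele}} for every polymorphic position.

    Assumption: each strain sequence is already aligned to the reference
    and has the same length (no indels), which real pipelines do not require.
    """
    variants = {}
    for pos, ref_base in enumerate(reference):
        # Genotype of every strain at this reference position
        alleles = {name: seq[pos] for name, seq in aligned_strains.items()}
        # A position is polymorphic if any strain differs from the reference
        if any(base != ref_base for base in alleles.values()):
            variants[pos] = alleles
    return variants


# Hypothetical reference and strains, for illustration only
reference = "ATGCCGTA"
strains = {
    "strain_A": "ATGCCGTA",
    "strain_B": "ATGACGTA",
    "strain_C": "ATGACGTT",
}

variants = call_variants(reference, strains)
for pos, genotypes in sorted(variants.items()):
    print(pos, reference[pos], genotypes)
```

Each entry pairs a polymorphic position with the genotype of every strain at that position, which is the same information a multi-sample vcf file records per variant.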

The resulting dataset reveals substantial genetic diversity, with 2.4 million variants identified. We confirmed the dataset's accuracy by focusing on non-silent variants within genes associated with AMR.
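As a rough illustration of what focusing on non-silent variants in AMR genes can look like, the sketch below reads the snpEff `ANN` field of a single vcf record and keeps genes of interest that carry a protein-changing effect. The set of effect terms, the gene list, and the two example records are assumptions chosen for illustration; the filter actually applied in the study is described in the associated GitHub repository and may differ.

```python
# Assumed set of snpEff effect terms treated as "non-silent" in this sketch
NON_SILENT_EFFECTS = {
    "missense_variant",
    "stop_gained",
    "stop_lost",
    "start_lost",
    "frameshift_variant",
}


def non_silent_genes(vcf_line, genes_of_interest):
    """Return the genes of interest hit by a non-silent effect in one record."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th column of a vcf record
    hits = set()
    for entry in info.split(";"):
        if not entry.startswith("ANN="):
            continue
        # snpEff separates annotations with "," and sub-fields with "|"
        for annotation in entry[len("ANN="):].split(","):
            parts = annotation.split("|")
            if len(parts) < 4:
                continue
            effects = set(parts[1].split("&"))  # combined effects use "&"
            gene = parts[3]
            if gene in genes_of_interest and effects & NON_SILENT_EFFECTS:
                hits.add(gene)
    return hits


# Hypothetical records, for illustration only
missense_line = (
    "locus_1\t248\t.\tC\tT\t50\tPASS\t"
    "DP=30;ANN=T|missense_variant|MODERATE|gyrA|gyrA|transcript|t1|"
    "protein_coding|1/1|c.248C>T|p.Ser83Leu|||||"
)
silent_line = (
    "locus_1\t300\t.\tG\tA\t50\tPASS\t"
    "DP=28;ANN=A|synonymous_variant|LOW|gyrA|gyrA|transcript|t1|"
    "protein_coding|1/1|c.300G>A|p.Val100Val|||||"
)

print(non_silent_genes(missense_line, {"gyrA"}))
print(non_silent_genes(silent_line, {"gyrA"}))
```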

We hope this study is a foundational resource for conducting large-scale genotype-phenotype studies that will offer valuable insights for genetics investigations, informing the development of treatments and prevention strategies for AMR. This resource is invaluable for microbiologists and epidemiologists seeking to understand AMR mechanisms and improve genotype-phenotype predictions in pathogenic E. coli outbreaks. Additionally, it's of particular interest to geneticists and evolutionary biologists, providing a dataset to develop strategies for studying genetic interactions and broader applications in phenotype-phenotype predictions and phylogenetic research.

Data organization

The data are provided in a single compressed folder (ecoli_7000_amr.zip), which is divided into two main folders.

The first folder, dataset_generation, includes the code and information necessary to build the genotype dataset and perform the variant calling. It covers major steps like the generation of the reference pangenome used for variant calling, the variant calling pipeline applied to each of the 7,000 strains, the filtering of false positive variants, and the annotation of the variants. 

The second folder, dataset_analysis, includes the code and information used to process and analyze the dataset and generate figures for the Pub (https://doi.org/10.57844/arcadia-d2cf-ebe5). It includes the preliminary analysis of AMR phenotypes within the population and the analysis of variants with respect to known AMR phenotypes.

 

Files description

The following table provides a list and description of the different files and their locations.

| File name | Location | Description |
|---|---|---|
| variant_calling_pipeline | dataset_generation/scripts/ | Snakefile: performs variant calling from raw paired-end sequencing files and generates one vcf.gz file per sample |
| snakemake_ECOR72_annotation | dataset_generation/scripts/ | Snakefile: performs Prokka annotation on input whole genome fastq files |
| ECOR72_and_DP_threshold_analysis.Rmd | dataset_generation/scripts/ | R markdown: analyses the coverage of known present and absent loci in the ECOR population |
| average_coverage_41.csv | dataset_generation/data/dp_threshold/ | Pangenome loci read coverage information for 40 ECOR strains |
| average_coverage_last32.csv | dataset_generation/data/dp_threshold/ | Pangenome loci read coverage information for 32 ECOR strains |
| whole_pan_ecor_presence_absence.csv | dataset_generation/data/dp_threshold/ | Reformatted pangenome loci presence-absence in ECOR strains |
| pangenome_genomes_SRA_GCA.csv | dataset_generation/data/dp_threshold/ | Correspondence table between ECOR72 strain genome names and the SRA accession numbers of the raw sequencing files |
| index_loci_pangenome_good.txt | dataset_generation/data/dp_threshold/ | List of indexed positions in the pangenome |
| list_ecor_txtfiles.txt | dataset_generation/data/dp_threshold/ | List of txt files (containing the DP information per nucleotide) to use; these correspond to the files for each of the 72 ECOR strains |
| ECOR72_SRA_and_assembly_accessions.csv | dataset_generation/data/ | List of genome accession numbers and SRA accession numbers of the associated sequencing files for the 72 ECOR strains |
| sample_list_SRA.csv | dataset_generation/data/ | List of SRA accession numbers of the E. coli strains used for variant calling |
| gene_presence_absence.csv | dataset_generation/results/pangenome_cds/ | Roary output of presence-absence of the pangenome cds loci in the ECOR72 strains |
| genes.gff | dataset_generation/results/pangenome_cds/ | Annotation file of the pangenome cds sequences (Prokka output) |
| pangenome_cds.fa | dataset_generation/results/pangenome_cds/ | Roary output cds_pangenome sequence file |
| summary_statistics.txt | dataset_generation/results/pangenome_cds/ | Roary statistics output from the creation of the cds pangenome |
| roary_output | dataset_generation/results/pangenome_cds/ | Roary output folder |
| IGR_presence_absence.csv | dataset_generation/results/pangenome_igr/ | Piggy output of presence-absence of the pangenome igr loci in the ECOR72 strains |
| pangenome_igr.fasta | dataset_generation/results/pangenome_igr/ | Piggy output igr_pangenome sequence file |
| piggy_output | dataset_generation/results/pangenome_igr/ | Piggy output folder |
| whole_pangenome.fasta | dataset_generation/results/pangenome_whole/ | Whole pangenome sequences |
| annot_summary_filtered.html | dataset_generation/results/vcf/ | Summary of snpEff annotations |
| annotated_output.vcf.gz | dataset_generation/results/vcf/ | snpEff annotated vcf file |
| annotated_output.vcf.gz.csi | dataset_generation/results/vcf/ | Index of the annotated vcf file |
| filtered_output.vcf.gz | dataset_generation/results/vcf/ | Filtered vcf file (low-coverage and low-quality variants removed) |
| filtered_output.vcf.gz.csi | dataset_generation/results/vcf/ | Index of the filtered vcf file |
| output.non_silent.vcf.gz | dataset_generation/results/vcf/ | vcf file containing only the non-silent variants in the pangenome cds loci |
| merged_output_listN.vcg.gz | dataset_generation/results/vcf/ | Intermediary vcf files, each merging the vcf files of 1000 strains; these intermediary merged files are numbered from 1 to 7 |
| merged_output_all.vcf.gz | dataset_generation/results/vcf/ | Final vcf.gz file merging all vcf files in this study |
| List_N_merging.txt | dataset_generation/data/vcf_merging/ | List of the 1000 vcf.gz files to be merged together. There are 7 lists, numbered from 1 to 7 |
| ecor72_array.txt | dataset_generation/results/ecor72_DP/ | Consolidated DP information per nucleotide for each ECOR strain |
| variants_pos.tsv | dataset_analysis/data/variant_analysis/ | List of all the variants found in the population, identified by their locus and position within the locus |
| allele_freqs.txt | dataset_analysis/data/variant_analysis/ | Variant frequency information |
| variants_non_silent_pos.tsv | dataset_analysis/data/variant_analysis/ | List of all the non-silent variants found in the population, identified by their locus and position within the locus |
| allele_non_silent_freqs.txt | dataset_analysis/data/variant_analysis/ | Non-silent variant frequency information |
| cds_eggNog.tsv | dataset_analysis/data/variant_analysis/ | eggNog output file of the pangenome annotation |
| COG_functional_categories.csv | dataset_analysis/data/variant_analysis/ | Correspondence between COG functional categories and higher-order annotation |
| BVBRC_genome_May31.csv | dataset_analysis/data/dataset_analysis | List of E. coli strains with available genomes as reported in the BVBRC database |
| BVBRC_genome_amr_May31.csv | dataset_analysis/data/dataset_analysis | E. coli antimicrobial resistance information available in the BVBRC database |
| antibiotic_class.csv | dataset_analysis/data/dataset_analysis | Antibiotic name and antibiotic class information |
| resistance_output.non_silent.vcf.gz | dataset_analysis/data/antimicrobial_resistance_analysis | vcf.gz file of the loci expected to be associated with antimicrobial resistance |
| antibiotic_resistance_freq.csv | dataset_analysis/data/antimicrobial_resistance_analysis | Frequency information for the non-silent variants in the selected antimicrobial resistance genes |
| SRA_to_genome_name.csv | dataset_analysis/data/antimicrobial_resistance_analysis | Correspondence between strain SRA accession number and genome name (as reported in BVBRC) |
| Dataset_metainfo_AMR_analysis.Rmd | dataset_analysis/scripts | R markdown: conducts the characterization of the population and analysis of the AMR phenotype distribution |
| Variant_population_analysis.Rmd | dataset_analysis/scripts | R markdown: conducts the analysis and investigation of identified variants in the population |
| Antimicrobial_resistance_investigation.Rmd | dataset_analysis/scripts | R markdown: conducts the antimicrobial resistance investigation |
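For a quick look at one of the vcf.gz files listed above without installing dedicated tools, a minimal Python sketch is shown below. It relies on the fact that bgzip-compressed files can be read as ordinary gzip streams, and it only counts samples and records. The function name is hypothetical, and the demonstration runs on a tiny made-up file written to a temporary folder; to use it on the dataset, point it at a path such as dataset_generation/results/vcf/filtered_output.vcf.gz after extracting the archive.

```python
import gzip
import os
import tempfile


def count_vcf_records(path):
    """Return (number of samples, number of variant records) in a vcf.gz file."""
    n_samples = 0
    n_records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Nine fixed columns (CHROM..FORMAT) precede the sample columns
                n_samples = len(line.rstrip("\n").split("\t")) - 9
                continue
            if line.strip():
                n_records += 1
    return n_samples, n_records


# Tiny made-up vcf used only to demonstrate the function
demo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n"
    "locus_1\t10\t.\tA\tG\t60\tPASS\tDP=20\tGT\t0\t1\n"
    "locus_2\t55\t.\tC\tT\t45\tPASS\tDP=18\tGT\t1\t1\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vcf.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write(demo_vcf)
    result = count_vcf_records(demo_path)

print(result)
```

Streaming line by line keeps memory use low, which matters for the merged files covering all strains.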

 

 

Files (6.1 GB)

ecoli_7000_amr.zip (6.1 GB), md5:545f528ea3133892a471e40e88cea7e2
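After downloading, the archive can be checked against the md5 checksum listed above. The sketch below is one way to do this in Python. The local path `ecoli_7000_amr.zip` is an assumption about where the file was saved, and the function name is hypothetical. The file is hashed in chunks so that the 6.1 GB archive does not need to fit in memory.

```python
import hashlib
import os
import tempfile

EXPECTED_MD5 = "545f528ea3133892a471e40e88cea7e2"  # checksum listed on this record
ARCHIVE_PATH = "ecoli_7000_amr.zip"  # assumed local download location


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-check of the helper on a small temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_digest = md5_of_file(demo_path)

# Verify the real archive only if it is present at the assumed path
if os.path.exists(ARCHIVE_PATH):
    if md5_of_file(ARCHIVE_PATH) == EXPECTED_MD5:
        print("Checksum matches")
    else:
        print("Checksum mismatch: the download may be incomplete or corrupted")
else:
    print("Archive not found at", ARCHIVE_PATH)
```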