Creating a 7,000 strains genotype-phenotype dataset of E. coli and antimicrobial resistance phenotypes

Morin, Manon

doi:10.5281/zenodo.12692732

Published August 21, 2024 | Version v1

Dataset Open

Morin, Manon (Data curator)

Contributors

Project manager: Mets, David G.

Description

This Zenodo repository contains the data and code needed to reproduce the study published at https://doi.org/10.57844/arcadia-d2cf-ebe5. It does not include the input fastq files, which are available on SRA, or the intermediary files generated during variant calling. The associated GitHub repository describes the code, pipelines, and analysis in more detail.

Work summary

In this work, we established a framework for compiling large genotype-phenotype datasets and produced a large-scale dataset of more than 7,000 E. coli strains and antimicrobial resistance phenotypes.

We leveraged the genetic information and antimicrobial resistance (AMR) phenotype data available for the bacterium Escherichia coli to construct our dataset and took advantage of the existing knowledge about genetic variations and AMR phenotypes to validate our approach and dataset. We performed variant calling and compiled a genotype-phenotype dataset for more than 7,000 E. coli strains. Briefly, variant calling consists of identifying all genetic variations and their associated genotypes in a population compared to a reference genome. This is performed by aligning sequencing reads for each strain of the population against a reference genome, then identifying polymorphic regions in the population, and finally characterizing variants and their genotypes at each of these polymorphic regions.
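As a toy illustration of the idea described above, the following Python sketch finds polymorphic positions in a small population and reports each strain's allele and the non-reference allele frequency. It is not the pipeline used in this work: it assumes each strain is already a sequence aligned base-for-base to the reference, whereas the real pipeline aligns sequencing reads. The function names, strain names, and sequences are all made up for the example.

```python
def call_variants(reference, aligned_strains):
    """Return {position: {strain: allele}} for polymorphic positions.

    Toy model: every strain sequence has the same length as the reference,
    so index i in a strain corresponds to index i in the reference.
    """
    variants = {}
    for pos, ref_base in enumerate(reference):
        alleles = {name: seq[pos] for name, seq in aligned_strains.items()}
        # A position is polymorphic if at least one strain differs from the reference.
        if any(base != ref_base for base in alleles.values()):
            variants[pos] = alleles
    return variants


def alt_allele_frequency(reference, variants):
    """Fraction of strains carrying a non-reference allele at each variant position."""
    return {
        pos: sum(base != reference[pos] for base in alleles.values()) / len(alleles)
        for pos, alleles in variants.items()
    }


# Hypothetical reference and strains, for illustration only.
reference = "ACGTACGT"
strains = {
    "strain_a": "ACGTACGT",
    "strain_b": "ACCTACGT",
    "strain_c": "ACCTACGA",
}

variants = call_variants(reference, strains)
freqs = alt_allele_frequency(reference, variants)
for pos in sorted(variants):
    print(pos, variants[pos], round(freqs[pos], 2))
```

Positions are 0-based here for simplicity; VCF files use 1-based positions.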

The resulting dataset reveals substantial genetic diversity, with 2.4 million variants identified. By focusing on non-silent variants within genes associated with AMR, we confirmed the dataset's accuracy.

We hope this study is a foundational resource for conducting large-scale genotype-phenotype studies that will offer valuable insights for genetics investigations, informing the development of treatments and prevention strategies for AMR. This resource is invaluable for microbiologists and epidemiologists seeking to understand AMR mechanisms and improve genotype-phenotype predictions in pathogenic E. coli outbreaks. Additionally, it's of particular interest to geneticists and evolutionary biologists, providing a dataset to develop strategies for studying genetic interactions and broader applications in phenotype-phenotype predictions and phylogenetic research.

Data organization

The data are provided as a single compressed archive (ecoli_7000_amr.zip), which is divided into two main folders.

The first folder, dataset_generation, includes the code and information necessary to build the genotype dataset and perform the variant calling. It covers major steps like the generation of the reference pangenome used for variant calling, the variant calling pipeline applied to each of the 7,000 strains, the filtering of false positive variants, and the annotation of the variants.

The second folder, dataset_analysis, includes the code and information used to process and analyze the dataset and generate figures for the Pub (https://doi.org/10.57844/arcadia-d2cf-ebe5). It includes the preliminary analysis of AMR phenotypes within the population and the analysis of variants regarding known AMR phenotypes.

Files description

The following table provides a list and description of the different files and their locations.

File name	Location	Description
variant_calling_pipeline	dataset_generation/scripts/	Snakefile: performs variant calling from raw paired-end sequencing files and generates one vcf.gz file per sample
snakemake_ECOR72_annotation	dataset_generation/scripts/	Snakefile: performs Prokka annotation on input whole-genome fastq files
ECOR72_and_DP_threshold_analysis.Rmd	dataset_generation/scripts/	R markdown: analyzes the coverage of loci known to be present or absent in the ECOR population
average_coverage_41.csv	dataset_generation/data/dp_threshold/	Pangenome loci read coverage information for 40 ECOR strains
average_coverage_last32.csv	dataset_generation/data/dp_threshold/	Pangenome loci read coverage information for 32 ECOR strains
whole_pan_ecor_presence_absence.csv	dataset_generation/data/dp_threshold/	Reformatted pangenome loci presence-absence in ECOR strains
pangenome_genomes_SRA_GCA.csv	dataset_generation/data/dp_threshold/	Correspondence table between ECOR72 strain genome names and raw sequencing file SRA accession numbers
index_loci_pangenome_good.txt	dataset_generation/data/dp_threshold/	List of indexed positions in the pangenome
list_ecor_txtfiles.txt	dataset_generation/data/dp_threshold/	List of txt files (containing the DP information per nucleotide) to use - This corresponds to the files for each of the 72 ECOR strains
ECOR72_SRA_and_assembly_accessions.csv	dataset_generation/data/	Genome accession numbers and SRA accession numbers of the associated sequencing files for the 72 ECOR strains
sample_list_SRA.csv	dataset_generation/data/	List of SRA accession numbers of the E. coli strains used for variant calling
gene_presence_absence.csv	dataset_generation/results/pangenome_cds/	Roary output of presence-absence of the pangenome cds loci in the ECOR72 strains
genes.gff	dataset_generation/results/pangenome_cds/	Annotation file of the pangenome cds sequences (Prokka output)
pangenome_cds.fa	dataset_generation/results/pangenome_cds/	Roary output cds_pangenome sequence file
summary_statistics.txt	dataset_generation/results/pangenome_cds/	Roary summary statistics for the creation of the cds pangenome
roary_output	dataset_generation/results/pangenome_cds/	Roary output folder
IGR_presence_absence.csv	dataset_generation/results/pangenome_igr/	Piggy output of presence-absence of the pangenome igr loci in the ECOR72 strains
pangenome_igr.fasta	dataset_generation/results/pangenome_igr/	Piggy output igr_pangenome sequence file
piggy_output	dataset_generation/results/pangenome_igr/	Piggy output folder
whole_pangenome.fasta	dataset_generation/results/pangenome_whole/	Whole pangenome sequences
annot_summary_filtered.html	dataset_generation/results/vcf/	Summary of snpEff annotations
annotated_output.vcf.gz	dataset_generation/results/vcf/	snpEff-annotated vcf file
annotated_output.vcf.gz.csi	dataset_generation/results/vcf/	Index of the annotated vcf file
filtered_output.vcf.gz	dataset_generation/results/vcf/	Filtered vcf file (low-coverage and low-quality variants removed)
filtered_output.vcf.gz.csi	dataset_generation/results/vcf/	Index of the filtered vcf file
output.non_silent.vcf.gz	dataset_generation/results/vcf/	vcf file containing only the non-silent variants in the pangenome cds loci
merged_output_listN.vcg.gz	dataset_generation/results/vcf/	Intermediary vcf files, each merging the vcf files of 1,000 strains; these intermediary merged files are numbered from 1 to 7
merged_output_all.vcf.gz	dataset_generation/results/vcf/	Final vcf.gz file merging all vcf files in this study
List_N_merging.txt	dataset_generation/data/vcf_merging/	List of the 1,000 vcf.gz files to be merged together. There are 7 lists, numbered from 1 to 7
ecor72_array.txt	dataset_generation/results/ecor72_DP/	Consolidated DP information per nucleotide for each ECOR strain
variants_pos.tsv	dataset_analysis/data/variant_analysis/	List of all the variants found in the population and identified by their locus and position within the locus
allele_freqs.txt	dataset_analysis/data/variant_analysis/	Variant frequency information
variants_non_silent_pos.tsv	dataset_analysis/data/variant_analysis/	List of all the non-silent variants found in the population and identified by their locus and position within the locus
allele_non_silent_freqs.txt	dataset_analysis/data/variant_analysis/	Non-silent variant frequency information
cds_eggNog.tsv	dataset_analysis/data/variant_analysis/	eggNog output file of the pangenome annotation
COG_functional_categories.csv	dataset_analysis/data/variant_analysis/	Correspondence between COG functional categories and higher-order annotation
BVBRC_genome_May31.csv	dataset_analysis/data/dataset_analysis	List of E. coli strains with available genomes as reported in the BVBRC database
BVBRC_genome_amr_May31.csv	dataset_analysis/data/dataset_analysis	E. coli antimicrobial resistance information available in the BVBRC database
antibiotic_class.csv	dataset_analysis/data/dataset_analysis	Antibiotic name and antibiotic class information
resistance_output.non_silent.vcf.gz	dataset_analysis/data/antimicrobial_resistance_analysis	vcf.gz file of the loci expected to be associated with antimicrobial resistance
antibiotic_resistance_freq.csv	dataset_analysis/data/antimicrobial_resistance_analysis	Frequency information for the non-silent variants in the selected antimicrobial resistance genes
SRA_to_genome_name.csv	dataset_analysis/data/antimicrobial_resistance_analysis	Correspondence between strain SRA accession number and genome name (as reported in BVBRC)
Dataset_metainfo_AMR_analysis.Rmd	dataset_analysis/scripts	R markdown: conducts the characterization of the population and analysis of the AMR phenotype distribution
Variant_population_analysis.Rmd	dataset_analysis/scripts	R markdown: conducts the analysis and investigation of identified variants in the population
Antimicrobial_resistance_investigation.Rmd	dataset_analysis/scripts	R markdown: conducts the antimicrobial resistance investigation
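
For a quick look at any of the vcf.gz files listed above without specialized tools, a minimal Python sketch such as the following can report the sample names and the number of variant records. It assumes the files are gzip-compatible and follow the standard VCF column layout; the function name `summarize_vcf` and the tiny demo file are hypothetical and exist only so the sketch runs without the archive.

```python
import gzip
import os
import tempfile


def summarize_vcf(path):
    """Return (sample_names, n_records) for a gzip-compressed VCF file."""
    samples, n_records = [], 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # The first 9 columns are fixed; sample names start at column 10.
                samples = line.rstrip("\n").split("\t")[9:]
            elif line.strip():
                n_records += 1  # one line per variant record
    return samples, n_records


# Made-up two-sample VCF used only to demonstrate the function.
demo_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tstrain_a\tstrain_b",
    "locus_1\t12\t.\tA\tG\t50\tPASS\t.\tGT\t0\t1",
    "locus_2\t40\t.\tC\tT\t60\tPASS\t.\tGT\t1\t1",
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("\n".join(demo_lines) + "\n")

samples, n_records = summarize_vcf(demo_path)
print(samples, n_records)
```

On the real data, the same function could be pointed at a path such as dataset_generation/results/vcf/output.non_silent.vcf.gz inside the extracted archive. Reading line by line keeps memory use low, though the larger merged files may still take a while to scan.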

Files

Files (6.1 GB)

Name	Size
ecoli_7000_amr.zip (md5:545f528ea3133892a471e40e88cea7e2)	6.1 GB
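
Because the archive is large, it can be worth checking the download against the md5 checksum listed above. The following is a minimal sketch using Python's standard library; the function names and the local file path are assumptions for illustration, and the file is read in chunks so the 6.1 GB archive never has to fit in memory.

```python
import hashlib

# Checksum taken from the file listing of this record.
EXPECTED_MD5 = "545f528ea3133892a471e40e88cea7e2"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected md5 checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("ecoli_7000_amr.zip")` would return True for an intact download saved under that (assumed) local name.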
