Creating a 7,000 strains genotype-phenotype dataset of E. coli and antimicrobial resistance phenotypes
Description
This Zenodo repository contains the data and code needed to reproduce the study published at https://doi.org/10.57844/arcadia-d2cf-ebe5. It does not include the input fastq files, which are available on SRA, or the intermediary files generated during the variant calling process. The associated GitHub repository describes the code, pipelines, and analysis in more detail.
Work summary
In this work, we established a framework for compiling large genotype-phenotype datasets and produced a large-scale dataset of more than 7,000 E. coli strains and antimicrobial resistance phenotypes.
We used the genetic information and antimicrobial resistance (AMR) phenotype data available for the bacterium Escherichia coli to construct the dataset, and relied on existing knowledge about genetic variation and AMR phenotypes to validate both the approach and the dataset. Briefly, variant calling consists of identifying all genetic variations, and their associated genotypes, in a population compared to a reference genome. It proceeds in three steps: aligning the sequencing reads of each strain against the reference genome, identifying polymorphic regions in the population, and characterizing the variants and their genotypes at each of these polymorphic regions.
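The idea behind the last two steps can be illustrated with a toy sketch. It assumes each strain has already been reduced to a sequence aligned to the reference, which skips the read alignment step entirely; the sequences, strain names, and the `call_variants` helper are invented for illustration and are not part of the study's pipeline.

```python
# Toy illustration only: real pipelines align sequencing reads with dedicated
# tools. Here each strain is assumed to be a sequence already aligned to the
# reference (hypothetical data).
reference = "ATGCGTACGT"
strains = {
    "strain_A": "ATGCGTACGT",  # identical to the reference
    "strain_B": "ATGAGTACGT",  # differs at position 4
    "strain_C": "ATGAGTACCT",  # differs at positions 4 and 9
}


def call_variants(reference, strains):
    """Return {position: {strain: allele}} for every polymorphic position."""
    variants = {}
    for pos, ref_base in enumerate(reference, start=1):  # 1-based, as in VCF
        alleles = {name: seq[pos - 1] for name, seq in strains.items()}
        # A position is polymorphic if any strain differs from the reference.
        if any(base != ref_base for base in alleles.values()):
            variants[pos] = alleles
    return variants


variants = call_variants(reference, strains)
for pos, alleles in variants.items():
    print(pos, reference[pos - 1], alleles)
```

Each entry of `variants` plays the role of one record in a multi-sample VCF file: a position, the reference allele, and one genotype per strain.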
The resulting dataset reveals substantial genetic diversity, with 2.4 million variants identified across the population. We confirmed the dataset's accuracy by focusing on non-silent variants within genes associated with AMR.
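As a rough sketch of how non-silent variants might be selected from a snpEff-annotated VCF, the snippet below inspects the `ANN` field that snpEff adds to the INFO column. It assumes that "non-silent" can be approximated by a snpEff putative impact of HIGH or MODERATE; the study's exact criterion may differ, and the records, locus names, and gene names are invented.

```python
# Assumption: "non-silent" is approximated as snpEff impact HIGH or MODERATE.
NON_SILENT_IMPACTS = {"HIGH", "MODERATE"}


def is_non_silent(vcf_line):
    """Return True if any snpEff ANN entry of this record has a non-silent impact."""
    info = vcf_line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
    for field in info.split(";"):
        if field.startswith("ANN="):
            # ANN entries are comma-separated; their sub-fields are
            # pipe-separated and the third sub-field is the putative impact.
            for entry in field[4:].split(","):
                parts = entry.split("|")
                if len(parts) > 2 and parts[2] in NON_SILENT_IMPACTS:
                    return True
    return False


# Hypothetical records; only the first eight VCF columns are shown.
records = [
    "locus_1\t120\t.\tA\tG\t50\tPASS\tANN=G|missense_variant|MODERATE|geneA",
    "locus_1\t300\t.\tC\tT\t60\tPASS\tANN=T|synonymous_variant|LOW|geneA",
    "locus_2\t45\t.\tG\tA\t70\tPASS\tANN=A|stop_gained|HIGH|geneB",
    "locus_3\t10\t.\tT\tC\t40\tPASS\tANN=C|upstream_gene_variant|MODIFIER|geneC",
]

non_silent = [r for r in records if is_non_silent(r)]
print(len(non_silent), "of", len(records), "records kept")
```

Filtering on the impact sub-field rather than on individual effect names keeps the rule short, at the cost of relying on snpEff's own impact classification.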
We hope this study is a foundational resource for conducting large-scale genotype-phenotype studies that will offer valuable insights for genetics investigations, informing the development of treatments and prevention strategies for AMR. This resource is invaluable for microbiologists and epidemiologists seeking to understand AMR mechanisms and improve genotype-phenotype predictions in pathogenic E. coli outbreaks. Additionally, it's of particular interest to geneticists and evolutionary biologists, providing a dataset to develop strategies for studying genetic interactions and broader applications in phenotype-phenotype predictions and phylogenetic research.
Data organization
The data are provided as a single compressed folder, which is divided into two main folders.
The first folder, dataset_generation, includes the code and information necessary to build the genotype dataset and perform the variant calling. It covers major steps like the generation of the reference pangenome used for variant calling, the variant calling pipeline applied to each of the 7,000 strains, the filtering of false positive variants, and the annotation of the variants.
The second folder, dataset_analysis, includes the code and information used to process and analyze the dataset and to generate the figures for the Pub (https://doi.org/10.57844/arcadia-d2cf-ebe5). It includes the preliminary analysis of AMR phenotypes within the population and the analysis of variants with respect to known AMR phenotypes.
Files description
The following table provides a list and description of the different files and their locations.
File name | Location | Description |
---|---|---|
variant_calling_pipeline | dataset_generation/scripts/ | Snakefile: performs variant calling from raw paired-end sequencing files and generates one vcf.gz file per sample |
snakemake_ECOR72_annotation | dataset_generation/scripts/ | Snakefile: performs Prokka annotation on input whole-genome fastq files |
ECOR72_and_DP_threshold_analysis.Rmd | dataset_generation/scripts/ | R markdown: analyzes the coverage of known present and absent loci in the ECOR population |
average_coverage_41.csv | dataset_generation/data/dp_threshold/ | Pangenome loci read coverage information for 40 ECOR strains |
average_coverage_last32.csv | dataset_generation/data/dp_threshold/ | Pangenome loci read coverage information for 32 ECOR strains |
whole_pan_ecor_presence_absence.csv | dataset_generation/data/dp_threshold/ | Reformatted pangenome loci presence-absence in ECOR strains |
pangenome_genomes_SRA_GCA.csv | dataset_generation/data/dp_threshold/ | Correspondence table between ECOR72 strain genome names and raw sequencing file SRA accession numbers |
index_loci_pangenome_good.txt | dataset_generation/data/dp_threshold/ | List of indexed positions in the pangenome |
list_ecor_txtfiles.txt | dataset_generation/data/dp_threshold/ | List of txt files (containing the DP information per nucleotide) to use, one for each of the 72 ECOR strains |
ECOR72_SRA_and_assembly_accessions.csv | dataset_generation/data/ | List of genome accession numbers and SRA accession numbers of the associated sequencing files for the 72 ECOR strains |
sample_list_SRA.csv | dataset_generation/data/ | List of SRA accession numbers of the E. coli strains used for variant calling |
gene_presence_absence.csv | dataset_generation/results/pangenome_cds/ | Roary output: presence-absence of the pangenome cds loci in the ECOR72 strains |
genes.gff | dataset_generation/results/pangenome_cds/ | Annotation file of the pangenome cds sequences (Prokka output) |
pangenome_cds.fa | dataset_generation/results/pangenome_cds/ | Roary output: cds pangenome sequence file |
summary_statistics.txt | dataset_generation/results/pangenome_cds/ | Roary output: statistics on the creation of the cds pangenome |
roary_output | dataset_generation/results/pangenome_cds/ | Roary output folder |
IGR_presence_absence.csv | dataset_generation/results/pangenome_igr/ | Piggy output: presence-absence of the pangenome igr loci in the ECOR72 strains |
pangenome_igr.fasta | dataset_generation/results/pangenome_igr/ | Piggy output: igr pangenome sequence file |
piggy_output | dataset_generation/results/pangenome_igr/ | Piggy output folder |
whole_pangenome.fasta | dataset_generation/results/pangenome_whole/ | Whole pangenome sequences |
annot_summary_filtered.html | dataset_generation/results/vcf/ | Summary of snpEff annotations |
annotated_output.vcf.gz | dataset_generation/results/vcf/ | snpEff-annotated vcf file |
annotated_output.vcf.gz.csi | dataset_generation/results/vcf/ | Index of the annotated vcf file |
filtered_output.vcf.gz | dataset_generation/results/vcf/ | Filtered vcf file (low-coverage and low-quality variants removed) |
filtered_output.vcf.gz.csi | dataset_generation/results/vcf/ | Index of the filtered vcf file |
output.non_silent.vcf.gz | dataset_generation/results/vcf/ | vcf file containing only the non-silent variants in the pangenome cds loci |
merged_output_listN.vcg.gz | dataset_generation/results/vcf/ | Intermediary vcf files, each merging the vcf files of 1000 strains; these intermediary merged files are numbered from 1 to 7 |
merged_output_all.vcf.gz | dataset_generation/results/vcf/ | Final vcf.gz file merging all vcf files in this study |
List_N_merging.txt | dataset_generation/data/vcf_merging/ | List of the 1000 vcf.gz files to be merged together. There are 7 lists, numbered from 1 to 7 |
ecor72_array.txt | dataset_generation/results/ecor72_DP/ | Consolidated DP information per nucleotide for each ECOR strain |
variants_pos.tsv | dataset_analysis/data/variant_analysis/ | List of all the variants found in the population, identified by their locus and position within the locus |
allele_freqs.txt | dataset_analysis/data/variant_analysis/ | Variant frequency information |
variants_non_silent_pos.tsv | dataset_analysis/data/variant_analysis/ | List of all the non-silent variants found in the population, identified by their locus and position within the locus |
allele_non_silent_freqs.txt | dataset_analysis/data/variant_analysis/ | Non-silent variant frequency information |
cds_eggNog.tsv | dataset_analysis/data/variant_analysis/ | eggNOG output file of the pangenome annotation |
COG_functional_categories.csv | dataset_analysis/data/variant_analysis/ | Correspondence between COG functional categories and higher-order annotation |
BVBRC_genome_May31.csv | dataset_analysis/data/dataset_analysis | List of E. coli strains with available genomes as reported in the BV-BRC database |
BVBRC_genome_amr_May31.csv | dataset_analysis/data/dataset_analysis | E. coli antimicrobial resistance information available in the BV-BRC database |
antibiotic_class.csv | dataset_analysis/data/dataset_analysis | Antibiotic name and antibiotic class information |
resistance_output.non_silent.vcf.gz | dataset_analysis/data/antimicrobial_resistance_analysis | vcf.gz file of the loci expected to be associated with antimicrobial resistance |
antibiotic_resistance_freq.csv | dataset_analysis/data/antimicrobial_resistance_analysis | Frequency information for the non-silent variants in the selected antimicrobial resistance genes |
SRA_to_genome_name.csv | dataset_analysis/data/antimicrobial_resistance_analysis | Correspondence between strain SRA accession number and genome name (as reported in BV-BRC) |
Dataset_metainfo_AMR_analysis.Rmd | dataset_analysis/scripts | R markdown: characterizes the population and analyzes the AMR phenotype distribution |
Variant_population_analysis.Rmd | dataset_analysis/scripts | R markdown: analyzes and investigates the variants identified in the population |
Antimicrobial_resistance_investigation.Rmd | dataset_analysis/scripts | R markdown: conducts the antimicrobial resistance investigation |
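For readers who want to inspect the vcf.gz files without dedicated tooling, the following is a minimal sketch of computing alternate-allele frequencies from a multi-sample VCF using only the Python standard library. It assumes standard VCF columns with a `GT` key in the FORMAT column; the in-memory demo file and the `alt_allele_frequencies` helper are hypothetical and do not reproduce how `allele_freqs.txt` was generated.

```python
import gzip
import io


def alt_allele_frequencies(vcf_handle):
    """Yield (locus, position, alt_frequency) for each record of a VCF text stream."""
    for line in vcf_handle:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        cols = line.rstrip("\n").split("\t")
        gt_index = cols[8].split(":").index("GT")  # FORMAT is the 9th column
        called = alt = 0
        for sample in cols[9:]:
            genotype = sample.split(":")[gt_index]
            # Handle haploid ("1") and diploid-style ("0/1", "0|1") genotypes.
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":  # missing call
                    continue
                called += 1
                alt += allele != "0"
        yield cols[0], int(cols[1]), (alt / called if called else None)


# Hypothetical four-sample VCF, compressed in memory to mimic a vcf.gz file.
demo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\ts4\n"
    "locus_1\t120\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0:30\t1:25\t1:40\t.:0\n"
    "locus_2\t45\t.\tC\tT\t60\tPASS\t.\tGT:DP\t0:22\t0:31\t1:28\t0:35\n"
)
compressed = gzip.compress(demo_vcf.encode())

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    freqs = list(alt_allele_frequencies(handle))

for locus, pos, freq in freqs:
    print(locus, pos, freq)
```

To try this on a file from the archive, replace the in-memory buffer with a path such as `dataset_generation/results/vcf/filtered_output.vcf.gz`. A pure-Python parser is likely to be slow on a file covering 7,000 strains, so a dedicated VCF tool is probably a better choice for full-scale analysis.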
Files
Name | Size |
---|---|
ecoli_7000_amr.zip (md5:545f528ea3133892a471e40e88cea7e2) | 6.1 GB |