Published December 11, 2021 | Version 1
Dataset Open

Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes

  • 1. Université de Paris, INSERM, IAME
  • 2. Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne - Swiss Institute of Bioinformatics
  • 3. Université de Paris, INSERM, IAME - Laboratoire de Bactériologie, Hôpital Bichat, APHP
  • 4. Université de Paris, INSERM, IAME - Université Sorbonne Paris Nord
  • 5. Sorbonne Université, CNRS, Institut de Biologie Paris Seine, LCQB

Description

We use computational models based on Direct Coupling Analysis (DCA), trained on PFAM domains of distant homologues, to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes.

We show that the genetic context (i.e. the rest of the protein sequence) strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. Our study also suggests the gradual build-up of genetic context over long evolutionary timescales by the accumulation of small epistatic contributions.
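To make the notion of genetic context concrete, here is a minimal sketch of how a DCA-style (Potts) model scores a sequence, using the standard statistical energy E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j), where lower energy means a more probable sequence. It is an illustration only, not the code from the repository linked below: the function names and the two-site toy parameters are hypothetical and were chosen so that the same substitution is disfavoured in one context and favoured in another.

```python
from itertools import combinations


def potts_energy(seq, fields, couplings):
    """Statistical energy E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    energy = -sum(fields[i][aa] for i, aa in enumerate(seq))
    for i, j in combinations(range(len(seq)), 2):
        # Missing coupling entries are treated as zero.
        energy -= couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return energy


def mutation_effect(seq, site, new_aa, fields, couplings):
    """Energy change caused by substituting new_aa at `site`; > 0 is disfavoured."""
    mutant = seq[:site] + new_aa + seq[site + 1:]
    return potts_energy(mutant, fields, couplings) - potts_energy(seq, fields, couplings)


# Hypothetical toy parameters for a two-site "protein" (not taken from the dataset).
fields = [{"A": 0.5, "V": 0.0}, {"L": 0.2, "K": 0.0}]
couplings = {(0, 1): {("V", "K"): 1.0}}

# The same A -> V substitution at site 0, scored in two different contexts.
effect_in_L_context = mutation_effect("AL", 0, "V", fields, couplings)
effect_in_K_context = mutation_effect("AK", 0, "V", fields, couplings)
```

With these toy values the substitution raises the energy when site 1 carries L and lowers it when site 1 carries K. A model without couplings would assign the same effect in both contexts.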

Please refer to the README file for additional information on the structure of this dataset.

Code to analyse this dataset is available at https://github.com/GiancarloCroce/DCA_polymorphism_Ecoli.

 

Notes

Our work was partially funded by the French Agence Nationale pour la Recherche ANR GeWiEp (ANR-18-CE35-0005-01, to L.V. and O.T.), the French Fondation pour la Recherche Médicale (EQU201903007848, to L.V. and O.T.), the PhD program AMX of École polytechnique and Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation (to L.V.), and EU H2020 Research and Innovation Programme MSCA-RISE-2016 (Grant Agreement No. 734439 InferNet, to M.W.).

Files

Files (13.0 GB)

md5:fa121207cb51a9614f0352601b7a876e (13.0 GB)
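Because the archive is large, it may be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using Python's standard hashlib module; the helper names are hypothetical, and the path of the downloaded file is left to the caller since the record does not show a file name here.

```python
import hashlib

# Checksum as listed in this record.
EXPECTED_MD5 = "fa121207cb51a9614f0352601b7a876e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file at `path` has the checksum listed in the record."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use constant, which matters for a 13.0 GB file. Calling matches_record with the local path of the downloaded archive returns False if the transfer was truncated or corrupted.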

Additional details

Funding

Agence Nationale de la Recherche
GeWiEp - Bacterial genome wide epistasis: extant, emergence and molecular bases. ANR-18-CE35-0005