Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes
Creators
- 1. Université de Paris, INSERM, IAME
- 2. Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne - Swiss Institute of Bioinformatics
- 3. Université de Paris, INSERM, IAME - Laboratoire de Bactériologie, Hôpital Bichat, APHP
- 4. Université de Paris, INSERM, IAME - Université Sorbonne Paris Nord
- 5. Sorbonne Université, CNRS, Institut de Biologie Paris Seine, LCQB
Description
We use computational models based on Direct Coupling Analysis (DCA), trained on PFAM domains of distant homologues, to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes.
We show that the genetic context (i.e. the rest of the protein sequence) strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. Our study also suggests the gradual build-up of genetic context over long evolutionary timescales by the accumulation of small epistatic contributions.
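To make the notion of genetic context concrete, here is a minimal sketch of how a DCA (Potts) model scores the amino acids at one site given the rest of the sequence. It assumes the standard Potts parametrisation with site fields `h` and pairwise couplings `J`. The function name, array layout and random toy parameters are illustrative assumptions, not the code or the trained models distributed with this dataset.

```python
import numpy as np

def site_conditional_probs(seq, i, h, J):
    """Context-dependent probabilities P(a_i = a | rest of seq) in a Potts model.

    seq: integer-encoded sequence of length L (values in 0..q-1)
    h:   site fields, shape (L, q)
    J:   pairwise couplings, shape (L, L, q, q); J[i, j, a, b] couples
         amino acid a at site i with amino acid b at site j
    (hypothetical layout, chosen for this illustration)
    """
    logits = h[i].astype(float)  # astype returns a copy, h is not modified
    for j in range(len(seq)):
        if j != i:
            # Add the epistatic contribution of the residue present at site j
            logits += J[i, j, :, seq[j]]
    logits -= logits.max()       # stabilise the exponential
    p = np.exp(logits)
    return p / p.sum()

# Toy example with random parameters (not fitted to any real alignment)
rng = np.random.default_rng(0)
L, q = 5, 21                     # 20 amino acids plus a gap symbol
h = rng.normal(size=(L, q))
J = rng.normal(scale=0.1, size=(L, L, q, q))
seq = rng.integers(0, q, size=L)

context_probs = site_conditional_probs(seq, 2, h, J)
# Context-free reference: the same site with all couplings switched off
context_free_probs = site_conditional_probs(seq, 2, h, np.zeros_like(J))
```

Comparing `context_probs` with `context_free_probs` shows how many small coupling terms can add up to shift which amino acids are favoured at a site, which is the kind of context dependence the description refers to.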
Please refer to the README file for additional information on the structure of this dataset.
Code to analyse this dataset is available at https://github.com/GiancarloCroce/DCA_polymorphism_Ecoli.
Files
(13.0 GB)
File checksum | Size |
---|---|
md5:fa121207cb51a9614f0352601b7a876e | 13.0 GB |
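Because the archive is large (13.0 GB), it may be worth checking the download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library. The archive file name is not given in this record, so the path in the usage note is a placeholder you would replace with your local file.

```python
import hashlib

# Checksum as listed in the Files section of this record
EXPECTED_MD5 = "fa121207cb51a9614f0352601b7a876e"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

Usage, assuming the archive was saved locally: `verify_download("path/to/downloaded_archive")` returns `True` only if the file is intact.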
Additional details
Funding
- Agence Nationale de la Recherche
- GeWiEp - Bacterial genome wide epistasis: extant, emergence and molecular bases. ANR-18-CE35-0005