Genes and sites under adaptation at the phylogenetic scale also exhibit adaptation at the population-genetic scale
- 1. Université de Lausanne
- 2. Department of Biology, Carleton University, Ottawa, Canada
- 3. Université de Lyon, CNRS, LBBE UMR 5558, Villeurbanne, France
Description
Published in:
Proceedings of the National Academy of Sciences, Volume 120, Issue 11, March 2023, Pages e2214977120,
https://doi.org/10.1073/pnas.2214977120
This Zenodo repository contains the mammalian dataset that can be used with the AdaptaPop pipeline. Scripts and instructions necessary to reproduce the empirical experiments are detailed in https://github.com/ThibaultLatrille/AdaptaPop.
I. The archive file OrthoMam.zip must be extracted inside the folder OrthoMam. It contains the input data at the mammalian scale (alignments, trees, annotations) and the output data (estimates of ω and ω0).
II. The archive file Polymorphism.zip must be extracted inside the folder Polymorphism. It contains the output data (vcf.gz and tsv.gz) for each population. Each vcf file contains the SNPs for which it was possible to infer the ancestral and derived codons.
Once both OrthoMam.zip and Polymorphism.zip are extracted, the Snakemake workflow inside the folder Contrasts can be run to contrast the rate of adaptation at the phylogenetic and population scales.
III. The file GeneTable.tsv is a tsv file containing ωAphy for each gene. The file contains the following columns:
- ENSG is the gene ID on Ensembl shared by all species (in the file name of OrthoMam alignment).
- ω_lower is the lower bound of the 95% posterior credible interval for ω.
- ω is the posterior mean for ω.
- ω_upper is the upper bound of the 95% posterior credible interval for ω.
- ω0_lower is the lower bound of the 95% posterior credible interval for ω0.
- ω0 is the posterior mean for ω0.
- ω0_upper is the upper bound of the 95% posterior credible interval for ω0.
- ωA_phy is the posterior mean for ωAphy.
- category is the classification of the gene (unclassified, nearly-neutral, adaptive).
- TRID is the transcript ID of the gene, specific to the focal species (found in the .xml files of OrthoMam).
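The column layout above can be read with a standard tsv parser. The following is a minimal sketch, assuming the header row of GeneTable.tsv uses exactly the column names listed above (this is not verified against the file). The rows, gene IDs and transcript IDs in the sample are made up for illustration, and `genes_in_category` is a hypothetical helper, not part of the AdaptaPop pipeline.

```python
import csv
import io

# Made-up rows for illustration only; these are NOT values from GeneTable.tsv.
# The header names are assumed to match the column names in the description.
SAMPLE_TABLE = (
    "ENSG\tω_lower\tω\tω_upper\tω0_lower\tω0\tω0_upper\tωA_phy\tcategory\tTRID\n"
    "GENE_A\t0.25\t0.30\t0.35\t0.08\t0.10\t0.12\t0.20\tadaptive\tTRANSCRIPT_A\n"
    "GENE_B\t0.08\t0.10\t0.12\t0.08\t0.10\t0.12\t0.00\tnearly-neutral\tTRANSCRIPT_B\n"
    "GENE_C\t0.10\t0.15\t0.20\t0.09\t0.11\t0.13\t0.04\tunclassified\tTRANSCRIPT_C\n"
)


def genes_in_category(handle, category):
    """Return the ENSG identifiers of all genes with the given category."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["ENSG"] for row in reader if row["category"] == category]


adaptive_genes = genes_in_category(io.StringIO(SAMPLE_TABLE), "adaptive")
print(adaptive_genes)
```

With the real file, the in-memory sample would be replaced by something like `open("GeneTable.tsv", encoding="utf-8", newline="")`, again assuming the header names match.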
IV. The archive file MK_statistics.gz contains one tsv file per population, from which ωA (McDonald & Kreitman) can be computed at the population level for each gene. Each tsv file contains the following columns:
- ENSG is the gene ID on Ensembl shared by all species (in the file name of OrthoMam alignment).
- NAME is the gene name shared by all species (in the file name of OrthoMam alignment).
- TRID is the transcript ID of the gene, specific to the focal species (found in the .xml files of OrthoMam).
- CHR is the chromosome on which the gene is located.
- STRAND is the strand on which the gene is located (+ if the same as the reference genome, - otherwise).
- L_non_syn is the number of non-synonymous sites on which the substitutions and polymorphisms are called.
- D_non_syn is the number of non-synonymous substitutions (can be 0).
- P_non_syn is the number of non-synonymous polymorphisms (can be 0).
- L_syn is the number of synonymous sites on which the substitutions and polymorphisms are called.
- D_syn is the number of synonymous substitutions (can be 0).
- P_syn is the number of synonymous polymorphisms (can be 0).
From these columns, the following statistics can be computed for a group of genes, after summing each of the D, L and P counts over the genes in the group:
- dN is computed as D_non_syn / L_non_syn.
- dS is computed as D_syn / L_syn.
- πN is computed as P_non_syn / L_non_syn.
- πS is computed as P_syn / L_syn.
- ωA is computed as dN/dS - πN/πS.
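The computation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the tsv header uses the column names listed above. The counts in the sample are made up, and `omega_a` is a hypothetical helper rather than the function used in the AdaptaPop pipeline. Counts are summed over the group of genes first, and the ratios are taken afterwards.

```python
import csv
import io

COUNT_COLUMNS = ("L_non_syn", "D_non_syn", "P_non_syn", "L_syn", "D_syn", "P_syn")

# Made-up counts for two genes, for illustration only (not real data).
SAMPLE_MK = (
    "ENSG\tL_non_syn\tD_non_syn\tP_non_syn\tL_syn\tD_syn\tP_syn\n"
    "GENE_A\t300\t6\t2\t100\t10\t5\n"
    "GENE_B\t700\t14\t3\t200\t20\t10\n"
)


def omega_a(rows):
    """Compute ωA = dN/dS - πN/πS for a group of genes from summed counts."""
    totals = {column: 0.0 for column in COUNT_COLUMNS}
    for row in rows:
        for column in COUNT_COLUMNS:
            totals[column] += float(row[column])

    d_n = totals["D_non_syn"] / totals["L_non_syn"]
    d_s = totals["D_syn"] / totals["L_syn"]
    pi_n = totals["P_non_syn"] / totals["L_non_syn"]
    pi_s = totals["P_syn"] / totals["L_syn"]

    # D_syn and P_syn can be 0 for small groups, which leaves ωA undefined.
    if d_s == 0 or pi_s == 0:
        raise ValueError("dS or πS is zero; ωA is undefined for this group")
    return d_n / d_s - pi_n / pi_s


rows = list(csv.DictReader(io.StringIO(SAMPLE_MK), delimiter="\t"))
group_omega_a = omega_a(rows)
print(group_omega_a)  # approximately 0.1 for these made-up counts
```

Summing before dividing matters because single genes often have zero synonymous substitutions or polymorphisms, in which case per-gene ratios would be undefined.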
Files
MK_statistics.zip
Additional details
Funding
- NeGA – Influence of effective population size on animal genome architecture ANR-20-CE02-0008
- Agence Nationale de la Recherche
- DaSiRe – Exploring the Dark Side of Recombination ANR-15-CE12-0010
- Agence Nationale de la Recherche
- HotRec – Origin of PRDM9-dependent meiotic hotspots: where, how and why recombine? ANR-19-CE12-0019
- Agence Nationale de la Recherche