Protein and DNA alignments for ILS and Entropy calculations
Description
Data and results for reproducing the Entropy and ILS calculations described in "Assessing the potential of ancient protein sequences in the study of hominid evolution" (https://doi.org/10.1093/gbe/evag035).
The data presented here include the protein, exon-and-intron, and exon-only alignments for 12 enamel and bone proteins (AHSG, ALB, AMBN, AMELX, AMELY, AMTN, COL17A1, ENAM, MMP20, ODAM, COL1A1, COL1A2).
In detail, three main folders are found here:
a) Intro_Exon_Protein_Entropies_and_Alignments (with 3 subfolders) ~ Corresponds to "Informational content: exons, introns and proteins" section of methods
- Homind_Data_Type_Entropy_Raw_Data. Contains all the raw fasta files in 3 subfolders, one for each data type: amino acids, exons-and-introns and exons-only. Each of the 3 subfolders contains the fasta sequences for multiple individuals from each of the following 4 hominid species: Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii.
- Homind_Data_Type_Entropy_ALIGNMENTS. Contains 3 fasta files for each of the 12 enamel and bone proteins/genes. Each of the 3 fasta files contains the aligned sequences of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii) for a specific data type: amino acids, exons-and-introns or exons-only. Additional files include the R script Calc_Entropy_of_Fasta_IEP.r, which calculates the entropy of an alignment, and the bash script Calc_Entropy_Introns_Exons_Proteins.sh, which runs the R script automatically over all 12 genes.
- Entropy_Measurements. Contains 3 fasta files for each of the 12 enamel and bone proteins/genes. Each of the 3 fasta files contains the entropy scores generated from the alignment of the sequences of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii) for a specific data type: amino acids, exons-and-introns or exons-only.
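For readers who want to see what a per-column entropy score looks like before running the provided R script, here is a minimal Python sketch. It is not the author's Calc_Entropy_of_Fasta_IEP.r: it assumes plain Shannon entropy in bits, assumes that gap characters ("-") are ignored, and uses a made-up toy alignment rather than any file from the dataset. The function names are hypothetical.

```python
import math
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def column_entropies(sequences, ignore_gaps=True):
    """Return the Shannon entropy (in bits) of every alignment column."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    entropies = []
    for column in zip(*sequences):
        # Assumption: gaps carry no information and are dropped.
        symbols = [c for c in column if not (ignore_gaps and c == "-")]
        counts = Counter(symbols)
        total = sum(counts.values())
        h = 0.0
        for n in counts.values():
            p = n / total
            h -= p * math.log2(p)
        entropies.append(h)
    return entropies


# Toy alignment (made-up sequences, not taken from the dataset).
toy = """>Homo_sapiens
MKTA
>Pan_troglodytes
MKTA
>Gorilla_gorilla
MRTA
>Pongo_abelii
MRSA
"""
aln = read_fasta(toy)
scores = column_entropies(list(aln.values()))
print([round(s, 3) for s in scores])  # [0.0, 1.0, 0.811, 0.0]
```

A fully conserved column scores 0, while a column split evenly between two residues scores 1 bit. The provided R script may use a different alphabet, logarithm base or gap treatment, so its values need not match this sketch.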
b) DNA_vs_Protein_Alignments (2 subfolders) ~ Corresponds to "Incomplete lineage sorting, DNA and proteins" section of methods
- DNA_Data_Hominid_Reference_Alignments. Contains 1 folder (GENE_TREES_ENAMEL) with the results of running a phylogenetic analysis on the DNA gene alignments for each of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii) and each of the 12 enamel and bone genes. Additionally, the R script Tree_Dist.r can be used to compare the generated phylogenetic trees with a model reference tree (Hominid_Tree.txt) for these 4 species.
- Protein_Data_Hominid_Reference_Alignments. Contains 12 folders, one for each gene, with the phylogenetic results for the protein sequences of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii). The phylogenetic trees can be re-generated using the Generate_Trees_PhyML.sh bash script. Additionally, the R script Tree_Dist.r can be used to compare the generated phylogenetic trees with a model reference tree (Hominid_Tree.txt) for these 4 species. The python script Topology_Per_Gene.py can be used to generate the main component of Figure 2 of the main text.
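To illustrate what comparing a gene tree with a reference tree involves, the following Python sketch computes a Robinson-Foulds style distance between two Newick strings by counting the non-trivial splits they do not share. It is not the author's Tree_Dist.r, whose distance measure is not described here. The Newick strings are illustrative assumptions (the contents of Hominid_Tree.txt are not shown in this description), and the function names are hypothetical.

```python
import re


def leaf_sets(newick):
    """Return (all leaf names, leaf set under each parenthesised group)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, clades, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok in ",;":
            pass
        elif prev != ")":
            # A leaf label; text right after ')' is a support value or
            # branch length and is skipped.
            name = tok.split(":")[0].strip()
            if name and stack:
                leaves.add(name)
                stack[-1].add(name)
        prev = tok
    return frozenset(leaves), clades


def splits(newick):
    """Return the leaf set and the non-trivial unrooted splits of a tree."""
    leaves, clades = leaf_sets(newick)
    found = set()
    for clade in clades:
        other = leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            found.add(frozenset((clade, other)))
    return leaves, found


def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same species")
    return len(splits_a ^ splits_b)


# Illustrative trees (assumed, not copied from the dataset).
reference = "(((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla),Pongo_abelii);"
concordant = ("(((Homo_sapiens:0.01,Pan_troglodytes:0.01):0.005,"
              "Gorilla_gorilla:0.02):0.03,Pongo_abelii:0.05);")
discordant = "(((Homo_sapiens,Gorilla_gorilla),Pan_troglodytes),Pongo_abelii);"

print(rf_distance(reference, concordant))  # 0
print(rf_distance(reference, discordant))  # 2
```

With only 4 species there is a single non-trivial split per tree, so the distance is either 0 (same topology as the reference) or 2 (a discordant topology, as expected under incomplete lineage sorting).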
c) Protein_Entropy_Workflow (multiple subfolders) ~ Corresponds to "Entropy and evolutionary conservation rates" section of methods
- A third folder is also present. This folder contains the workflow for the more in-depth entropy and evolutionary rates calculations of the 12 enamel and collagen proteins.
The difference between these calculations and the ones of a) Homind_Data_Type_Entropy_Raw_Data is that here, the calculations are performed over dozens of individuals for each species instead of a single representative. While there are multiple subfolders here, the most important one is the Snakefile described below.
- Snakefile. This file is a Snakemake script which can automatically reproduce the analysis and the results plotted in Figure 3 of the corresponding manuscript. This file can be executed with 'snakemake -j8 -F' (8 parallel jobs, forcing all rules to be re-run), provided that Snakemake is installed on your machine.
NOTE ON PREREQUISITES
While a number of scripts need only a few prerequisites and can be run on any computer (e.g. the R or python scripts), I recommend using a conda environment on a Linux machine, which allows the user to execute any file provided here. Conda users can install all prerequisites for running these files with 'conda create -n Entropy -c conda-forge -c bioconda biopython r-bio3d snakemake' and then activate the created "Entropy" conda environment with 'conda activate Entropy'.
Files
Hominid_ProteinExon_Intron_ILS.zip (4.2 MB)
md5:5f3e8d235a26d28676f8091ca2d7b13b
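The md5 checksum listed above can be used to confirm that the archive downloaded intact. A minimal Python sketch, assuming the zip file sits in the current working directory under its listed name (the helper name md5_of_file is hypothetical):

```python
import hashlib
import os

# Checksum and file name as listed in the Files section.
EXPECTED_MD5 = "5f3e8d235a26d28676f8091ca2d7b13b"
ARCHIVE = "Hominid_ProteinExon_Intron_ILS.zip"  # assumed local path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if os.path.exists(ARCHIVE):
    ok = md5_of_file(ARCHIVE) == EXPECTED_MD5
    print("checksum matches" if ok else "checksum MISMATCH")
else:
    print(f"{ARCHIVE} not found in the current directory")
```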
Additional details
Dates
- Available: 2023-11-03 (Files were uploaded)