Data and results for reproducing the Entropy and ILS calculations described in "Assessing the potential of ancient protein sequences in the study of hominid evolution"  (https://www.biorxiv.org/content/10.1101/2025.04.08.647730v2.abstract).
The data presented here include the protein, exon-and-intron, and exon-only alignments for 12 enamel and bone proteins (AHSG, ALB, AMBN, AMELX, AMELY, AMTN, COL17A1, ENAM, MMP20, ODAM, COL1A1, COL1A2).

In detail, two main folders are found here:

a) Intro_Exon_Protein_Entropies_and_Alignments (with 3 subfolders)

** Homind_Data_Type_Entropy_Raw_Data. Contains all the raw FASTA files in 3 subfolders, one for each data type: amino acids, exons-and-introns and exons-only. Each of the 3 subfolders contains the FASTA sequences for multiple individuals from each of the following 4 hominid species: Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii.
** Homind_Data_Type_Entropy_ALIGNMENTS. Contains 3 FASTA files for each of the 12 enamel and bone proteins/genes. Each of the 3 FASTA files contains the aligned sequences of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii) for a specific data type: amino acids, exons-and-introns and exons-only. Additional files include the R script Calc_Entropy_of_Fasta_IEP.r, which calculates the entropy of an alignment, and the bash script Calc_Entropy_Introns_Exons_Proteins.sh, which runs the R script automatically over all 12 genes.
** Entropy_Measurements. Contains 3 fasta files for each of the 12 enamel and bone proteins/genes. Each of the 3 files contains the entropy scores generated from the alignment of sequences of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii) for a specific data type: amino acids, exons-and-introns and exons-only.
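
For readers who want to see what a per-column alignment entropy looks like, here is a minimal Python sketch. It is not a port of Calc_Entropy_of_Fasta_IEP.r: the function names (read_fasta, column_entropies), the toy alignment, the choice of log base 2 and the decision to ignore gap characters are all assumptions made for illustration, and the R script may normalise or treat gaps differently.

```python
import math
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def column_entropies(sequences, gap_chars="-"):
    """Shannon entropy (in bits) of each alignment column, ignoring gaps."""
    sequences = list(sequences)
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    entropies = []
    for column in zip(*sequences):
        counts = Counter(c.upper() for c in column if c not in gap_chars)
        total = sum(counts.values())
        h = 0.0
        for n in counts.values():
            p = n / total
            h -= p * math.log2(p)
        entropies.append(h)  # an all-gap column is reported as 0.0
    return entropies


# Toy alignment, invented for illustration (not taken from the repository).
TOY_ALIGNMENT = """\
>Homo_sapiens
MKA
>Pan_troglodytes
MKA
>Gorilla_gorilla
MRA
>Pongo_abelii
MR-
"""

if __name__ == "__main__":
    aligned = read_fasta(TOY_ALIGNMENT)
    print(column_entropies(aligned.values()))
```

In the toy alignment the first and third columns are invariant (entropy 0), while the second column is split evenly between K and R (entropy 1 bit).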


b) DNA_vs_Protein_Alignments (2 subfolders)

** DNA_Data_Hominid_Reference_Alignments. Contains 1 folder (GENE_TREES_ENAMEL) with the results of running a phylogenetic analysis on the DNA gene alignments of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii) for each of the 12 enamel and bone genes. Additionally, the R script Tree_Dist.r can be used to compare the generated phylogenetic trees with a model reference tree (Hominid_Tree.txt) for these 4 species.
** Protein_Data_Hominid_Reference_Alignments. Contains 12 folders, one for each gene, with the phylogenetic results for the protein sequences of the 4 hominid species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii). The phylogenetic trees can be regenerated using the Generate_Trees_PhyML.sh bash script. Additionally, the R script Tree_Dist.r can be used to compare the generated phylogenetic trees with a model reference tree (Hominid_Tree.txt) for these 4 species. The Python script Topology_Per_Gene.py can be used to generate the main component of Figure 2 of the main text.
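
The comparison of a gene tree against a reference topology can be sketched in Python as follows. This is an illustration only and does not reproduce Tree_Dist.r, which may use a different distance measure. The sketch counts the unrooted splits (bipartitions) that differ between two Newick trees, i.e. a Robinson-Foulds style distance. The function names, the REFERENCE string (the commonly accepted hominid topology, assumed here rather than read from Hominid_Tree.txt) and the example gene trees are hypothetical.

```python
def parse_newick(newick):
    """Parse a well-formed Newick string into nested tuples of leaf names.

    Branch lengths, support values and internal labels are discarded.
    """
    text = "".join(newick.split()).rstrip(";")
    pos = 0

    def read_label():
        # Consume a label and optional ":length", return the label part.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].split(":")[0]

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # consume ")"
            read_label()  # skip internal label / support / branch length
            return tuple(children)
        return read_label()

    return parse_node()


def splits(tree):
    """Return the non-trivial unrooted splits of a tree as a set."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        clades.add(members)
        return members

    all_leaves = leaves(tree)
    found = set()
    for clade in clades:
        rest = all_leaves - clade
        if len(clade) >= 2 and len(rest) >= 2:
            found.add(frozenset([clade, rest]))  # unordered pair of sides
    return found


def rf_distance(newick_a, newick_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(parse_newick(newick_a)) ^ splits(parse_newick(newick_b)))


# Assumed reference topology; the content of Hominid_Tree.txt may differ.
REFERENCE = "(((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla),Pongo_abelii);"

if __name__ == "__main__":
    concordant = "((Pan_troglodytes:0.01,Homo_sapiens:0.01):0.005,Gorilla_gorilla:0.02,Pongo_abelii:0.04);"
    discordant = "(((Homo_sapiens,Gorilla_gorilla),Pan_troglodytes),Pongo_abelii);"
    print(rf_distance(REFERENCE, concordant), rf_distance(REFERENCE, discordant))
```

With only 4 species, a fully resolved unrooted tree has a single non-trivial split, so the distance is 0 when a gene tree agrees with the reference and 2 when it supports one of the two alternative groupings (for example Homo with Gorilla).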

c) Protein_Entropy_Workflow (multiple subfolders)

** A third folder is also present. It contains the workflow for the more in-depth entropy and evolutionary rate calculations of the 12 enamel and collagen proteins. The difference between these calculations and those of a) Homind_Data_Type_Entropy_Raw_Data is that here the calculations are performed over dozens of individuals for each species instead of a single representative. While there are multiple subfolders here, the most important item is the Snakefile described below.

** Snakefile. This file is a Snakemake script that automatically reproduces the analysis and the results plotted in Figure 3 of the corresponding manuscript. Conda users can install all prerequisites for running this file with '' conda create -n Entropy -c conda-forge -c bioconda biopython r-bio3d snakemake '', then activate the created conda environment with '' conda activate Entropy '' and finally execute the Snakefile with '' snakemake -j8 -F ''.