Hominid Palaeoproteomic Reference Dataset

Patramanis, Ioannis; Ramos Madrigal, Jazmin; Cappellini, Enrico; Racimo, Fernando

doi:10.5281/zenodo.7553844

Published November 23, 2022 | Version 0.1.6

Dataset Open

1. Section for Molecular Ecology and Evolution, Globe Institute, University of Copenhagen
2. Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen
3. Section for GeoGenetics, Globe Institute, University of Copenhagen

This dataset contains the 'Hominid Palaeoproteomic Reference Dataset'.

We used PaleoProPhyler (https://github.com/johnpatramanis/Proteomic_Pipeline) to generate a palaeoproteomic reference dataset of protein sequences from ancient and present-day hominids. Using the first two modules of PaleoProPhyler, we translated 195 publicly available whole genomes from extant hominid groups. Details on the processing of the sequences can be found in the supplementary materials of PaleoProPhyler (https://github.com/johnpatramanis/Proteomic_Pipeline/blob/main/GitHub_Tutorial/Supplementary.pdf).

We also translated 8 ancient hominin genomes from VCF files, including those of several Neanderthals and one Denisovan. Since the dataset is tailored for palaeoproteomic analyses, we chose to translate proteins that have previously been reported as present in either teeth or bone tissue. We compiled a list of 1,696 proteins from previous works and successfully translated 1,543 of them. For each protein, both the canonical and all alternative protein coding isoforms were translated, leading to a total of 10,058 protein sequences for each individual in the dataset.

Content:

The zipped archive contains four items, two FASTA files and two folders:

- PalaeoProPhyler_Publication_Data_for_Tree.fa contains all of the sequences used to generate the phylogenetic tree presented in the PaleoProPhyler manuscript.

- ALL_PROT_REFERENCE.fa contains all of the sequences generated as part of the Hominid Palaeoproteomic Reference Dataset described above.

- PER_PROTEIN is a folder containing one FASTA file for each protein in the Hominid Palaeoproteomic Reference Dataset. Each protein FASTA file holds the sequences of all individuals for that protein.

- PER_SAMPLE is a folder containing one FASTA file for each sample/individual in the Hominid Palaeoproteomic Reference Dataset. Each sample FASTA file holds the sequences of all proteins for that sample.
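As an illustration of how the FASTA files listed above might be read, here is a minimal sketch in plain Python. It assumes only the standard FASTA layout (a header line starting with ">" followed by one or more sequence lines). The header naming scheme used in this dataset is not described here, so the sample records in the demo are invented placeholders, and the function names `read_fasta` and `load_fasta` are hypothetical helpers rather than part of PaleoProPhyler.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_fasta(path: str) -> Dict[str, str]:
    """Read a whole FASTA file into a {header: sequence} dictionary."""
    with open(path) as handle:
        return dict(read_fasta(handle))


# Demo on an in-memory example. These two records are invented
# placeholders, not sequences or header names taken from the dataset.
example = io.StringIO(">sample_A\nMKLV\nTTQ\n>sample_B\nMKLVSSQ\n")
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

With the archive unpacked, a call such as `load_fasta("PER_PROTEIN/<some_protein>.fa")` would return every individual's sequence for that protein. The file name here is a placeholder, since the naming of files inside the folders is not specified above.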

Important Note: The dataset is not yet fully generated. However, around half of the total number of samples are ready and available in their folder (~/PER_SAMPLE/).

Files

Files (79.2 MB)

Name	Size	MD5
Palaeoproteomic_Reference_Dataset.zip	79.2 MB	89af2a1f360ac39ec990fdc22c3cb100
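
The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. It assumes the archive was saved under its original name in the current working directory, and `md5_of_file` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path

# Checksum as listed for Palaeoproteomic_Reference_Dataset.zip.
EXPECTED_MD5 = "89af2a1f360ac39ec990fdc22c3cb100"


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading the whole archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed location of the download; adjust to where the file was saved.
    archive = Path("Palaeoproteomic_Reference_Dataset.zip")
    if archive.exists():
        matches = md5_of_file(archive) == EXPECTED_MD5
        print("checksum OK" if matches else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```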

Additional details

Funding: European Commission, grant 861389: PUSHH – Palaeoproteomics to Unleash Studies on Human History
