Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project data set

Vasicek, Jakub

doi:10.5281/zenodo.12671237

Published July 8, 2024 | Version 1.1.0

Dataset Open

Vasicek, Jakub (Researcher)¹

1. University of Bergen

Contributors

Project leader:

Vaudel, Marc¹,⁴

Project members:

Researcher:

Vašíček, Jakub¹

Supervisors:

1. University of Bergen
2. University of Rostock
3. KTH Royal Institute of Technology
4. Norwegian Institute of Public Health

Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the 1000 Genomes Project, aligned with the GRCh38 genome build (https://www.internationalgenome.org/data-portal/data-collection/grch38). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts. The complete configuration file for each ProHap run is attached to this repository.

This data set contains six compressed directories, five representing the superpopulations included in the 1000 Genomes Project (https://catalog.coriell.org/1/NHGRI/Collections/1000-Genomes-Project-Collection/1000-Genomes-Project), and one created using all the samples included in the 1000 Genomes data set:

AFR - African
AMR - American
EUR - European
SAS - South Asian
EAS - East Asian
ALL - all participants in the 1000 Genomes Project

Each of the directories contains the following files:

F1: The concatenated FASTA file, ready to be used with search engines. It contains the following:
- Protein haplotype sequences obtained by ProHap, using alleles with at least 1 % frequency within the selected population
- Reference proteome as per Ensembl v.110
- Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)
The file is provided in two formats, full and simplified. The simplified FASTA contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the simplified FASTA file.
F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes
F3: Translations of haplotype cDNA sequences, before merging with the reference proteome

For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.
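As a quick sanity check after unpacking one of the directories, the FASTA file can be read with a few lines of Python. This is only a minimal sketch: it assumes the standard FASTA layout (a `>` header line followed by one or more sequence lines), and the function names `read_fasta` and `count_entries` are illustrative, not part of ProHap. The file names inside each archive and the exact header fields are not stated here, so the sketch treats the header as an opaque string; see the wiki page above for the actual layout.

```python
import gzip


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_entries(path):
    """Count FASTA records in a plain or gzip-compressed file (hypothetical helper)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in read_fasta(handle))
```

Calling `count_entries` on the full and on the simplified FASTA of the same directory should give the same number if, as described above, the two formats differ only in their headers.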

For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.

When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0

Files (2.4 GB)

Name	MD5	Size
240527_ProHap_ALL.tar.gz	0979e06586bb480c54fbf63ed06d46ed	686.6 MB
240530_ProHap_AFR.tar.gz	89f57415de9bfefc735af1c211db6de9	557.5 MB
240530_ProHap_AMR.tar.gz	2c6d27c1c34db68397b3a3f5f61c426e	281.4 MB
240530_ProHap_EAS.tar.gz	4793be833b2e5d8d469d427d105ee0c1	256.0 MB
240530_ProHap_EUR.tar.gz	e4ec178f421b2a02ef8f1209dd077120	271.1 MB
240530_ProHap_SAS.tar.gz	78902067745fd273934f462bb4200ffd	309.2 MB
config_AFR_240530.yaml	83902649eb7dec5b7996c5c568b1979a	1.4 kB
config_ALL_240527.yaml	634a9a52309fe3b8bafce52a7159b816	1.4 kB
config_AMR_240530.yaml	f0f2a1ac1b895a03212bddc192e04a57	1.4 kB
config_EAS_240530.yaml	04151814ffbea172012ecc73af1c1a6f	1.4 kB
config_EUR_240530.yaml	12f6fcb28870e0f5332df5b3c246f988	1.4 kB
config_SAS_240530.yaml	4bcc4aad8f9b2502646e9b659dc323ae	1.4 kB
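
The MD5 checksums listed above can be used to confirm that an archive was downloaded intact. The following is a minimal sketch using only the Python standard library; the checksums are copied from the listing, while the names `EXPECTED_MD5`, `md5_of_file`, and `verify_download` are illustrative and assume the archive is saved locally under its original file name.

```python
import hashlib
from pathlib import Path

# Checksums of the six archives, copied from the file listing above.
EXPECTED_MD5 = {
    "240527_ProHap_ALL.tar.gz": "0979e06586bb480c54fbf63ed06d46ed",
    "240530_ProHap_AFR.tar.gz": "89f57415de9bfefc735af1c211db6de9",
    "240530_ProHap_AMR.tar.gz": "2c6d27c1c34db68397b3a3f5f61c426e",
    "240530_ProHap_EAS.tar.gz": "4793be833b2e5d8d469d427d105ee0c1",
    "240530_ProHap_EUR.tar.gz": "e4ec178f421b2a02ef8f1209dd077120",
    "240530_ProHap_SAS.tar.gz": "78902067745fd273934f462bb4200ffd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the checksum listed for its name."""
    name = Path(path).name
    if name not in expected:
        raise KeyError(f"no checksum listed for {name}")
    return md5_of_file(path) == expected[name]
```

For example, `verify_download("240530_ProHap_EUR.tar.gz")` should return `True` for an intact copy of the European archive; a `False` result suggests a truncated or corrupted download.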

Additional details

Related works

Is derived from: Dataset: https://www.internationalgenome.org/data-portal/data-collection/grch38

Funding

The Research Council of Norway
Bioinformatics for Proteogenomics - looking up the answer in the back of the book 301178

Dates

Created: 2023-12-20
Updated: 2024-02-16
Updated: 2024-05-30

