Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project data set
Contributors
Project leader:
Project members:
Researcher:
Supervisors:
Description
Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the 1000 Genomes Project, aligned with the GRCh38 genome build (https://www.internationalgenome.org/data-portal/data-collection/grch38). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts. The complete configuration file for each ProHap run is attached to this repository.
This data set contains six compressed directories, five representing the superpopulations included in the 1000 Genomes Project (https://catalog.coriell.org/1/NHGRI/Collections/1000-Genomes-Project-Collection/1000-Genomes-Project), and one created using all the samples included in the 1000 Genomes data set:
- AFR - African
- AMR - American
- EUR - European
- SAS - South Asian
- EAS - East Asian
- ALL - all participants in the 1000 Genomes Project
Each of the directories contains the following files:
- F1: The concatenated FASTA file, ready to be used with search engines. It contains the following:
  - Protein haplotype sequences obtained by ProHap, using alleles with at least 1% frequency within the selected population
  - Reference proteome as per Ensembl v.110
  - Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)
  - The file is provided in two formats: full and simplified. The simplified FASTA contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the simplified FASTA file.
- F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes
- F3: Translations of haplotype cDNA sequences, before merging with the reference proteome
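As a quick sanity check before running a search engine, the concatenated FASTA file can be read with a few lines of Python. The following is a minimal sketch that relies only on the generic FASTA convention (a header line starting with `>` followed by sequence lines); it does not interpret the ProHap header fields, and the record names and sequences in the example are hypothetical placeholders, not entries from this data set.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarise(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of entries, total number of residues)."""
    n_entries = 0
    n_residues = 0
    for _, sequence in read_fasta(lines):
        n_entries += 1
        n_residues += len(sequence)
    return n_entries, n_residues


# Hypothetical records for illustration only.
example = [
    ">entry_1 placeholder header",
    "MKTAYIAK",
    "QRQISFVK",
    ">entry_2 placeholder header",
    "MSLLTEVE",
]
```

For a real file, pass an open text handle instead of the list, for instance `with open(path) as handle: summarise(handle)`, where `path` points to the extracted F1 file.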
For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.
For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.
When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0
Files
Total size: 2.4 GB

| MD5 checksum | Size |
|---|---|
| 0979e06586bb480c54fbf63ed06d46ed | 686.6 MB |
| 89f57415de9bfefc735af1c211db6de9 | 557.5 MB |
| 2c6d27c1c34db68397b3a3f5f61c426e | 281.4 MB |
| 4793be833b2e5d8d469d427d105ee0c1 | 256.0 MB |
| e4ec178f421b2a02ef8f1209dd077120 | 271.1 MB |
| 78902067745fd273934f462bb4200ffd | 309.2 MB |
| 83902649eb7dec5b7996c5c568b1979a | 1.4 kB |
| 634a9a52309fe3b8bafce52a7159b816 | 1.4 kB |
| f0f2a1ac1b895a03212bddc192e04a57 | 1.4 kB |
| 04151814ffbea172012ecc73af1c1a6f | 1.4 kB |
| 12f6fcb28870e0f5332df5b3c246f988 | 1.4 kB |
| 4bcc4aad8f9b2502646e9b659dc323ae | 1.4 kB |
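The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The sketch below uses Python's standard `hashlib` module; the function names `md5_of_file` and `verify_download` are illustrative, and the file path passed to them is whatever name the archive has on your system (an assumption, since it depends on how the file was saved).

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the listed checksum.

    Accepts the checksum with or without a leading "md5:" prefix.
    """
    expected = expected_md5.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

A call such as `verify_download(local_path, "md5:0979e06586bb480c54fbf63ed06d46ed")` returns `True` only when the local file matches the listed checksum. `str.removeprefix` requires Python 3.9 or later.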
Additional details
Related works
- Is derived from
- Dataset: https://www.internationalgenome.org/data-portal/data-collection/grch38 (URL)
Funding
- The Research Council of Norway
- Bioinformatics for Proteogenomics - looking up the answer in the back of the book (grant no. 301178)
Dates
- Created: 2023-12-20
- Updated: 2024-02-16
- Updated: 2024-05-30