Protein haplotype sequences obtained by ProHap from the Haplotype Reference Consortium Release 1.1 dataset
Contributors
Project leader:
Project members:
Supervisors:
Description
Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the Haplotype Reference Consortium, Release 1.1 (https://ega-archive.org/datasets/EGAD00001002729). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts.
Release 1.1 of the HRC is provided aligned with the GRCh37 reference genome. We have performed a liftover to the GRCh38 reference using GeneBe (https://genebe.net/tools/liftover). Variants for which the reported alternative allele is considered as reference in GRCh38 were removed. A threshold of 1% minor allele frequency was applied to filter the remaining variants. After translation, a frequency threshold of 0.5% was applied to filter the resulting unique non-canonical sequences. The complete configuration file for the ProHap run is attached to this repository.
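The two variant-level filters described above can be sketched as follows. This is an illustration only, not ProHap's code or the attached configuration: the `Variant` record, its field names, and the `filter_variants` function are hypothetical. It also assumes that the GRCh38 reference allele is already known for each lifted-over position, that minor allele frequency is taken as the smaller of the two allele frequencies, and that variants exactly at the threshold are kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one lifted-over variant (not a ProHap type)."""
    chrom: str
    pos: int          # position on GRCh38 after liftover
    ref: str          # reference allele as reported in the GRCh37-aligned data
    alt: str          # alternative allele as reported
    grch38_ref: str   # allele found at this position in GRCh38 (assumed known)
    alt_freq: float   # frequency of the alternative allele in the panel


def minor_allele_frequency(alt_freq):
    """Frequency of the less common of the two alleles (assumed definition)."""
    return min(alt_freq, 1.0 - alt_freq)


def filter_variants(variants, maf_threshold=0.01):
    """Drop variants whose alt allele is the GRCh38 reference, then apply the MAF cut."""
    kept = []
    for v in variants:
        # After liftover, the reported alternative allele is the reference: remove.
        if v.alt == v.grch38_ref:
            continue
        # Below the minor allele frequency threshold (1% by default): remove.
        if minor_allele_frequency(v.alt_freq) < maf_threshold:
            continue
        kept.append(v)
    return kept
```

The 0.5% threshold applied after translation works on protein sequences rather than variants, so it is not covered by this sketch.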
This dataset contains one compressed directory with the following files:
- F1: The concatenated fasta file, ready to be used with search engines. It contains the following:
  - Protein haplotype sequences obtained by ProHap
  - Reference proteome as per Ensembl v.110
  - Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)
  - The file is provided in two formats: full and simplified. The simplified fasta contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the fasta file.
- F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes
- F3: Translations of haplotype cDNA sequences, before merging with the reference proteome
For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.
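For a quick look at the contents of the fasta files, a minimal reader can be sketched as below. It assumes nothing about the header layout beyond the leading `>`; the function name `read_fasta` and the example records are hypothetical, and the actual header fields are described on the wiki page linked above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect them until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, for illustration only (not taken from the dataset).
example = [">id1 some description", "MKT", "LLV", ">id2", "MA"]
records = list(read_fasta(example))
```

An open file handle can be passed in place of the list, since the function only iterates over lines.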
For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.
When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0
Files
(136.0 MB)
File (MD5 checksum) | Size |
---|---|
md5:96740e94a100738b5ac9bc4dc3af5362 | 136.0 MB |
md5:9138e552459a211bdaa6f81d11300c18 | 1.3 kB |
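The MD5 checksums listed above can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the path of the downloaded file has to be supplied by the user.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in blocks so that large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For the 136.0 MB archive, `matches_checksum(path, "md5:96740e94a100738b5ac9bc4dc3af5362")` should return `True` when `path` points to an intact download.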
Additional details
Related works
- Is derived from
- Dataset: https://ega-archive.org/datasets/EGAD00001002729 (URL)
Funding
- Bioinformatics for Proteogenomics - looking up the answer in the back of the book 301178
- The Research Council of Norway
Dates
- Created: 2024-07-03