Published July 6, 2024 | Version 1.0.0
Dataset Open

Protein haplotype sequences obtained by ProHap from the Haplotype Reference Consortium Release 1.1 dataset

  • 1. University of Bergen
  • 2. University of Rostock
  • 3. KTH Royal Institute of Technology
  • 4. Norwegian Institute of Public Health

Description

Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the Haplotype Reference Consortium, Release 1.1 (https://ega-archive.org/datasets/EGAD00001002729). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts.

Release 1.1 of the HRC is provided aligned with the GRCh37 reference genome. We have performed a liftover to the GRCh38 reference using GeneBe (https://genebe.net/tools/liftover). Variants for which the reported alternative allele is considered as reference in GRCh38 were removed. A threshold of 1% minor allele frequency was applied to filter the remaining variants. After translation, a frequency threshold of 0.5% was applied to filter the resulting unique non-canonical sequences. The complete configuration file for the ProHap run is attached to this repository.
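The filtering steps described above can be sketched roughly as follows. This is an illustration only, not ProHap's actual implementation: the `Variant` record layout, the function names, the computation of minor allele frequency as the smaller of the two allele frequencies, and the inclusive (>=) thresholds are all assumptions made for the example. The configuration file attached to this repository remains the authoritative description of the run.

```python
from dataclasses import dataclass

MIN_MAF = 0.01              # 1% minor allele frequency threshold (from the description)
MIN_SEQUENCE_FREQ = 0.005   # 0.5% threshold applied after translation (from the description)


@dataclass(frozen=True)
class Variant:
    # Hypothetical record layout, not ProHap's internal representation.
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_freq: float   # frequency of the alternative allele in the panel
    grch38_ref: str   # reference allele at the lifted-over GRCh38 position


def minor_allele_frequency(alt_freq: float) -> float:
    # Assumption: biallelic site, so the minor allele is the rarer of the two.
    return min(alt_freq, 1.0 - alt_freq)


def keep_variant(variant: Variant) -> bool:
    # Drop variants whose alternative allele is the reference allele in GRCh38.
    if variant.alt == variant.grch38_ref:
        return False
    # Assumption: the threshold is inclusive.
    return minor_allele_frequency(variant.alt_freq) >= MIN_MAF


def filter_variants(variants):
    return [v for v in variants if keep_variant(v)]


def filter_sequences(sequence_freqs):
    # sequence_freqs maps each unique non-canonical sequence to its frequency.
    return {seq: freq for seq, freq in sequence_freqs.items()
            if freq >= MIN_SEQUENCE_FREQ}
```

Keeping the two thresholds as named constants makes it easy to see that they act at different stages: the first on variants before translation, the second on protein sequences after translation.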

This dataset contains one compressed directory, which contains the following files:

  • F1: The concatenated FASTA file, ready to be used with search engines. It contains the following:
    • Protein haplotype sequences obtained by ProHap
    • Reference proteome as per Ensembl v. 110
    • Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)
    • The file is provided in two formats - full and simplified. The simplified fasta contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the fasta file. 
  • F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes
  • F3: Translations of haplotype cDNA sequences, before merging with the reference proteome
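As a quick sanity check before passing the concatenated FASTA file to a search engine, its entries can be read with a few lines of Python. The sketch below is a generic FASTA reader and makes no assumption about the header layout used in F1; the two example records are invented for illustration and do not come from the dataset.

```python
import io


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, for illustration only (not taken from the dataset).
EXAMPLE = io.StringIO(">entry_1 GENE_A\nMKVLA\nGGT\n>entry_2 GENE_B\nMSTQ\n")
RECORDS = list(read_fasta(EXAMPLE))
```

The same function accepts an open file handle, for example `read_fasta(open(path))`, where `path` points to the extracted FASTA file.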

For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.

For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.

When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0

Files

Files (136.0 MB)

Name Size
md5:96740e94a100738b5ac9bc4dc3af5362 136.0 MB
md5:9138e552459a211bdaa6f81d11300c18 1.3 kB
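After downloading, a file can be compared against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module; the function names are chosen for this example, and the path passed in is whatever local file name the download was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(path, "md5:96740e94a100738b5ac9bc4dc3af5362")` should return `True` for an intact copy of the 136.0 MB file. Reading in chunks avoids loading the whole archive into memory.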

Additional details

Funding

Bioinformatics for Proteogenomics - looking up the answer in the back of the book (award no. 301178)
The Research Council of Norway

Dates

Created
2024-07-03