Dataset Open Access

Protein language model embeddings and predictions of the human proteome

Christian Dallago; Burkhard Rost


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.5047020", 
  "title": "Protein language model embeddings and predictions of the human proteome", 
  "issued": {
    "date-parts": [
      [
        2021, 
        6, 
        30
      ]
    ]
  }, 
  "abstract": "<p>Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on&nbsp;2021.06.09)&nbsp;computed using bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).</p>\n\n<p>Additionally:</p>\n\n<p>- Sequence-level&nbsp;predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)</p>\n\n<p>- Residue-level three-state secondary structure prediction (alpha, sheet or other) using models reported&nbsp;in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)</p>\n\n<p>&nbsp;</p>\n\n<p>Files included:</p>\n\n<p>- human.fasta --&gt; FASTA-formatted sequences of human from SwissProt</p>\n\n<p>-&nbsp;DSSP3_human_ProtT5Sec.fasta --&gt; Secondary structure predictions in three states for each residue of each protein&nbsp;in human.fasta. &quot;H&quot; stands for Helix; &quot;E&quot; stands for Sheet; &quot;C&quot; stands for Other.</p>\n\n<p>-&nbsp;subcell_human_LA_ProtT5.csv --&gt; Subcellular location (10 states) and membrane-boundness (2 states)&nbsp;for each protein in human.fasta</p>\n\n<p>-&nbsp;embeddings_file.h5 --&gt; per-residue embeddings of sequences in human.fasta. Each dataset&nbsp;in the .h5 file represents a protein sequence and contains a matrix of shape Lx1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the &quot;original_id&quot; attribute. See&nbsp;https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file</p>\n\n<p>-&nbsp;reduced_embeddings_file.h5 --&gt; per-sequence embeddings of sequences in human.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset&nbsp;in the .h5 file represents a protein sequence and contains a vector of size 1024 (that is, every sequence embedding has the same dimension).</p>", 
  "author": [
    {
      "family": "Christian Dallago"
    }, 
    {
      "family": "Burkhard Rost"
    }
  ], 
  "version": "2021.06.09", 
  "type": "dataset", 
  "id": "5047020"
}
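
The abstract links to a notebook for opening the .h5 files but gives no pointer for the secondary structure file. The sketch below shows one way DSSP3_human_ProtT5Sec.fasta could be read with the Python standard library and summarised as the fraction of residues in each of the three states (H, E, C). It assumes a conventional FASTA layout (header lines starting with ">", predictions possibly wrapped over several lines). The function names read_fasta and state_fractions and the toy records are illustrative and are not part of the dataset or of bio_embeddings.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, body) pairs from FASTA-formatted lines.

    Assumes headers start with ">" and that a record body may be
    wrapped over several lines; blank lines are skipped.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def state_fractions(states: str) -> Dict[str, float]:
    """Fraction of residues labelled H (helix), E (sheet) and C (other)."""
    counts = Counter(states)
    total = len(states)
    return {s: (counts[s] / total if total else 0.0) for s in "HEC"}


# Toy records with made-up headers and predictions, for illustration only.
example = [
    ">protein_A",
    "CCHHHHHHCC",
    ">protein_B",
    "CEEEECC",
]
for header, states in read_fasta(example):
    print(header, state_fractions(states))
```

For the real file, an open file handle can be passed in place of the toy list, for example `with open("DSSP3_human_ProtT5Sec.fasta") as handle: ...`, assuming the file has been downloaded to the working directory.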
                  All versions   This version
Views             184            184
Downloads         114            114
Data volume       93.9 GB        93.9 GB
Unique views      163            163
Unique downloads  90             90
