Dataset Open Access

Protein language model embeddings and predictions of the human proteome

Christian Dallago; Burkhard Rost


JSON-LD (schema.org) Export

{
  "description": "<p>Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on&nbsp;2021.06.09)&nbsp;computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).</p>\n\n<p>Additionally:</p>\n\n<p>- Sequence-level&nbsp;predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)</p>\n\n<p>- Residue-level three-state secondary structure predictions (helix, sheet, or other) using models reported&nbsp;in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)</p>\n\n<p>&nbsp;</p>\n\n<p>Files included:</p>\n\n<p>- human.fasta --&gt; FASTA-formatted human sequences from SwissProt</p>\n\n<p>-&nbsp;DSSP3_human_ProtT5Sec.fasta --&gt; Secondary structure predictions in three states for each residue of each protein&nbsp;in human.fasta. &quot;H&quot; stands for Helix; &quot;E&quot; stands for Sheet; &quot;C&quot; stands for Other.</p>\n\n<p>-&nbsp;subcell_human_LA_ProtT5.csv --&gt; Subcellular location (10 states) and membrane-boundness (2 states)&nbsp;for each protein in human.fasta</p>\n\n<p>-&nbsp;embeddings_file.h5 --&gt; per-residue embeddings of sequences in human.fasta. Each dataset&nbsp;in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the &quot;original_id&quot; attribute. See&nbsp;https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.</p>\n\n<p>-&nbsp;reduced_embeddings_file.h5 --&gt; per-sequence embeddings of sequences in human.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset&nbsp;in the .h5 file represents a protein sequence and contains a vector of size 1024 (that is, every sequence embedding has the same dimension regardless of sequence length).</p>", 
  "license": "https://opensource.org/licenses/afl-3.0", 
  "creator": [
    {
      "affiliation": "Technical University of Munich", 
      "@id": "https://orcid.org/0000-0003-4650-6181", 
      "@type": "Person", 
      "name": "Christian Dallago"
    }, 
    {
      "affiliation": "Technical University of Munich", 
      "@id": "https://orcid.org/0000-0003-0179-8424", 
      "@type": "Person", 
      "name": "Burkhard Rost"
    }
  ], 
  "url": "https://zenodo.org/record/5047020", 
  "datePublished": "2021-06-30", 
  "keywords": [
    "protein subcellular location", 
    "protein embeddings", 
    "protein language models", 
    "protein secondary structure", 
    "protein prediction", 
    "human proteome"
  ], 
  "version": "2021.06.09", 
  "contributor": [
    {
      "affiliation": "Technical University of Munich", 
      "@id": "https://orcid.org/0000-0003-4650-6181", 
      "@type": "Person", 
      "name": "Christian Dallago"
    }, 
    {
      "affiliation": "Technical University of Munich", 
      "@id": "https://orcid.org/0000-0003-0179-8424", 
      "@type": "Person", 
      "name": "Burkhard Rost"
    }
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/DSSP3_human_ProtT5Sec.fasta", 
      "encodingFormat": "fasta", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/embeddings_file.h5", 
      "encodingFormat": "h5", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/human.fasta", 
      "encodingFormat": "fasta", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/reduced_embeddings_file.h5", 
      "encodingFormat": "h5", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/subcell_human_LA_ProtT5.csv", 
      "encodingFormat": "csv", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.5047020", 
  "@id": "https://doi.org/10.5281/zenodo.5047020", 
  "@type": "Dataset", 
  "name": "Protein language model embeddings and predictions of the human proteome"
}
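
The description above states that DSSP3_human_ProtT5Sec.fasta holds one predicted state ("H", "E" or "C") per residue of each protein in human.fasta. A minimal sketch of how the two files might be combined to summarize per-protein secondary structure composition follows. It assumes, without confirmation from the record, that both files use identical FASTA headers; the function names `parse_fasta`, `state_fractions` and `summarize` are hypothetical helpers, and the toy records at the end are invented purely for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def state_fractions(states: str) -> Dict[str, float]:
    """Return the fraction of residues predicted as H, E and C."""
    if not states:
        return {state: 0.0 for state in "HEC"}
    counts = Counter(states)
    return {state: counts.get(state, 0) / len(states) for state in "HEC"}


def summarize(seq_lines: Iterable[str],
              ss_lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Map each header to its H/E/C fractions.

    Assumption: headers are identical in both files. Records whose
    header is missing from the sequence file, or whose prediction
    length differs from the sequence length, are skipped.
    """
    sequences = dict(parse_fasta(seq_lines))
    summary = {}
    for header, states in parse_fasta(ss_lines):
        sequence = sequences.get(header)
        if sequence is None or len(sequence) != len(states):
            continue
        summary[header] = state_fractions(states)
    return summary


# Toy, made-up records standing in for the two real files.
toy_sequences = [">toy_protein_1", "MKTAYIAK", ">toy_protein_2", "GGSS"]
toy_predictions = [">toy_protein_1", "CHHHHEEC", ">toy_protein_2", "CCCC"]
toy_summary = summarize(toy_sequences, toy_predictions)
print(toy_summary)
```

For the real data, open file handles can be passed in place of the toy lists, for example `summarize(open("human.fasta"), open("DSSP3_human_ProtT5Sec.fasta"))`. The length check guards against pairing a prediction with the wrong sequence if the header assumption does not hold.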
                  All versions   This version
Views             184            184
Downloads         114            114
Data volume       93.9 GB        93.9 GB
Unique views      163            163
Unique downloads  90             90
