Dataset Open Access

Protein language model embeddings and predictions of the human proteome

Christian Dallago; Burkhard Rost


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/DSSP3_human_ProtT5Sec.fasta"
      }, 
      "checksum": "md5:63eb4cec5465cbfd4cac930fd5db6ee7", 
      "bucket": "f6151fc6-f924-4043-b087-1661627f635b", 
      "key": "DSSP3_human_ProtT5Sec.fasta", 
      "type": "fasta", 
      "size": 13722773
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/embeddings_file.h5"
      }, 
      "checksum": "md5:372f4e7b6099288b816ffb73659e469d", 
      "bucket": "f6151fc6-f924-4043-b087-1661627f635b", 
      "key": "embeddings_file.h5", 
      "type": "h5", 
      "size": 46557163296
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/human.fasta"
      }, 
      "checksum": "md5:313c10a2a28606ad36a85eeb73a5399d", 
      "bucket": "f6151fc6-f924-4043-b087-1661627f635b", 
      "key": "human.fasta", 
      "type": "fasta", 
      "size": 13611513
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/reduced_embeddings_file.h5"
      }, 
      "checksum": "md5:e8ae0c5a74bd13ba2e77d1d302dba083", 
      "bucket": "f6151fc6-f924-4043-b087-1661627f635b", 
      "key": "reduced_embeddings_file.h5", 
      "type": "h5", 
      "size": 92008224
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b/subcell_human_LA_ProtT5.csv"
      }, 
      "checksum": "md5:fadaf9b20a0a70b175db77cb084a8b95", 
      "bucket": "f6151fc6-f924-4043-b087-1661627f635b", 
      "key": "subcell_human_LA_ProtT5.csv", 
      "type": "csv", 
      "size": 1020481
    }
  ], 
  "owners": [
    36297
  ], 
  "doi": "10.5281/zenodo.5047020", 
  "stats": {
    "version_unique_downloads": 90.0, 
    "unique_views": 163.0, 
    "views": 184.0, 
    "version_views": 184.0, 
    "unique_downloads": 90.0, 
    "version_unique_views": 163.0, 
    "volume": 93913604574.0, 
    "version_downloads": 114.0, 
    "downloads": 114.0, 
    "version_volume": 93913604574.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5281/zenodo.5047020", 
    "conceptdoi": "https://doi.org/10.5281/zenodo.5047019", 
    "bucket": "https://zenodo.org/api/files/f6151fc6-f924-4043-b087-1661627f635b", 
    "conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.5047019.svg", 
    "html": "https://zenodo.org/record/5047020", 
    "latest_html": "https://zenodo.org/record/5047020", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.5047020.svg", 
    "latest": "https://zenodo.org/api/records/5047020"
  }, 
  "conceptdoi": "10.5281/zenodo.5047019", 
  "created": "2021-06-30T20:49:55.629151+00:00", 
  "updated": "2021-07-01T01:48:19.176382+00:00", 
  "conceptrecid": "5047019", 
  "revision": 3, 
  "id": 5047020, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.5047020", 
    "description": "<p>Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on&nbsp;2021.06.09)&nbsp;computed using bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).</p>\n\n<p>Additionally:</p>\n\n<p>- Sequence-level&nbsp;predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)</p>\n\n<p>- Residue-level three state secondary structure prediction (alpha, sheet or other) using models reported&nbsp;in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)</p>\n\n<p>&nbsp;</p>\n\n<p>Files included:</p>\n\n<p>- human.fasta --&gt; FASTA-formatted sequences of human from SwissProt</p>\n\n<p>-&nbsp;DSSP3_human_ProtT5Sec.fasta --&gt; Secondary structure predictions in three states for each residue of each protein&nbsp;in human.fasta. &quot;H&quot; stands for Helix; &quot;E&quot; stands for Sheet; &quot;C&quot; stands for Other.</p>\n\n<p>-&nbsp;subcell_human_LA_ProtT5.csv --&gt; Subcellular location (10 states) and membrane-boundness (2 states)&nbsp;for each protein in human.fasta</p>\n\n<p>-&nbsp;embeddings_file.h5 --&gt; per-residue embeddings of sequences in human.fasta. Each dataset&nbsp;in the .h5 file represents a protein sequence and contains a matrix of shape Lx1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the &quot;original_id&quot; attribute. See&nbsp;https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file</p>\n\n<p>-&nbsp;reduced_embeddings_file.h5 --&gt; per-sequence embeddings of sequences in human.fasta (obtained by mean-pooling the residue-embeddings along the length dimension of the protein sequence). Each dataset&nbsp;in the .h5 file represents a protein sequence and contains a vector of size 1024 (meaning, each sequence has the same dimension).</p>", 
    "contributors": [
      {
        "orcid": "0000-0003-4650-6181", 
        "affiliation": "Technical University of Munich", 
        "type": "ContactPerson", 
        "name": "Christian Dallago"
      }, 
      {
        "orcid": "0000-0003-0179-8424", 
        "affiliation": "Technical University of Munich", 
        "type": "Supervisor", 
        "name": "Burkhard Rost"
      }
    ], 
    "title": "Protein language model embeddings and predictions of the human proteome", 
    "license": {
      "id": "AFL-3.0"
    }, 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "5047019"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "5047020"
          }
        }
      ]
    }, 
    "version": "2021.06.09", 
    "keywords": [
      "protein subcellular location", 
      "protein embeddings", 
      "protein language models", 
      "protein secondary structure", 
      "protein prediction", 
      "human proteome"
    ], 
    "publication_date": "2021-06-30", 
    "creators": [
      {
        "orcid": "0000-0003-4650-6181", 
        "affiliation": "Technical University of Munich", 
        "name": "Christian Dallago"
      }, 
      {
        "orcid": "0000-0003-0179-8424", 
        "affiliation": "Technical University of Munich", 
        "name": "Burkhard Rost"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "doi", 
        "identifier": "10.1101/2020.07.12.199554", 
        "relation": "isSupplementTo", 
        "resource_type": "publication-preprint"
      }, 
      {
        "scheme": "doi", 
        "identifier": "10.1093/nar/gkab354/6276913", 
        "relation": "isSupplementTo", 
        "resource_type": "publication-article"
      }, 
      {
        "scheme": "doi", 
        "identifier": "10.1101/2021.04.25.441334", 
        "relation": "isSupplementTo", 
        "resource_type": "publication-preprint"
      }, 
      {
        "scheme": "doi", 
        "identifier": "10.1002/cpz1.113", 
        "relation": "isSupplementTo", 
        "resource_type": "publication-article"
      }, 
      {
        "scheme": "doi", 
        "identifier": "10.5281/zenodo.5047019", 
        "relation": "isVersionOf"
      }
    ]
  }
}
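
Each entry under "files" in the record above carries a "checksum" field of the form "md5:<hex digest>" and a "size" in bytes. Since embeddings_file.h5 is about 46.6 GB, it may be worth checking a download against that field before use. The following is a minimal sketch of such a check, not an official Zenodo tool: the function name verify_checksum and its chunk_size parameter are assumptions made for illustration, and only the Python standard library is used.

```python
import hashlib


def verify_checksum(path, checksum_field, chunk_size=1 << 20):
    """Check a local file against a checksum string such as "md5:63eb4c...".

    The part before the colon names the hash algorithm and the part after
    it is the expected hex digest. The file is read in chunks so that very
    large files do not have to fit in memory.
    """
    algorithm, _, expected = checksum_field.partition(":")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

Assuming human.fasta has been downloaded into the working directory, verify_checksum("human.fasta", "md5:313c10a2a28606ad36a85eeb73a5399d") should return True for an intact copy, using the checksum value listed for that file in the record.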
