Published June 30, 2021 | Version 2021.06.09
Dataset Open

Protein language model embeddings and predictions of the human proteome

  • 1. Technical University of Munich

Contributors

Contact person:

Supervisor:

  • 1. Technical University of Munich

Description

Residue and sequence embeddings of the human proteome (SwissProt entries for organism Human, downloaded on 2021.06.09), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).

Additionally:

- Sequence-level predictions of subcellular localization in 10 classes using Light Attention (LA) (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)

- Residue-level three-state secondary structure predictions (helix, sheet or other) using the models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)
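As a rough illustration of how per-residue three-state predictions might be consumed, the sketch below reads FASTA-style records and computes the fraction of residues in each state. It assumes one state character ("H", "E" or "C") per residue, as described for DSSP3_human_ProtT5Sec.fasta in the file list; the function names read_fasta and state_fractions are illustrative and not part of bio_embeddings.

```python
from collections import Counter
from typing import Dict, Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def state_fractions(states: str) -> Dict[str, float]:
    """Fraction of residues predicted as H (helix), E (sheet) or C (other)."""
    counts = Counter(states)
    total = len(states)
    return {s: (counts[s] / total if total else 0.0) for s in "HEC"}
```

For example, `for header, states in read_fasta("DSSP3_human_ProtT5Sec.fasta"): print(header, state_fractions(states))` would print the composition per protein, assuming the file sits in the working directory.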

 

Files included:

- human.fasta --> FASTA-formatted human protein sequences from SwissProt

- DSSP3_human_ProtT5Sec.fasta --> Secondary structure predictions in three states for each residue of each protein in human.fasta. "H" stands for Helix; "E" stands for Sheet; "C" stands for Other.

- subcell_human_LA_ProtT5.csv --> Subcellular location (10 states) and membrane-boundness (2 states) for each protein in human.fasta

- embeddings_file.h5 --> per-residue embeddings of the sequences in human.fasta. Each dataset in the .h5 file represents one protein sequence and contains a matrix of shape L x 1024, where L is the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the "original_id" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.

- reduced_embeddings_file.h5 --> per-sequence embeddings of the sequences in human.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset in the .h5 file represents one protein sequence and contains a vector of size 1024, so every protein has an embedding of the same size regardless of its length.
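The layout described for embeddings_file.h5 can be read with h5py along the following lines. This is a minimal sketch, not the official loader: it assumes h5py and numpy are installed, relies only on the integer-keyed datasets and the "original_id" attribute mentioned above, and the helper names iter_embeddings and mean_pool are made up for illustration. The mean_pool helper shows the mean-pooling step that turns an L x 1024 matrix into a single 1024-dimensional vector.

```python
import h5py
import numpy as np


def iter_embeddings(path):
    """Yield (original_id, matrix) pairs one protein at a time.

    Iterating lazily avoids loading the whole file into memory,
    which matters for a file of tens of gigabytes.
    """
    with h5py.File(path, "r") as h5:
        for key in h5.keys():
            dataset = h5[key]
            original_id = dataset.attrs["original_id"]
            # Older h5py versions may return bytes for string attributes.
            if isinstance(original_id, bytes):
                original_id = original_id.decode()
            yield original_id, dataset[:]  # read the full array


def mean_pool(residue_embeddings):
    """Average over the length dimension: (L, 1024) -> (1024,)."""
    return np.asarray(residue_embeddings).mean(axis=0)
```

A call such as `for protein_id, matrix in iter_embeddings("embeddings_file.h5"): vector = mean_pool(matrix)` would then produce one fixed-size vector per protein, which should correspond to the kind of entry stored in reduced_embeddings_file.h5.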

Files

Files (46.7 GB)

Checksum                                Size
md5:63eb4cec5465cbfd4cac930fd5db6ee7    13.7 MB
md5:372f4e7b6099288b816ffb73659e469d    46.6 GB
md5:313c10a2a28606ad36a85eeb73a5399d    13.6 MB
md5:e8ae0c5a74bd13ba2e77d1d302dba083    92.0 MB
md5:fadaf9b20a0a70b175db77cb084a8b95    1.0 MB
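The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below uses only the Python standard library; the helper name md5_of_file is illustrative, and the pairing of each file with its checksum should be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `"md5:" + md5_of_file(local_path)` against the value published for that file indicates whether the local copy matches. Reading in chunks keeps memory use low even for the 46.6 GB file.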

Additional details

Related works

Is supplement to
Preprint: 10.1101/2020.07.12.199554 (DOI)
Journal article: 10.1093/nar/gkab354/6276913 (DOI)
Preprint: 10.1101/2021.04.25.441334 (DOI)
Journal article: 10.1002/cpz1.113 (DOI)