Protein language model embeddings and predictions of the human proteome

Christian Dallago; Burkhard Rost

doi:10.5281/zenodo.5047020

Published June 30, 2021 | Version 2021.06.09

Dataset Open


1. Technical University of Munich

Contributors

Contact person: Christian Dallago¹

Supervisor: Burkhard Rost¹

Residue and sequence embeddings of the human proteome (SwissProt entries for organism Human, downloaded on 2021.06.09), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).

Additionally:

- Sequence-level predictions of subcellular localization in 10 classes using Light Attention (LA) (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)

- Residue-level three-state secondary structure predictions (helix, sheet or other) using the models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

Files included:

- human.fasta --> FASTA-formatted human sequences from SwissProt

- DSSP3_human_ProtT5Sec.fasta --> Secondary structure predictions in three states for each residue of each protein in human.fasta. "H" stands for Helix; "E" stands for Sheet; "C" stands for Other.

- subcell_human_LA_ProtT5.csv --> Subcellular location (10 states) and membrane-boundness (2 states) for each protein in human.fasta

- embeddings_file.h5 --> per-residue embeddings of the sequences in human.fasta. Each dataset in the .h5 file represents one protein sequence and contains a matrix of shape L x 1024, where L is the length of the protein sequence. Datasets are indexed by integers; the original sequence identifier (from the FASTA header) is available through the "original_id" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.

- reduced_embeddings_file.h5 --> per-sequence embeddings of the sequences in human.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset in the .h5 file represents one protein sequence and contains a vector of size 1024, so every sequence has an embedding of the same dimension regardless of its length.
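The layout described above (one dataset per protein, integer keys, an "original_id" attribute) could be read roughly as follows. This is only a minimal sketch that assumes the h5py and NumPy packages are installed; the helper names iter_embeddings and mean_pool are hypothetical and not part of bio_embeddings, and the linked notebook remains the reference for opening the files.

```python
import h5py
import numpy as np


def iter_embeddings(path):
    """Yield (original_id, embedding) pairs, one protein at a time.

    Hypothetical helper: reads datasets lazily so that the large
    per-residue file never has to fit in memory at once.
    """
    with h5py.File(path, "r") as handle:
        for key, dataset in handle.items():
            # Assumption: fall back to the integer key if the
            # "original_id" attribute is missing.
            original_id = dataset.attrs.get("original_id", key)
            if isinstance(original_id, bytes):
                original_id = original_id.decode("utf-8")
            yield str(original_id), dataset[:]


def mean_pool(residue_embedding):
    """Average an (L, 1024) matrix over the length axis -> (1024,) vector."""
    return np.asarray(residue_embedding).mean(axis=0)


# Example usage (file names taken from this record):
# for original_id, embedding in iter_embeddings("reduced_embeddings_file.h5"):
#     print(original_id, embedding.shape)
```

Applying mean_pool to a matrix from embeddings_file.h5 should give a vector comparable to the matching entry in reduced_embeddings_file.h5, although exact numerical agreement is not guaranteed by this sketch.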

Files (46.7 GB)

Name	MD5	Size
DSSP3_human_ProtT5Sec.fasta	63eb4cec5465cbfd4cac930fd5db6ee7	13.7 MB
embeddings_file.h5	372f4e7b6099288b816ffb73659e469d	46.6 GB
human.fasta	313c10a2a28606ad36a85eeb73a5399d	13.6 MB
reduced_embeddings_file.h5	e8ae0c5a74bd13ba2e77d1d302dba083	92.0 MB
subcell_human_LA_ProtT5.csv	fadaf9b20a0a70b175db77cb084a8b95	1.0 MB
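Because embeddings_file.h5 is a large download, it may be worth checking each file against the MD5 checksums listed above before use. The following is a minimal sketch using only the Python standard library; the function names md5_of_file and verify_file are hypothetical, and the checksums are copied from the file listing of this record.

```python
import hashlib

# Checksums as listed for this record.
EXPECTED_MD5 = {
    "DSSP3_human_ProtT5Sec.fasta": "63eb4cec5465cbfd4cac930fd5db6ee7",
    "embeddings_file.h5": "372f4e7b6099288b816ffb73659e469d",
    "human.fasta": "313c10a2a28606ad36a85eeb73a5399d",
    "reduced_embeddings_file.h5": "e8ae0c5a74bd13ba2e77d1d302dba083",
    "subcell_human_LA_ProtT5.csv": "fadaf9b20a0a70b175db77cb084a8b95",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()


# Example usage, assuming the file sits in the working directory:
# print(verify_file("human.fasta", EXPECTED_MD5["human.fasta"]))
```

Reading in chunks keeps memory use constant, which matters for the 46.6 GB embeddings file.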

Additional details

Is supplement to: Preprint: 10.1101/2020.07.12.199554 (DOI); Journal article: 10.1093/nar/gkab354/6276913 (DOI); Preprint: 10.1101/2021.04.25.441334 (DOI); Journal article: 10.1002/cpz1.113 (DOI)

