Dataset Open Access

Protein language model embeddings and predictions of the human proteome

Christian Dallago; Burkhard Rost


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:contributor>Christian Dallago</dc:contributor>
  <dc:contributor>Burkhard Rost</dc:contributor>
  <dc:creator>Christian Dallago</dc:creator>
  <dc:creator>Burkhard Rost</dc:creator>
  <dc:date>2021-06-30</dc:date>
  <dc:description>Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on 2021.06.09), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).

Additionally:

- Sequence-level predictions of subcellular localization in 10 classes using Light Attention (LA) (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)

- Residue-level three-state secondary structure predictions (helix, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

 

Files included:

- human.fasta --&gt; FASTA-formatted human sequences from SwissProt

- DSSP3_human_ProtT5Sec.fasta --&gt; Secondary structure predictions in three states for each residue of each protein in human.fasta. "H" stands for Helix; "E" stands for Sheet; "C" stands for Other.

- subcell_human_LA_ProtT5.csv --&gt; Subcellular location (10 states) and membrane-boundness (2 states) for each protein in human.fasta

- embeddings_file.h5 --&gt; per-residue embeddings of sequences in human.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the "original_id" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.

- reduced_embeddings_file.h5 --&gt; per-sequence embeddings of sequences in human.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset in the .h5 file represents a protein sequence and contains a vector of size 1024, so every sequence embedding has the same dimension regardless of sequence length.</dc:description>
  <dc:identifier>https://zenodo.org/record/5047020</dc:identifier>
  <dc:identifier>10.5281/zenodo.5047020</dc:identifier>
  <dc:identifier>oai:zenodo.org:5047020</dc:identifier>
  <dc:relation>doi:10.1101/2020.07.12.199554</dc:relation>
  <dc:relation>doi:10.1093/nar/gkab354/6276913</dc:relation>
  <dc:relation>doi:10.1101/2021.04.25.441334</dc:relation>
  <dc:relation>doi:10.1002/cpz1.113</dc:relation>
  <dc:relation>doi:10.5281/zenodo.5047019</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://opensource.org/licenses/afl-3.0</dc:rights>
  <dc:subject>protein subcellular location</dc:subject>
  <dc:subject>protein embeddings</dc:subject>
  <dc:subject>protein language models</dc:subject>
  <dc:subject>protein secondary structure</dc:subject>
  <dc:subject>protein prediction</dc:subject>
  <dc:subject>human proteome</dc:subject>
  <dc:title>Protein language model embeddings and predictions of the human proteome</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>
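
The description above says that DSSP3_human_ProtT5Sec.fasta stores one three-state string per protein, with "H" for helix, "E" for sheet and "C" for other. As a minimal sketch of how such a file might be summarised, the following standard-library Python reads FASTA-style records and reports the fraction of residues in each state. It assumes the file uses conventional ">" header lines followed by one or more lines of state characters; the helper names `read_fasta` and `state_fractions` and the toy record are illustrative and are not part of the dataset or of bio_embeddings.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Assumes conventional '>' header lines; the body may span several lines.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def state_fractions(states: str) -> Dict[str, float]:
    """Fraction of residues predicted as H (helix), E (sheet), C (other)."""
    counts = Counter(states)
    total = len(states)
    return {s: (counts[s] / total if total else 0.0) for s in "HEC"}


# Toy record standing in for one entry of DSSP3_human_ProtT5Sec.fasta;
# the header and state string are invented for illustration.
example = [">toy_protein", "CCHHHHEECC"]
for header, states in read_fasta(example):
    print(header, state_fractions(states))
```

To run this on the real file, an open file handle can be passed in place of the toy list, for example `with open("DSSP3_human_ProtT5Sec.fasta") as fh: records = list(read_fasta(fh))`. Because the records are read as generic FASTA, the same reader should also work on human.fasta under the same header assumption.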
                  All versions  This version
Views             184           184
Downloads         114           114
Data volume       93.9 GB       93.9 GB
Unique views      163           163
Unique downloads  90            90
