Dataset Open Access

Protein language model embeddings and predictions of the human proteome

Christian Dallago; Burkhard Rost


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.5047020</identifier>
  <creators>
    <creator>
      <creatorName>Christian Dallago</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-4650-6181</nameIdentifier>
      <affiliation>Technical University of Munich</affiliation>
    </creator>
    <creator>
      <creatorName>Burkhard Rost</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-0179-8424</nameIdentifier>
      <affiliation>Technical University of Munich</affiliation>
    </creator>
  </creators>
  <titles>
    <title>Protein language model embeddings and predictions of the human proteome</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2021</publicationYear>
  <subjects>
    <subject>protein subcellular location</subject>
    <subject>protein embeddings</subject>
    <subject>protein language models</subject>
    <subject>protein secondary structure</subject>
    <subject>protein prediction</subject>
    <subject>human proteome</subject>
  </subjects>
  <dates>
    <date dateType="Issued">2021-06-30</date>
  </dates>
  <resourceType resourceTypeGeneral="Dataset"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/5047020</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo" resourceTypeGeneral="Preprint">10.1101/2020.07.12.199554</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo" resourceTypeGeneral="JournalArticle">10.1093/nar/gkab354</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo" resourceTypeGeneral="Preprint">10.1101/2021.04.25.441334</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo" resourceTypeGeneral="JournalArticle">10.1002/cpz1.113</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5047019</relatedIdentifier>
  </relatedIdentifiers>
  <version>2021.06.09</version>
  <rightsList>
    <rights rightsURI="https://opensource.org/licenses/afl-3.0">Academic Free License v3.0</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on&amp;nbsp;2021.06.09),&amp;nbsp;computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).&lt;/p&gt;

&lt;p&gt;Additionally:&lt;/p&gt;

&lt;p&gt;- Sequence-level&amp;nbsp;predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)&lt;/p&gt;

&lt;p&gt;- Residue-level three-state secondary structure predictions (helix, sheet, or other) using models reported&amp;nbsp;in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Files included:&lt;/p&gt;

&lt;p&gt;- human.fasta --&amp;gt; FASTA-formatted human protein sequences from SwissProt&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;DSSP3_human_ProtT5Sec.fasta --&amp;gt; Secondary structure predictions in three states for each residue of each protein&amp;nbsp;in human.fasta. &amp;quot;H&amp;quot; stands for Helix; &amp;quot;E&amp;quot; stands for Sheet; &amp;quot;C&amp;quot; stands for Other.&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;subcell_human_LA_ProtT5.csv --&amp;gt; Subcellular location (10 states) and membrane-boundness (2 states)&amp;nbsp;for each protein in human.fasta&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;embeddings_file.h5 --&amp;gt; per-residue embeddings of sequences in human.fasta. Each dataset&amp;nbsp;in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the &amp;quot;original_id&amp;quot; attribute. See&amp;nbsp;https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;reduced_embeddings_file.h5 --&amp;gt; per-sequence embeddings of sequences in human.fasta (obtained by mean-pooling the residue-embeddings along the length dimension of the protein sequence). Each dataset&amp;nbsp;in the .h5 file represents a protein sequence and contains a vector of size 1024 (meaning, each sequence has the same dimension).&lt;/p&gt;</description>
  </descriptions>
</resource>
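
The abstract above describes DSSP3_human_ProtT5Sec.fasta as holding one three-state string ("H", "E", "C") per protein. As a minimal sketch of how such a file might be summarized, the following standard-library Python reads FASTA-style records and computes the fraction of residues in each state. It assumes a plain FASTA layout (a ">" header line followed by one or more lines of state letters), which the record does not spell out. The function names `read_fasta` and `state_fractions` are hypothetical helpers and are not part of bio_embeddings.

```python
import io
from collections import Counter
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream.

    Assumption: headers start with ">" and the state string may span
    several lines, as in ordinary FASTA files.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def state_fractions(states: str) -> Dict[str, float]:
    """Return the fraction of residues predicted as H, E, and C."""
    counts = Counter(states)
    total = len(states)
    return {s: (counts[s] / total if total else 0.0) for s in "HEC"}


if __name__ == "__main__":
    # Toy stand-in for the real file; with the dataset downloaded, this
    # could be replaced by open("DSSP3_human_ProtT5Sec.fasta").
    demo = io.StringIO(">example_protein\nHHHH\nEECC\n")
    for name, states in read_fasta(demo):
        print(name, state_fractions(states))
```

The same `read_fasta` helper should also work for human.fasta, so the length of each prediction string can be checked against the length of the corresponding sequence.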