Dataset Open Access

Protein language model embeddings and predictions of the human proteome

Christian Dallago; Burkhard Rost


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">protein subcellular location</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">protein embeddings</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">protein language models</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">protein secondary structure</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">protein prediction</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">human proteome</subfield>
  </datafield>
  <controlfield tag="005">20210701014819.0</controlfield>
  <controlfield tag="001">5047020</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Technical University of Munich</subfield>
    <subfield code="0">(orcid)0000-0003-0179-8424</subfield>
    <subfield code="a">Burkhard Rost</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Technical University of Munich</subfield>
    <subfield code="0">(orcid)0000-0003-4650-6181</subfield>
    <subfield code="4">prc</subfield>
    <subfield code="a">Christian Dallago</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Technical University of Munich</subfield>
    <subfield code="0">(orcid)0000-0003-0179-8424</subfield>
    <subfield code="4">dgs</subfield>
    <subfield code="a">Burkhard Rost</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">13722773</subfield>
    <subfield code="z">md5:63eb4cec5465cbfd4cac930fd5db6ee7</subfield>
    <subfield code="u">https://zenodo.org/record/5047020/files/DSSP3_human_ProtT5Sec.fasta</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">46557163296</subfield>
    <subfield code="z">md5:372f4e7b6099288b816ffb73659e469d</subfield>
    <subfield code="u">https://zenodo.org/record/5047020/files/embeddings_file.h5</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">13611513</subfield>
    <subfield code="z">md5:313c10a2a28606ad36a85eeb73a5399d</subfield>
    <subfield code="u">https://zenodo.org/record/5047020/files/human.fasta</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">92008224</subfield>
    <subfield code="z">md5:e8ae0c5a74bd13ba2e77d1d302dba083</subfield>
    <subfield code="u">https://zenodo.org/record/5047020/files/reduced_embeddings_file.h5</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">1020481</subfield>
    <subfield code="z">md5:fadaf9b20a0a70b175db77cb084a8b95</subfield>
    <subfield code="u">https://zenodo.org/record/5047020/files/subcell_human_LA_ProtT5.csv</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-06-30</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:5047020</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Technical University of Munich</subfield>
    <subfield code="0">(orcid)0000-0003-4650-6181</subfield>
    <subfield code="a">Christian Dallago</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Protein language model embeddings and predictions of the human proteome</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://opensource.org/licenses/afl-3.0</subfield>
    <subfield code="a">Academic Free License v3.0</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on&amp;nbsp;2021.06.09)&amp;nbsp;computed using bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).&lt;/p&gt;

&lt;p&gt;Additionally:&lt;/p&gt;

&lt;p&gt;- Sequence-level&amp;nbsp;predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)&lt;/p&gt;

&lt;p&gt;- Residue-level three-state secondary structure prediction (helix, sheet, or other) using models reported&amp;nbsp;in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Files included:&lt;/p&gt;

&lt;p&gt;- human.fasta --&amp;gt; FASTA-formatted human protein sequences from SwissProt&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;DSSP3_human_ProtT5Sec.fasta --&amp;gt; Secondary structure predictions in three states for each residue of each protein&amp;nbsp;in human.fasta. &amp;quot;H&amp;quot; stands for Helix; &amp;quot;E&amp;quot; stands for Sheet; &amp;quot;C&amp;quot; stands for Other.&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;subcell_human_LA_ProtT5.csv --&amp;gt; Subcellular location (10 states) and membrane-boundness (2 states)&amp;nbsp;for each protein in human.fasta&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;embeddings_file.h5 --&amp;gt; per-residue embeddings of sequences in human.fasta. Each dataset&amp;nbsp;in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the &amp;quot;original_id&amp;quot; attribute. See&amp;nbsp;https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.&lt;/p&gt;

&lt;p&gt;-&amp;nbsp;reduced_embeddings_file.h5 --&amp;gt; per-sequence embeddings of sequences in human.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset&amp;nbsp;in the .h5 file represents a protein sequence and contains a vector of size 1024 (so every sequence embedding has the same dimension).&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a">10.1101/2020.07.12.199554</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a">10.1093/nar/gkab354/6276913</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a">10.1101/2021.04.25.441334</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a">10.1002/cpz1.113</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.5047019</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.5047020</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
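
Each 856 field in the record above carries a file's size in bytes (subfield "s"), its MD5 checksum (subfield "z", prefixed with "md5:") and its download URL (subfield "u"). The following is a minimal sketch, not part of the record itself, of how those entries could be read from the MARC21 XML and used to check a downloaded file. It uses only the Python standard library; the function names list_files and md5_matches are hypothetical and chosen purely for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def list_files(marc_xml):
    """Return one dict (url, size, md5) per 856 datafield in the record.

    list_files is a hypothetical helper name, not part of any library.
    """
    root = ET.fromstring(marc_xml)
    files = []
    for field in root.findall(".//marc:datafield[@tag='856']", MARC_NS):
        # Map each subfield code ("s", "z", "u") to its text content.
        sub = {
            s.get("code"): (s.text or "")
            for s in field.findall("marc:subfield", MARC_NS)
        }
        md5 = sub.get("z", "")
        if md5.startswith("md5:"):
            md5 = md5[len("md5:"):]
        files.append({
            "url": sub.get("u", ""),
            "size": int(sub.get("s", "0")),
            "md5": md5,
        })
    return files


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Hash the file at `path` in chunks and compare with `expected_md5`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

Reading in chunks matters here because embeddings_file.h5 is listed at roughly 46.6 GB, which is too large to hash comfortably in a single read.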