{
  "DOI": "10.5281/zenodo.5047020",
  "abstract": "Per-residue and per-sequence embeddings of the human proteome (SwissProt, organism Human, downloaded on 2021.06.09), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).\n\nAdditionally:\n\n- Sequence-level predictions of subcellular localization in 10 classes using Light Attention (LA) (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)\n\n- Residue-level three-state secondary structure predictions (helix, sheet, or other) using the models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)\n\nFiles included:\n\n- human.fasta --> FASTA-formatted human sequences from SwissProt\n\n- DSSP3_human_ProtT5Sec.fasta --> Three-state secondary structure predictions for each residue of each protein in human.fasta. \"H\" stands for Helix; \"E\" stands for Sheet; \"C\" stands for Other.\n\n- subcell_human_LA_ProtT5.csv --> Subcellular location (10 states) and membrane-boundness (2 states) for each protein in human.fasta\n\n- embeddings_file.h5 --> Per-residue embeddings of the sequences in human.fasta. Each dataset in the .h5 file represents one protein sequence and contains a matrix of shape L x 1024, where L is the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the \"original_id\" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.\n\n- reduced_embeddings_file.h5 --> Per-sequence embeddings of the sequences in human.fasta, obtained by mean-pooling the per-residue embeddings along the length dimension of the protein sequence. Each dataset in the .h5 file represents one protein sequence and contains a vector of size 1024, so every sequence has an embedding of the same dimension.",
  "author": [
    {
      "family": "Dallago",
      "given": "Christian"
    },
    {
      "family": "Rost",
      "given": "Burkhard"
    }
  ],
  "id": "5047020",
  "issued": {
    "date-parts": [
      [
        "2021",
        "06",
        "30"
      ]
    ]
  },
  "publisher": "Zenodo",
  "title": "Protein language model embeddings and predictions of the human proteome",
  "type": "dataset",
  "version": "2021.06.09"
}