Published July 10, 2022 | Version v1
Dataset Open

Prediction and Visualization of Human Transmembrane Proteins using AlphaFold and Protein Language Models

  • 1. Technical University Munich

Description

Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D structures of predicted human transmembrane proteins (TMPs) together with their predicted membrane embedding. The method TMbed [1], which is based on the protein language model ProtT5 [2], predicted 4,967 TMPs in the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to its length). For these proteins, we retrieved AlphaFold [4] structures from AlphaFoldDB [5] and kept those with an average per-residue confidence score (pLDDT) above 90 (on a scale of 0 to 100). This resulted in the 496 proteins of TMvis, which are listed in "TMvis496.fasta". The membrane embedding was predicted with ANVIL [6], PPM3 [7], and the per-residue TMbed predictions. Because the three methods are based on different approaches, we publish the results of all three. The figure “TMvis_project_overview.png” provides a graphical overview of each step described above.
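The pLDDT filter described above can be sketched as follows. AlphaFold PDB files store the per-residue pLDDT in the B-factor column, so averaging that column over the C-alpha atoms gives one value per residue. This is a minimal illustration, not the authors' pipeline: the function names `mean_plddt` and `passes_filter` are hypothetical, and averaging over C-alpha atoms only is an assumption.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column (pLDDT in AlphaFold files) over C-alpha atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


def passes_filter(pdb_text: str, threshold: float = 90.0) -> bool:
    """Return True if the average pLDDT is strictly above the threshold."""
    return mean_plddt(pdb_text) > threshold
```

Applied to the text of an “AF-UniprotID-F1-model_v2.pdb” file, `passes_filter` would report whether that structure meets the cut-off of 90 used for TMvis.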

TMvis Folder Structure: TMvis is separated into “alpha”, containing predicted alpha-helical TMPs, and “beta”, containing predicted beta-barrel TMPs. Within these folders, each protein has its own folder, named after its unique UniProt ID. Each protein folder contains:
- “UniprotID.fasta” with the UniProt ID, the sequence, and the TMbed per-residue prediction
- “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure in PDB format
- “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure in mmCIF format
- “AF-UniprotID-F1-model_v2_ANVIL.pdb” with the predicted ANVIL membrane embedding
- “AF-UniprotID-F1-model_v2_ppm.pdb” with the predicted PPM3 membrane embedding

TMvis                           
│
├── alpha                                 
│   │                                 
│   ├── A0A087X1C5                                 
│   │   ├── A0A087X1C5.fasta                                 
│   │   ├── AF-A0A087X1C5-F1-model_v2.pdb                                 
│   │   ├── AF-A0A087X1C5-F1-model_v2.cif                                 
│   │   ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb                                 
│   │   └── AF-A0A087X1C5-F1-model_v2_ppm.pdb
│   └── ...                                    
└── beta                                 
    └── P45880
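The naming scheme above is regular enough to build the expected file paths for any protein from its UniProt ID. The sketch below assumes the archive has been extracted to a folder called `TMvis`. The helper `protein_files` and its dictionary keys are hypothetical names chosen for illustration.

```python
from pathlib import Path


def protein_files(root: Path, group: str, uniprot_id: str) -> dict[str, Path]:
    """Build the expected paths of the five files in one protein folder."""
    if group not in ("alpha", "beta"):
        raise ValueError("group must be 'alpha' or 'beta'")
    folder = root / group / uniprot_id
    stem = f"AF-{uniprot_id}-F1-model_v2"
    return {
        "fasta": folder / f"{uniprot_id}.fasta",   # ID, sequence, TMbed labels
        "pdb": folder / f"{stem}.pdb",             # AlphaFold structure
        "cif": folder / f"{stem}.cif",             # AlphaFold structure
        "anvil": folder / f"{stem}_ANVIL.pdb",     # ANVIL membrane embedding
        "ppm": folder / f"{stem}_ppm.pdb",         # PPM3 membrane embedding
    }


if __name__ == "__main__":
    # Assumed extraction directory; adjust to the local path.
    for kind, path in protein_files(Path("TMvis"), "alpha", "A0A087X1C5").items():
        print(kind, path)
```

Checking `path.exists()` on each returned entry is a simple way to confirm that a protein folder is complete after extraction.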

TMvis visualization: Every protein in the TMvis dataset can be visualized in 3D using the Jupyter Notebook “TMvis.ipynb”. The notebook contains detailed descriptions of the membrane prediction tools ANVIL, PPM3, and TMbed, as well as the respective code. It can also visualize the per-residue confidence scores (pLDDT) of AlphaFold.
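To colour a structure by a per-residue prediction, it usually helps to collapse the label string into contiguous segments first. The sketch below does this with a run-length grouping. It is not taken from “TMvis.ipynb”: the function `label_segments` is a hypothetical helper, and the example label string (with `i`, `H`, and `o` standing for inside, transmembrane helix, and outside) is an assumption about the label alphabet.

```python
from itertools import groupby


def label_segments(labels: str) -> list[tuple[str, int, int]]:
    """Collapse a per-residue label string into (label, start, end) runs.

    Residue positions are 1-based and the end position is inclusive.
    """
    segments = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments


if __name__ == "__main__":
    # Assumed label alphabet, used only for illustration.
    for label, start, end in label_segments("iiiHHHHHooo"):
        print(f"{label}: residues {start}-{end}")
```

Each resulting segment can then be passed to a viewer as a residue range with its own colour.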

——————————————————————————————————————————————————————————————————————————

References:

[1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.

[2] ProtT5 - Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing.” IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2021.3095381.

[3] UniProt - UniProt Consortium. 2021. “UniProt: The Universal Protein Knowledgebase in 2021.” Nucleic Acids Research 49 (D1): D480–D489.

[4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.

[5] AlphaFold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.

[6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.

[7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.

——————————————————————————————————————————————————————————————————————————

License:

This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).

 

Files (166.1 MB)

md5:2e411876dd69a1536bccaca1dfa758d0 (21.9 kB)
md5:334de9b20db3c50931e1f8d77936c63d (453.5 kB)
md5:a6f10d29f1967deffb01e9858b0bd824 (165.2 MB)
md5:d2e49a0ac3d9bb37558fbda61d5d429c (370.0 kB)
md5:f159ffedc837cc183df58566c555dd54 (6.5 kB)