Embeddings from protein language models predict conservation and variant effects

Marquet, Céline; Heinzinger, Michael; Olenyi, Tobias; Dallago, Christian; Erckert, Kyra; Bernhofer, Michael; Nechaev, Dmitrii

doi:10.5281/zenodo.5238537

Published August 23, 2021 | Version v1

Dataset Open

1. TUM

For this work, we used protein language model representations (embeddings) to predict sequence conservation without multiple sequence alignments (MSAs). From single sequences, embeddings alone predicted residue conservation almost as accurately as ConSeq did using MSAs (two-state Matthews correlation coefficient, MCC, of 0.596±0.006 for ProtT5 embeddings vs. 0.608±0.006 for ConSeq).
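As an illustration of the metric quoted above, the following minimal sketch computes a two-state MCC from per-residue labels. It is not the evaluation code behind the reported numbers. The function names and the `threshold` parameter are hypothetical; this description does not state the cut-off that separates the two conservation states, so the caller has to supply it.

```python
import math


def binarize(scores, threshold):
    """Map conservation scores to two states (True = conserved).

    `threshold` is a placeholder: the cut-off used for the reported
    two-state MCC is not given in this description.
    """
    return [score > threshold for score in scores]


def two_state_mcc(observed, predicted):
    """Matthews correlation coefficient for two equally long boolean lists."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # true positives
    tn = sum(1 for o, p in pairs if not o and not p)  # true negatives
    fp = sum(1 for o, p in pairs if not o and p)      # false positives
    fn = sum(1 for o, p in pairs if o and not p)      # false negatives
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

A call such as `two_state_mcc(binarize(observed_scores, t), binarize(predicted_scores, t))` would then yield one MCC value for a chosen cut-off `t`.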

ConSurf10k, the data set for the development of ProtT5cons: The method predicting residue conservation (ProtT5cons) used ConSurf-DB (Ben Chorin et al. 2020). This resource provided sequences and conservation for 89,673 proteins, all of which had experimental high-resolution three-dimensional (3D) structures available in the Protein Data Bank (PDB) (Berman et al. 2000). As standard-of-truth for the conservation prediction, we used the values from ConSurf-DB, which were generated using HMMER (Mistry et al. 2013), CD-HIT (Fu et al. 2012), and MAFFT-LINSi (Katoh and Standley 2013) to align proteins in the PDB (Burley et al. 2019). For proteins from families with over 50 proteins in the resulting MSA, an evolutionary rate at each residue position was computed and used along with the MSA to reconstruct a phylogenetic tree. The ConSurf-DB conservation scores ranged from 1 (most variable) to 9 (most conserved). We used the PISCES server (Wang and Dunbrack 2003) to redundancy-reduce the data set such that no pair of proteins had more than 25% pairwise sequence identity. We removed proteins with resolutions >2.5 Å, those shorter than 40 residues, and those longer than 10,000 residues. The resulting data set (ConSurf10k) with 10,507 proteins (or domains) was randomly partitioned into training (9,392 sequences), cross-training/validation (555), and test (519) sets.
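The resolution and length filters and the random partition described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline that produced ConSurf10k: the `Entry` class, the function names, and the `seed` parameter are hypothetical, and the PISCES redundancy reduction is not reproduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical record for one protein (or domain)."""
    pdb_id: str
    sequence: str
    resolution: float  # in Å


def passes_filters(entry, max_resolution=2.5, min_length=40, max_length=10_000):
    """Keep entries with resolution <= 2.5 Å and 40 to 10,000 residues."""
    return (
        entry.resolution <= max_resolution
        and min_length <= len(entry.sequence) <= max_length
    )


def random_partition(ids, n_val, n_test, seed=0):
    """Shuffle the IDs and cut them into train, validation, and test lists.

    The seed is an assumption made for reproducibility of this sketch.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

To reuse the published split, read the uploaded ID files rather than re-drawing a random partition.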

Uploaded data:

ConSuf10k_PDBid_seq_cons.fasta: FASTA file with PDB ID, sequence, and conservation annotation
consurf10k_test_ids.txt: text file with the IDs of the test set
consurf10k_train_ids.txt: text file with the IDs of the training set
consurf10k_val_ids.txt: text file with the IDs of the cross-training/validation set
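One possible way to load these files and recover a split is sketched below. The exact layout of the FASTA file is not specified here, so the sketch assumes three non-empty lines per record (a header starting with `>`, the sequence, then the conservation annotation) and one identifier per line in the ID files. Both are assumptions to check against the downloaded files, and all function names are hypothetical.

```python
def parse_records(lines):
    """Yield (identifier, sequence, conservation) tuples.

    Assumes three non-empty lines per record; this layout is a guess
    and should be verified against the actual file.
    """
    block = [line.strip() for line in lines if line.strip()]
    if len(block) % 3 != 0:
        raise ValueError("expected three lines per record")
    for start in range(0, len(block), 3):
        header, sequence, conservation = block[start:start + 3]
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got: {header!r}")
        # The conservation annotation is kept as a raw string because
        # its encoding is not described here.
        yield header[1:], sequence, conservation


def read_ids(lines):
    """Collect one identifier per non-empty line (assumed layout)."""
    return {line.strip() for line in lines if line.strip()}


def load_split(fasta_path, ids_path):
    """Return the records whose identifier is listed in the ID file."""
    with open(ids_path, encoding="utf-8") as handle:
        wanted = read_ids(handle)
    with open(fasta_path, encoding="utf-8") as handle:
        return [record for record in parse_records(handle) if record[0] in wanted]
```

For example, `load_split("ConSuf10k_PDBid_seq_cons.fasta", "consurf10k_test_ids.txt")` would return the test-set records, provided the headers in the FASTA file match the identifiers in the ID file exactly.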

Notes

See details in the research paper.

Files

Files (26.9 MB)

Name	MD5	Size
ConSuf10k_PDBid_seq_cons.fasta	9256a2533f7c6458b3728db592f758c2	26.8 MB
consurf10k_test_ids.txt	04f642c6ab9a769d5d3e0019523ebb34	3.6 kB
consurf10k_train_ids.txt	ffa272809bf8e987980e6b04b52fd284	65.7 kB
consurf10k_val_ids.txt	31932f77f7010303743ab6db5e45d054	3.9 kB

Additional details

Is supplement to: Journal article: 10.21203/rs.3.rs-584804/v1 (DOI)

Ben Chorin A, Masrati G, Kessel A, Narunsky A, Sprinzak J, Lahav S, Ashkenazy H, Ben‐Tal N (2020) ConSurf‐DB: An accessible repository for the evolutionary conservation patterns of the majority of PDB proteins. Protein Science 29: 258-267. doi: 10.1002/pro.3779
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Research 28: 235-242. doi: 10.1093/nar/28.1.235
Mistry J, Finn RD, Eddy SR, Bateman A, Punta M (2013) Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Research 41: e121. doi: 10.1093/nar/gkt263
Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150-3152. doi: 10.1093/bioinformatics/bts565
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30: 772-780. doi: 10.1093/molbev/mst010
Burley SK, Berman HM, Bhikadiya C, Bi C, Chen L, Di Costanzo L, Christie C, Dalenberg K, Duarte JM, Dutta S, Feng Z, Ghosh S, Goodsell DS, Green RK, Guranovic V, Guzenko D, Hudson BP, Kalro T, Liang Y, Lowe R, Namkoong H, Peisach E, Periskova I, Prlic A, Randle C, Rose A, Rose P, Sala R, Sekharan M, Shao C, Tan L, Tao YP, Valasatava Y, Voigt M, Westbrook J, Woo J, Yang H, Young J, Zhuravleva M, Zardecki C (2019) RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Research 47: D464-D474. doi: 10.1093/nar/gky1004
Wang G, Dunbrack RL, Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19: 1589-1591. doi: 10.1093/bioinformatics/btg224
