Published August 23, 2021 | Version v1
Dataset Open

Embeddings from protein language models predict conservation and variant effects

Description

In this work, we used protein language model representations (embeddings) to predict sequence conservation without multiple sequence alignments (MSAs). From single sequences, embeddings alone predicted residue conservation almost as accurately as ConSeq did using MSAs (two-state Matthews correlation coefficient, MCC: 0.596±0.006 for ProtT5 embeddings vs. 0.608±0.006 for ConSeq).
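
To make the two-state MCC concrete, the sketch below computes it from binary labels in plain Python. The function names `binarize` and `mcc` are illustrative, and the threshold that turns a conservation score into "conserved" vs. "variable" is an assumption chosen for the example (score > 5); the research paper defines the threshold actually used.

```python
import math


def binarize(scores, threshold=5):
    """Map conservation scores to two states: 1 = conserved, 0 = variable.

    The threshold is an assumption for illustration only.
    """
    return [1 if score > threshold else 0 for score in scores]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for two equally long 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Toy per-residue scores for one hypothetical protein (not real data).
observed = binarize([9, 8, 2, 1, 7, 3])
predicted = binarize([8, 6, 3, 2, 4, 1])
print(mcc(observed, predicted))
```

The MCC ranges from -1 (every residue misclassified) through 0 (no better than chance) to 1 (perfect agreement), which is why it is often preferred over accuracy for unbalanced two-state problems.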

ConSurf10k, the data set for the development of ProtT5cons: The method predicting residue conservation (ProtT5cons) was developed on ConSurf-DB (Ben Chorin et al. 2020). This resource provided sequences and conservation scores for 89,673 proteins, all with experimental high-resolution three-dimensional (3D) structures available in the Protein Data Bank (PDB) (Berman et al. 2000). As standard-of-truth for the conservation prediction, we used the values from ConSurf-DB, which were generated using HMMER (Mistry et al. 2013), CD-HIT (Fu et al. 2012), and MAFFT-LINSi (Katoh and Standley 2013) to align proteins in the PDB (Burley et al. 2019). For proteins from families with over 50 proteins in the resulting MSA, an evolutionary rate at each residue position was computed and used along with the MSA to reconstruct a phylogenetic tree. The ConSurf-DB conservation scores ranged from 1 (most variable) to 9 (most conserved).

The PISCES server (Wang and Dunbrack 2003) was used to redundancy-reduce the data set such that no pair of proteins had more than 25% pairwise sequence identity. We removed proteins with resolutions >2.5 Å, those shorter than 40 residues, and those longer than 10,000 residues. The resulting data set (ConSurf10k) with 10,507 proteins (or domains) was randomly partitioned into training (9,392 sequences), cross-training/validation (555), and test (519) sets.
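
The resolution and length filters and the random partition described above can be sketched as follows. This is a minimal illustration, not the code used to build ConSurf10k: the `Entry` record, the function names, and the random seed are all hypothetical, and the redundancy reduction with PISCES is assumed to have happened beforehand.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical record for one protein chain."""
    pdb_id: str
    sequence: str
    resolution: float  # in Angstrom


def passes_filters(entry, max_resolution=2.5, min_length=40, max_length=10_000):
    """Keep entries at or below 2.5 A resolution with 40 to 10,000 residues."""
    return (
        entry.resolution <= max_resolution
        and min_length <= len(entry.sequence) <= max_length
    )


def random_split(ids, n_val, n_test, seed=0):
    """Shuffle the IDs and cut them into train, validation, and test lists.

    The seed is an arbitrary choice that makes the sketch reproducible.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    test_ids = shuffled[:n_test]
    val_ids = shuffled[n_test:n_test + n_val]
    train_ids = shuffled[n_test + n_val:]
    return train_ids, val_ids, test_ids


# Toy entries (not real PDB data).
entries = [
    Entry("toy1", "A" * 120, 1.8),
    Entry("toy2", "A" * 30, 1.5),   # too short
    Entry("toy3", "A" * 200, 3.1),  # resolution too low
    Entry("toy4", "A" * 40, 2.5),
]
kept = [entry.pdb_id for entry in entries if passes_filters(entry)]
train_ids, val_ids, test_ids = random_split(kept, n_val=0, n_test=1)
```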

Uploaded data:

  • ConSuf10k_PDBid_seq_cons.fasta: FASTA file with PDB ID, sequence, and conservation annotation
  • consurf10k_test_ids.txt: text file with the IDs of the test set
  • consurf10k_train_ids.txt: text file with the IDs of the training set
  • consurf10k_val_ids.txt: text file with the IDs of the cross-training/validation set
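
A minimal sketch for loading the three ID files and confirming that the splits do not overlap is given below. It assumes that each file holds one ID per line, which this record does not state explicitly; the helper names `parse_ids`, `check_disjoint`, and `load_splits` are hypothetical.

```python
from pathlib import Path


def parse_ids(lines):
    """Return the non-empty lines, stripped of whitespace, as a list of IDs."""
    return [line.strip() for line in lines if line.strip()]


def check_disjoint(splits):
    """Raise ValueError if any ID occurs in more than one split."""
    seen = {}
    for name, ids in splits.items():
        for identifier in ids:
            if identifier in seen and seen[identifier] != name:
                raise ValueError(
                    f"{identifier} is in both {seen[identifier]} and {name}"
                )
            seen[identifier] = name


def load_splits(directory):
    """Read the train, val, and test ID files from a download directory."""
    directory = Path(directory)
    splits = {}
    for name in ("train", "val", "test"):
        path = directory / f"consurf10k_{name}_ids.txt"
        with path.open() as handle:
            splits[name] = parse_ids(handle)
    check_disjoint(splits)
    return splits
```

Calling `load_splits("path/to/download")` returns a dictionary with the keys `train`, `val`, and `test`; the directory name here is a placeholder.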

Notes

See details in the research paper.

Files (26.9 MB)

Checksum                               Size
md5:9256a2533f7c6458b3728db592f758c2   26.8 MB
md5:04f642c6ab9a769d5d3e0019523ebb34   3.6 kB
md5:ffa272809bf8e987980e6b04b52fd284   65.7 kB
md5:31932f77f7010303743ab6db5e45d054   3.9 kB
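
A downloaded file can be checked against the md5 checksums listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the function names are illustrative, and which checksum belongs to which file has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file with a listed value such as 'md5:<32 hex digits>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use low even for the largest file in the record.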

Additional details

Related works

Is supplement to
Journal article: 10.21203/rs.3.rs-584804/v1 (DOI)

References

  • Ben Chorin A, Masrati G, Kessel A, Narunsky A, Sprinzak J, Lahav S, Ashkenazy H, Ben‐Tal N (2020) ConSurf‐DB: An accessible repository for the evolutionary conservation patterns of the majority of PDB proteins. Protein Science 29: 258-267. doi: 10.1002/pro.3779
  • Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Research 28: 235-242. doi: 10.1093/nar/28.1.235
  • Mistry J, Finn RD, Eddy SR, Bateman A, Punta M (2013) Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Research 41: e121. doi: 10.1093/nar/gkt263
  • Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150-3152. doi: 10.1093/bioinformatics/bts565
  • Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30: 772-780. doi: 10.1093/molbev/mst010
  • Burley SK, Berman HM, Bhikadiya C, Bi C, Chen L, Di Costanzo L, Christie C, Dalenberg K, Duarte JM, Dutta S, Feng Z, Ghosh S, Goodsell DS, Green RK, Guranovic V, Guzenko D, Hudson BP, Kalro T, Liang Y, Lowe R, Namkoong H, Peisach E, Periskova I, Prlic A, Randle C, Rose A, Rose P, Sala R, Sekharan M, Shao C, Tan L, Tao YP, Valasatava Y, Voigt M, Westbrook J, Woo J, Yang H, Young J, Zhuravleva M, Zardecki C (2019) RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Research 47: D464-D474. doi: 10.1093/nar/gky1004
  • Wang G, Dunbrack RL, Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19: 1589-1591. doi: 10.1093/bioinformatics/btg224