Published March 2, 2022 | Version 2022.03.02

Protein language model embeddings and predictions for the fly proteome (FlyBase)

  • 1. Technical University of Munich

Contributors

Contact person:

Supervisor:

  • 1. Technical University of Munich

Description

Residue and sequence embeddings of the fly (Drosophila melanogaster) proteome (FlyBase, organism Drosophila melanogaster, downloaded on 2022.03.01), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3). To open the embeddings file, please see the notebook at https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html. The embeddings are indexed by numbers according to the mapping file (mapping_file.csv) in this dataset. All other result files share the same mapping (for instance, index "0" in the variation prediction results refers to the sequence "FBpp0304622").
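The index-to-identifier lookup described above can be sketched as follows. This is a minimal illustration, not the official loader: it assumes that mapping_file.csv has a header row, that the first column holds the numeric index, and that the FlyBase identifier sits in a column named "original_id" (the column name is an assumption borrowed from the attribute name used in the .h5 files). The second row of the example data is a made-up placeholder.

```python
import csv
import io


def load_mapping(handle, id_column="original_id"):
    """Return {index_as_string: original_identifier} from an open CSV handle.

    Assumes the first column is the numeric index and that `id_column`
    (an assumed column name) holds the identifier from the FASTA header.
    """
    reader = csv.reader(handle)
    header = next(reader)
    id_position = header.index(id_column)
    return {row[0]: row[id_position] for row in reader if row}


# In-memory stand-in for mapping_file.csv. Only the first pair comes from
# the dataset description; the second row is a hypothetical placeholder.
example_csv = io.StringIO(",original_id\n0,FBpp0304622\n1,PLACEHOLDER_ID\n")
mapping = load_mapping(example_csv)
print(mapping["0"])
```

For the real file, the same function could be called as `load_mapping(open("mapping_file.csv", newline=""))`, adjusting `id_column` if the header differs.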

Additionally:

- Sequence-level predictions of subcellular localization in 10 classes using Light Attention (LA) (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)

- Residue-level three-state secondary structure prediction (helix, sheet, or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

- Residue-level prediction of conservation (in 9 states) and of variation effect (from 0 [no-effect] to 1 [effect]) using VESPAl (https://doi.org/10.1007/s00439-021-02411-y)

Files included:

- dmel-all-translation-r6.44.fasta --> FASTA-formatted sequences of Drosophila melanogaster from FlyBase

- mapping_file.csv --> A CSV file mapping the identifiers used in the following files (from 0 to 30737) to the identifiers in the FlyBase fasta file (dmel-all-translation-r6.44.fasta).

- DSSP3_fly_ProtT5Sec.fasta --> Secondary structure predictions in three states for each residue of each protein in dmel-all-translation-r6.44.fasta. "H" stands for Helix; "E" stands for Sheet; "C" stands for Other.

- subcell_fly_LA_ProtT5.csv --> Subcellular location (10 states) and membrane-boundness (2 states) for each protein in dmel-all-translation-r6.44.fasta

- embeddings_file.h5 --> per-residue embeddings of sequences in dmel-all-translation-r6.44.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape Lx1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the "original_id" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.

- reduced_embeddings_file.h5 --> per-sequence embeddings of sequences in dmel-all-translation-r6.44.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset in the .h5 file represents a protein sequence and contains a vector of size 1024 (so every sequence has an embedding of the same dimension, regardless of its length).

- conspred_probs.h5 --> per-residue conservation probability (softmax) predictions for the sequences in dmel-all-translation-r6.44.fasta in 9 classes. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 9xL, with L being the length of the protein sequence, and 9 being the number of predicted conservation classes (index 0 = very variable; index 8 = very conserved)

- vespal_SAVeffect_fly.zip --> zipped .h5 file of per-residue variation effect predictions for the sequences in dmel-all-translation-r6.44.fasta on a scale from 0 (neutral) to 1 (effect). -1 indicates a substitution to the wild-type (WT) residue. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 20xL, with L being the length of the protein sequence, and 20 being the number of possible residue substitutions, each with its predicted variation score (AAs in the following order: "ALGVSREDTIPKFQNYMHWC", meaning that index 0 = substitution of the residue to "A", index 1 = substitution to "L", and so on.)
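
The array layouts described in the file list can be illustrated with a short sketch. It is not the official loading code: the helper names are hypothetical, the arrays in the demonstration are synthetic, and the `read_dataset` helper assumes that the h5py package is installed and that datasets are keyed by the index written as a string (for example "0"). The matrix orientations (Lx1024, 9xL, 20xL) are taken from the descriptions above.

```python
import numpy as np

# Substitution order as stated in the vespal_SAVeffect_fly.zip description.
AA_ORDER = "ALGVSREDTIPKFQNYMHWC"


def mean_pool(residue_embeddings):
    """Average an (L, 1024) per-residue matrix over L into a (1024,) vector."""
    return np.asarray(residue_embeddings).mean(axis=0)


def conservation_classes(probabilities):
    """Pick the most probable of the 9 classes per residue from a (9, L) matrix."""
    return np.asarray(probabilities).argmax(axis=0)


def substitution_score(scores, position, target_aa):
    """Score for mutating the residue at zero-based `position` to `target_aa`,
    read from a (20, L) matrix whose rows follow AA_ORDER."""
    return float(np.asarray(scores)[AA_ORDER.index(target_aa), position])


def read_dataset(path, index):
    """Read one dataset and its "original_id" attribute from an .h5 file.

    Assumption: datasets are keyed by the stringified integer index.
    """
    import h5py  # imported lazily so the helpers above work without it

    with h5py.File(path, "r") as handle:
        dataset = handle[str(index)]
        return dataset[()], dataset.attrs.get("original_id")


# Synthetic demonstration data (not taken from the dataset).
embedding = np.arange(3 * 1024, dtype=float).reshape(3, 1024)
pooled = mean_pool(embedding)

probabilities = np.zeros((9, 2))
probabilities[8, 0] = 0.7  # first residue: most likely "very conserved"
probabilities[0, 1] = 0.6  # second residue: most likely "very variable"
classes = conservation_classes(probabilities)

scores = np.full((20, 5), 0.5)
scores[AA_ORDER.index("L"), 2] = 0.9
score_to_leucine = substitution_score(scores, 2, "L")
```

With the real files, a call such as `read_dataset("embeddings_file.h5", 0)` would be expected to return the Lx1024 matrix for the sequence mapped to index 0, provided the key assumption holds.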

Files (86.1 GB)

Checksum                                Size
md5:28dbabfd33a07d1fda9b01717317f6c1    743.3 MB
md5:ea2c7b44eda0044e51405e2e0b708eb4    34.3 MB
md5:16a5709966bf4a6ebd928b10a18e109a    34.5 MB
md5:b5a4e3ca07a34f89ad8ee5cd93d5a250    83.3 GB
md5:ab240f65c0205aa29943b83e5164bb99    669.0 kB
md5:53f9523a3c53c1113f908cb9ffdb64a6    138.5 MB
md5:33edb34b4f0797410a3e84f5c815875f    1.4 MB
md5:376fdea5664b25fad1214db84d10b973    1.9 GB
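
A downloaded file can be checked against its listed MD5 checksum. The sketch below uses only Python's standard hashlib module and reads in chunks so that even the largest file does not need to fit in memory; the function names are hypothetical, and the demonstration hashes an in-memory byte string rather than a file from this dataset.

```python
import hashlib
import io


def md5_of_stream(handle, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hex MD5 digest of the file at `path`."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def matches_checksum(actual_digest, listed):
    """Compare a digest with a listed value such as "md5:28dbab...".

    The optional "md5:" prefix is stripped and case is ignored.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return actual_digest.lower() == expected


# Demonstration on in-memory bytes (not a file from this dataset).
demo_digest = md5_of_stream(io.BytesIO(b"hello"))
print(matches_checksum(demo_digest, "md5:5d41402abc4b2a76b9719d911017c592"))
```

For a real download, `matches_checksum(md5_of_file(path), "md5:...")` would be called with the local path and the checksum listed for that file.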

Additional details

Related works

Is supplement to
Journal article: 10.1109/TPAMI.2021.3095381 (DOI)
Journal article: 10.1093/nar/gkab354/6276913 (DOI)
Journal article: 10.1093/bioadv/vbab035 (DOI)
Journal article: 10.1002/cpz1.113 (DOI)
Journal article: 10.1007/s00439-021-02411-y (DOI)