Protein language model embeddings and predictions for the fly proteome (FlyBase)

Dallago, Christian; Marquet, Céline; Rost, Burkhard

doi:10.5281/zenodo.6322184

Published March 2, 2022 | Version 2022.03.02

Dataset Open

1. Technical University of Munich

Residue and sequence embeddings of the fly (Drosophila melanogaster) proteome (FlyBase for organism Drosophila melanogaster, downloaded on 2022.03.01), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3). To open the embeddings file, please see the notebook at https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html. The embeddings are indexed by numbers according to the mapping file (mapping_file.csv) in this dataset. All other results share the same mapping (for instance, accessing index "0" in the variation prediction results returns the results for the sequence "FBpp0304622").
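
As a minimal sketch of the index lookup described above, the snippet below parses a mapping file into a dictionary from integer index to original identifier. The column layout is an assumption: the first column is taken to hold the index and a column named `original_id` to hold the FlyBase identifier, so the real header of mapping_file.csv should be checked before relying on this. The function name `load_mapping` and the sample rows are illustrative. Only the pairing of index 0 with FBpp0304622 comes from the description above; the second identifier is hypothetical.

```python
import csv
import io


def load_mapping(handle, id_column="original_id"):
    """Return {index: original identifier} from a CSV file-like object.

    Assumption: the first column holds the integer index and `id_column`
    holds the FASTA identifier. Adjust both to the real header.
    """
    reader = csv.reader(handle)
    header = next(reader)
    id_position = header.index(id_column)
    mapping = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        mapping[int(row[0])] = row[id_position]
    return mapping


# Illustrative stand-in for mapping_file.csv (layout assumed, see above).
SAMPLE = ",original_id\n0,FBpp0304622\n1,FBpp_hypothetical\n"

mapping = load_mapping(io.StringIO(SAMPLE))
print(mapping[0])
```

For the real file, the same function could be called on `open("mapping_file.csv", newline="")` instead of the in-memory sample.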

Additionally:

- Sequence-level predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)

- Residue-level three state secondary structure prediction (alpha, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

- Residue-level prediction of conservation (in 9 states) and of variation effect (from 0 [no-effect] to 1 [effect]) using VESPAl (https://doi.org/10.1007/s00439-021-02411-y)

Files included:

- dmel-all-translation-r6.44.fasta --> FASTA-formatted sequences of Drosophila melanogaster from FlyBase

- mapping_file.csv --> A CSV file mapping the identifiers used in the following files (from 0 to 30737) to the identifiers in the FlyBase fasta file (dmel-all-translation-r6.44.fasta).

- DSSP3_fly_ProtT5Sec.fasta --> Secondary structure predictions in three states for each residue of each protein in dmel-all-translation-r6.44.fasta. "H" stands for Helix; "E" stands for Sheet; "C" stands for Other.

- subcell_fly_LA_ProtT5.csv --> Subcellular location (10 states) and membrane-boundness (2 states) for each protein in dmel-all-translation-r6.44.fasta

- embeddings_file.h5 --> per-residue embeddings of sequences in dmel-all-translation-r6.44.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the "original_id" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.

- reduced_embeddings_file.h5 --> per-sequence embeddings of sequences in dmel-all-translation-r6.44.fasta (obtained by mean-pooling the residue-embeddings along the length dimension of the protein sequence). Each dataset in the .h5 file represents a protein sequence and contains a vector of size 1024 (meaning, each sequence has the same dimension).

- conspred_probs.h5 --> per-residue conservation probabilities (softmax output) in 9 classes for sequences in dmel-all-translation-r6.44.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 9 x L, with L being the length of the protein sequence and 9 being the number of conservation classes (index 0 = very variable; index 8 = very conserved).

- vespal_SAVeffect_fly.zip --> zipped .h5 file of per-residue variation effect predictions for sequences in dmel-all-translation-r6.44.fasta on a scale from 0 (neutral) to 1 (effect). A value of -1 indicates a substitution to the wild-type (WT) residue. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 20 x L, with L being the length of the protein sequence and the 20 rows holding the predicted variation score for each residue substitution (amino acids in the following order: "ALGVSREDTIPKFQNYMHWC", meaning that index 0 = substitution to "A", index 1 = substitution to "L", and so on).
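
The matrix layouts described in the list above can be made concrete with a short sketch. It uses plain nested lists in place of the HDF5 datasets, so it runs without the data; a dataset read into memory (for example with h5py) could be passed to the same helpers. The function names and toy values are hypothetical, and positions are assumed to be 0-based. Only the shapes (L x 1024, 9 x L, 20 x L), the amino acid order, and the meaning of the indices come from the file descriptions.

```python
AA_ORDER = "ALGVSREDTIPKFQNYMHWC"  # row order of the 20 x L effect matrix


def mean_pool(residue_embeddings):
    """Average an L x D matrix over its L rows, giving a vector of size D."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[d] for row in residue_embeddings) / length for d in range(dim)]


def conservation_class(probs, position):
    """Most probable class at a 0-based position of a 9 x L matrix.

    Class 0 means very variable, class 8 means very conserved.
    """
    column = [probs[c][position] for c in range(len(probs))]
    return max(range(len(column)), key=lambda c: column[c])


def substitution_score(effects, position, target_aa):
    """Predicted effect of mutating the residue at `position` to `target_aa`."""
    return effects[AA_ORDER.index(target_aa)][position]


# Toy data for a protein of length L = 2 (values are made up).
toy_embeddings = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]  # L x 3 instead of L x 1024
toy_probs = [[0.0, 0.0] for _ in range(9)]           # 9 x L
toy_probs[8][0] = 1.0                                # position 0: very conserved
toy_probs[2][1] = 1.0                                # position 1: class 2
toy_effects = [[0.5, 0.5] for _ in range(20)]        # 20 x L
toy_effects[AA_ORDER.index("A")][0] = -1.0           # wild-type residue at position 0
toy_effects[AA_ORDER.index("L")][1] = 0.9            # position 1 mutated to L

pooled = mean_pool(toy_embeddings)
top_class = conservation_class(toy_probs, 0)
score = substitution_score(toy_effects, 1, "L")
```

Mean-pooling here mirrors how the per-sequence vectors in reduced_embeddings_file.h5 are described as being derived from the per-residue embeddings.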

Files (86.1 GB)

Name	MD5	Size
conspred_probs.h5	28dbabfd33a07d1fda9b01717317f6c1	743.3 MB
dmel-all-translation-r6.44.fasta	ea2c7b44eda0044e51405e2e0b708eb4	34.3 MB
DSSP3_fly_ProtT5Sec.fasta	16a5709966bf4a6ebd928b10a18e109a	34.5 MB
embeddings_file.h5	b5a4e3ca07a34f89ad8ee5cd93d5a250	83.3 GB
mapping_file.csv	ab240f65c0205aa29943b83e5164bb99	669.0 kB
reduced_embeddings_file.h5	53f9523a3c53c1113f908cb9ffdb64a6	138.5 MB
subcell_fly_LA_ProtT5.csv	33edb34b4f0797410a3e84f5c815875f	1.4 MB
vespal_SAVeffect_fly.zip	376fdea5664b25fad1214db84d10b973	1.9 GB
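
The MD5 checksums listed with each file can be used to check that a download completed intact. The following is a minimal sketch using Python's standard hashlib module; the helper names `md5_of_file` and `matches_checksum` are hypothetical. The file is read in chunks so that large files such as embeddings_file.h5 do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("mapping_file.csv", "ab240f65c0205aa29943b83e5164bb99")` would be expected to return True for an intact copy of the mapping file, assuming it was saved under that name in the working directory.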

Additional details

Is supplement to: Journal article: 10.1109/TPAMI.2021.3095381 (DOI); Journal article: 10.1093/nar/gkab354/6276913 (DOI); Journal article: 10.1093/bioadv/vbab035 (DOI); Journal article: 10.1002/cpz1.113 (DOI); Journal article: 10.1007/s00439-021-02411-y (DOI)

	All versions	This version
Views	478	478
Downloads	715	715
Data volume	6.8 TB	6.8 TB
