Zero-Shot Protein Segmentation (ZPS) Data and Embeddings
Description
uniprotkb_Human.txt
- this is a raw text file that contains a downloaded copy of UniProtKB
- this includes all reviewed human protein sequences
- we used annotations from this file to compare to ZPS predictions
uniprotkb_Human_Sequences.fasta
- this is a fasta file that contains reviewed human protein sequences
- these are the sequences we used as input to ProtT5 to generate protein embeddings
ZPS_Boundaries.tsv
- this is a tab separated file that contains the boundaries of protein segments defined by ZPS for reviewed human protein sequences
- we used zero-based indexing for the protein boundaries
ZPS_Segment_Embeddings.hdf5
- this is an HDF5 file that contains segment embeddings for the human proteome
- see "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" A. G. Sangster 2025 for definition of segment embeddings
- segment boundaries in this file are also in zero-based indexing
evaluation_data.zip includes:
disprot_functional_annotations.tsv
- this contains DisProt annotations that are labeled as "molecular_function" or "disorder_function" for the human proteome
- this is from the 2025-06 DisProt release
disprot_functional_annotations_per_segment.tsv
- this is a parsed version of disprot_functional_annotations.tsv
- this includes protein segment keys and their corresponding disprot functional annotations
- these labels were used for multi-label evaluations
protGPS_dataset.csv
- this is a copy of the dataset provided on DOI 10.5281/zenodo.14795444 in notebook/dataset.csv
ProtGPS_idmapping_2025_08_13.tsv
- this is the ID mapping data downloaded from UniProt to map gene names and UniProt IDs found in protGPS_dataset.csv to UniProt IDs used in ZPS
protGPS_data_only_disordered_segments.tsv
- this is the parsed version of protGPS_dataset.csv
- this includes protein IDs, train/dev/test split, a list of labels attributed to the protein, and a list of segment keys that overlap with MobiDB disorder annotations
- these labels were used for multi-label evaluations
uniprot_annotations_per_segment_multi-class.tsv
- this is a parsed version of uniprotkb_Human.txt
- this includes protein segment keys, protein IDs, gene IDs, and labels used in multi-class evaluations
- multi-class labels include:
  - PROSITE_LABELS: labels of the top ~20 most commonly occurring protein domains as annotated by ProRule on UniProt
  - IDR_VS_DOMAIN_LABELS: labels include Disordered (as annotated by MobiDB via UniProt), ProRule (as annotated by ProRule via UniProt, indicating domain), and Background (does not overlap with a MobiDB disorder annotation or a ProRule domain annotation)
  - COMP_BIAS_LABLES: labels for compositional bias annotation (as annotated by MobiDB via UniProt)
  - DISORDER_LABELS: for segments that overlap with a MobiDB disordered annotation (via UniProt), the label is the name of the other overlapping annotation with the highest intersection over union (IoU)
uniprot_annotations_per_segment_multi-label.tsv
- this is a parsed version of uniprotkb_Human.txt
- this includes protein segment keys and labels used in multi-label evaluations
Protein segment keys are formatted as "UniProtID start-stop", where start and stop positions reference the canonical protein sequence on UniProt and use zero-based indexing.
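As a minimal sketch of working with this key format, the snippet below splits a key into its UniProt ID and zero-based positions, then converts them to the one-based, inclusive positions shown on UniProt. The names `SegmentKey`, `parse_segment_key`, and `to_one_based` are hypothetical, and the example key is made up for illustration. Whether the stop position is inclusive is not stated here, so the conversion takes it as an explicit argument and does not guess.

```python
from typing import NamedTuple, Tuple


class SegmentKey(NamedTuple):
    uniprot_id: str
    start: int  # zero-based, as stored in the key
    stop: int   # zero-based, as stored in the key


def parse_segment_key(key: str) -> SegmentKey:
    """Split a key of the form 'UniProtID start-stop' into its parts."""
    # Split on the last space so the ID is kept whole.
    uniprot_id, span = key.strip().rsplit(" ", 1)
    start_text, stop_text = span.split("-", 1)
    start, stop = int(start_text), int(stop_text)
    if start < 0 or stop < start:
        raise ValueError(f"invalid segment span in key: {key!r}")
    return SegmentKey(uniprot_id, start, stop)


def to_one_based(segment: SegmentKey, stop_is_inclusive: bool) -> Tuple[int, int]:
    """Convert zero-based positions to one-based, inclusive positions.

    stop_is_inclusive is an assumption the caller must supply: check the
    data (or the paper) to see whether the stop residue belongs to the segment.
    """
    first = segment.start + 1
    # An exclusive zero-based stop s has last residue s - 1, which is s in
    # one-based numbering; an inclusive stop s becomes s + 1.
    last = segment.stop + 1 if stop_is_inclusive else segment.stop
    return first, last


# Hypothetical key, used only to show the round trip.
segment = parse_segment_key("P04637 0-93")
print(segment.uniprot_id, to_one_based(segment, stop_is_inclusive=False))
```

Keeping the inclusive/exclusive choice as a required argument makes an off-by-one error visible at the call site when comparing segments to UniProt feature coordinates.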
*see "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) on how annotations were transferred to protein segments
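The DISORDER_LABELS description above selects the overlapping annotation with the highest IoU. The sketch below shows one way such a selection could be computed for one-dimensional sequence intervals. It is an illustration and not the authors' code: the function names are hypothetical, intervals are assumed to be zero-based and half-open (stop excluded), and ties are resolved by keeping the first annotation seen.

```python
from typing import Iterable, Optional, Tuple

# (start, stop), zero-based and half-open; the half-open convention is an assumption.
Interval = Tuple[int, int]


def interval_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two intervals on the same sequence."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def best_overlapping_label(
    segment: Interval,
    annotations: Iterable[Tuple[str, Interval]],
) -> Optional[str]:
    """Return the label whose interval has the highest IoU with the segment.

    Returns None when no annotation overlaps the segment at all.
    """
    best_label, best_score = None, 0.0
    for label, span in annotations:
        score = interval_iou(segment, span)
        if score > best_score:  # strict comparison: first annotation wins ties
            best_label, best_score = label, score
    return best_label


# Hypothetical annotation names and positions, for illustration only.
annotations = [("annotation_A", (5, 15)), ("annotation_B", (0, 8))]
print(best_overlapping_label((0, 10), annotations))
```

In this example the segment (0, 10) shares 5 of 15 positions with annotation_A and 8 of 10 with annotation_B, so annotation_B is selected.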
Files (1.7 GB)

| MD5 checksum | Size |
|---|---|
| 73e091ed24712e378befd38a12590686 | 5.1 MB |
| 74cc3d6a68f6aa0683096cad713e4656 | 438.5 MB |
| f884e572e031b70b447a6fff6c6327c4 | 13.7 MB |
| 24502e65fd03e39ff26c6515ecccc179 | 3.3 MB |
| ebc8ae83bd39c1a10b09e4e1d3a99ff2 | 1.2 GB |
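The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_checksum` are hypothetical, and each file should be paired with the checksum shown next to it on the record page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for the larger files in this record.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with an expected value.

    Accepts the value with or without the 'md5:' prefix.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A result of `False` suggests re-downloading the file before using it.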
Additional details
Dates
- Created: 2025-03-03