Zero-Shot Protein Segmentation (ZPS) Data and Embeddings

Sangster, Ami G.; Dufault, Cameron; Qu, Haoning; Le, Denise; Forman-Kay, Julie; Moses, Alan

doi:10.5281/zenodo.17458967

Published October 2025 | Version v2

Dataset Open

1. University of Toronto
2. Hospital for Sick Children

uniprotkb_Human.txt

this is a raw text file that contains a downloaded copy of UniProtKB
this includes all reviewed human protein sequences
we used annotations from this file to compare to ZPS predictions

uniprotkb_Human_Sequences.fasta

this is a fasta file that contains reviewed human protein sequences
these are the sequences we used as input to ProtT5 to generate protein embeddings
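
A minimal sketch of reading the FASTA file with the Python standard library is given here. It assumes the headers follow the usual UniProt "db|ACCESSION|ENTRY_NAME" layout, which this record does not confirm; the function names and the toy records are hypothetical and are not taken from the dataset.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def uniprot_accession(header: str) -> str:
    """Return ACCESSION from 'db|ACCESSION|ENTRY_NAME ...' (assumed layout).

    Falls back to the first whitespace-delimited token for other layouts.
    """
    tokens = header.split()
    first = tokens[0] if tokens else ""
    parts = first.split("|")
    return parts[1] if len(parts) >= 3 else first


# Toy records, invented for illustration only.
TOY_FASTA = """>sp|P00001|TOY1_HUMAN Toy protein one
MKT
AYI
>sp|P00002|TOY2_HUMAN Toy protein two
GGS
"""

sequences = {
    uniprot_accession(header): sequence
    for header, sequence in read_fasta(io.StringIO(TOY_FASTA))
}
print(sequences)
```

For the real file, replace io.StringIO(TOY_FASTA) with open("uniprotkb_Human_Sequences.fasta").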

ZPS_Boundaries.tsv

this is a tab separated file that contains the boundaries of protein segments defined by ZPS for reviewed human protein sequences
we used zero-based indexing for the protein boundaries
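
UniProt feature annotations use one-based, inclusive positions, so comparing them with zero-based boundaries requires a conversion. The sketch below shows one way to do it. Whether the ZPS stop position is inclusive or exclusive is not stated in this description, so stop_inclusive is a hypothetical parameter and should be set after checking the paper or the file itself.

```python
from typing import Tuple


def one_based_to_zero_based(
    start: int, end: int, stop_inclusive: bool = False
) -> Tuple[int, int]:
    """Convert a one-based, inclusive range to zero-based coordinates.

    stop_inclusive=False returns a half-open range [start, stop),
    which matches Python slicing. stop_inclusive=True returns a
    closed range [start, stop]. Which one matches ZPS is an assumption.
    """
    if start < 1 or end < start:
        raise ValueError(f"invalid one-based range: {start}-{end}")
    zero_start = start - 1
    zero_stop = end - 1 if stop_inclusive else end
    return zero_start, zero_stop


# Residues 1..10 of a protein (one-based, inclusive).
half_open = one_based_to_zero_based(1, 10)
closed = one_based_to_zero_based(1, 10, stop_inclusive=True)
print(half_open, closed)
```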

ZPS_Segment_Embeddings.hdf5

this is an HDF5 file that contains segment embeddings for the human proteome
see "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" A. G. Sangster 2025 for definition of segment embeddings
segment boundaries in this file are also in zero-based indexing

evaluation_data.zip includes:

disprot_functional_annotations.tsv

this contains DisProt annotations that are labeled as "molecular_function" or "disorder_function" for the human proteome
this is from the 2025-06 DisProt release

disprot_functional_annotations_per_segment.tsv

this is a parsed version of disprot_functional_annotations.tsv
this includes protein segment keys and their corresponding DisProt functional annotations
these labels were used for multi-label evaluations

protGPS_dataset.csv

this is a copy of the dataset provided at DOI 10.5281/zenodo.14795444 as notebook/dataset.csv

ProtGPS_idmapping_2025_08_13.tsv

this is the ID mapping data downloaded from UniProt to map gene names and UniProt IDs found in protGPS_dataset.csv to UniProt IDs used in ZPS

protGPS_data_only_disordered_segments.tsv

this is the parsed version of protGPS_dataset.csv
this includes protein IDs, train/dev/test split, a list of labels attributed to the protein, and a list of segment keys that overlap with MobiDB disorder annotations
these labels were used for multi-label evaluations

uniprot_annotations_per_segment_multi-class.tsv

this is a parsed version of uniprotkb_Human.txt
this includes protein segment keys, protein IDs, gene IDs, and labels used in multi-class evaluations
multi-class labels include:
PROSITE_LABELS: labels of the top ~20 most commonly occurring protein domains as annotated by ProRule on UniProt
IDR_VS_DOMAIN_LABELS: labels include Disordered (as annotated by MobiDB via UniProt), ProRule (as annotated by ProRule via UniProt, indicating domain), and Background (does not overlap with a MobiDB disorder annotation or a ProRule domain annotation)
COMP_BIAS_LABLES: labels for compositional bias annotation (as annotated by MobiDB via UniProt)
DISORDER_LABELS: for segments that overlap with a MobiDB disordered annotation (via UniProt), take the name of the other overlapping annotation with the highest IoU
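
The DISORDER_LABELS rule (take the overlapping annotation with the highest IoU) can be sketched as follows. This is an illustration, not the authors' code: it assumes zero-based, half-open intervals and keeps the first annotation on ties, and the annotation names in the example are invented.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]  # (start, stop), half-open: an assumption


def interval_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two half-open integer intervals."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def best_overlapping_label(
    segment: Interval, annotations: Iterable[Tuple[str, int, int]]
) -> Optional[str]:
    """Return the name of the annotation with the highest IoU.

    Returns None when nothing overlaps the segment. On ties the
    first annotation encountered is kept (an arbitrary choice).
    """
    best_label = None
    best_iou = 0.0
    for label, start, stop in annotations:
        iou = interval_iou(segment, (start, stop))
        if iou > best_iou:
            best_label, best_iou = label, iou
    return best_label


# Hypothetical annotations on one protein.
toy_annotations = [
    ("Region A", 0, 20),   # IoU with (10, 50) is 10 / 50
    ("Region B", 15, 45),  # IoU with (10, 50) is 30 / 40
    ("Region C", 60, 80),  # no overlap
]
chosen = best_overlapping_label((10, 50), toy_annotations)
print(chosen)
```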

uniprot_annotations_per_segment_multi-label.tsv

this is a parsed version of uniprotkb_Human.txt
this includes protein segment keys and labels used in multi-label evaluations

Protein segment keys are formatted as "UniProtID start-stop", where start and stop positions reference the canonical protein sequence on UniProt and use zero-based indexing.
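
A minimal sketch of parsing a segment key in the "UniProtID start-stop" format is given here. It assumes the ID and the range are separated by whitespace, as written above. Whether stop is inclusive is not stated in this description, so stop_inclusive is a hypothetical parameter, and the key and sequence in the example are invented.

```python
from typing import NamedTuple


class SegmentKey(NamedTuple):
    uniprot_id: str
    start: int  # zero-based
    stop: int   # zero-based; inclusive or exclusive is an assumption


def parse_segment_key(key: str) -> SegmentKey:
    """Split 'UniProtID start-stop' into its three parts."""
    # Split on the last run of whitespace so that IDs containing
    # a hyphen are not confused with the start-stop range.
    uniprot_id, span = key.strip().rsplit(None, 1)
    start_text, stop_text = span.split("-", 1)
    start, stop = int(start_text), int(stop_text)
    if start < 0 or stop < start:
        raise ValueError(f"invalid segment range in key: {key!r}")
    return SegmentKey(uniprot_id, start, stop)


def slice_segment(
    sequence: str, key: SegmentKey, stop_inclusive: bool = False
) -> str:
    """Return the residues covered by a segment key."""
    end = key.stop + 1 if stop_inclusive else key.stop
    return sequence[key.start:end]


# Hypothetical key and sequence for illustration.
toy_key = parse_segment_key("Q00000 2-5")
print(toy_key)
print(slice_segment("ABCDEFGH", toy_key))
print(slice_segment("ABCDEFGH", toy_key, stop_inclusive=True))
```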

*see "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) for how annotations were transferred to protein segments

Files (1.7 GB)

Name	MD5	Size
evaluation_data.zip	73e091ed24712e378befd38a12590686	5.1 MB
uniprotkb_Human.txt	74cc3d6a68f6aa0683096cad713e4656	438.5 MB
uniprotkb_Human_Sequences.fasta	f884e572e031b70b447a6fff6c6327c4	13.7 MB
ZPS_Boundaries.tsv	24502e65fd03e39ff26c6515ecccc179	3.3 MB
ZPS_Segment_Embeddings.hdf5	ebc8ae83bd39c1a10b09e4e1d3a99ff2	1.2 GB
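
The MD5 checksums listed for these files can be used to check a download. The sketch below uses only the Python standard library; the function names and the idea of a single download directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "evaluation_data.zip": "73e091ed24712e378befd38a12590686",
    "uniprotkb_Human.txt": "74cc3d6a68f6aa0683096cad713e4656",
    "uniprotkb_Human_Sequences.fasta": "f884e572e031b70b447a6fff6c6327c4",
    "ZPS_Boundaries.tsv": "24502e65fd03e39ff26c6515ecccc179",
    "ZPS_Segment_Embeddings.hdf5": "ebc8ae83bd39c1a10b09e4e1d3a99ff2",
}


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for the 1.2 GB HDF5 file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(
    directory: Union[str, Path], expected: Dict[str, str] = EXPECTED_MD5
) -> Dict[str, bool]:
    """Map each file name to True if it exists and its checksum matches."""
    results = {}
    for name, wanted in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == wanted
    return results
```

Calling verify_downloads(".") in the download directory returns a dictionary of file names to True or False; a False value means the file is missing or its checksum differs.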

Additional details

Created: 2025-03-03

	All versions	This version
Views	194	102
Downloads	380	190
Data volume	144.3 GB	63.0 GB
