Published October 2025 | Version v2
Dataset Open

Zero-Shot Protein Segmentation (ZPS) Data and Embeddings

  • 1. University of Toronto
  • 2. Hospital for Sick Children

Description

uniprotkb_Human.txt

  • this is a raw text file that contains a downloaded copy of UniProtKB
  • this includes all reviewed human protein sequences
  • we used annotations from this file to compare with ZPS predictions

 

uniprotkb_Human_Sequences.fasta

  • this is a FASTA file that contains reviewed human protein sequences
  • these are the sequences we used as input to ProtT5 to generate protein embeddings
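
As a minimal sketch of how the sequences could be loaded before embedding, the reader below maps each accession to its sequence using only the Python standard library. It assumes the headers follow the standard UniProt FASTA layout (">db|Accession|EntryName ..."), which this record does not state explicitly; the function name read_uniprot_fasta is hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def read_uniprot_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of accession -> sequence from FASTA-formatted lines.

    Assumption: headers look like ">sp|P04637|P53_HUMAN ...". If a header
    does not have that shape, its first whitespace-delimited token is used.
    """
    chunks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts = header.split("|")
            current = parts[1] if len(parts) >= 3 else header.split()[0]
            chunks[current] = []
        elif current is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks[current].append(line)
    return {accession: "".join(parts) for accession, parts in chunks.items()}


# Usage (the file name is taken from this record):
# with open("uniprotkb_Human_Sequences.fasta") as handle:
#     sequences = read_uniprot_fasta(handle)
```

Keying by accession is a design choice made here so that sequences can be looked up from the UniProt ID part of a protein segment key.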

 

ZPS_Boundaries.tsv

  • this is a tab-separated file that contains the boundaries of protein segments defined by ZPS for reviewed human protein sequences
  • we used zero-based indexing for the protein boundaries

 

ZPS_Segment_Embeddings.hdf5

  • this is an HDF5 file that contains segment embeddings for the human proteome
  • see "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) for the definition of segment embeddings
  • segment boundaries in this file are also in zero-based indexing

 

 

evaluation_data.zip includes:

disprot_functional_annotations.tsv

  • this contains DisProt annotations that are labeled as "molecular_function" or "disorder_function" for the human proteome
  • this is from the 2025-06 DisProt release

 

disprot_functional_annotations_per_segment.tsv

  • this is a parsed version of disprot_functional_annotations.tsv 
  • this includes protein segment keys and their corresponding DisProt functional annotations
  • these labels were used for multi-label evaluations

 

protGPS_dataset.csv

 

ProtGPS_idmapping_2025_08_13.tsv

  • this is the ID mapping data downloaded from UniProt to map gene names and UniProt IDs found in protGPS_dataset.csv to UniProt IDs used in ZPS 

 

protGPS_data_only_disordered_segments.tsv

  • this is the parsed version of protGPS_dataset.csv 
  • this includes protein IDs, train/dev/test split, a list of labels attributed to the protein, and a list of segment keys that overlap with MobiDB disorder annotations
  • these labels were used for multi-label evaluations

 

uniprot_annotations_per_segment_multi-class.tsv

  • this is a parsed version of uniprotkb_Human.txt
  • this includes protein segment keys, protein IDs, gene IDs, and labels used in multi-class evaluations
  • multi-class labels include:
      • PROSITE_LABELS: labels of the top ~20 most commonly occurring protein domains as annotated by ProRule on UniProt
      • IDR_VS_DOMAIN_LABELS: labels include Disordered (as annotated by MobiDB via UniProt), ProRule (as annotated by ProRule via UniProt, indicating a domain), and Background (does not overlap with a MobiDB disorder annotation or a ProRule domain annotation)
      • COMP_BIAS_LABLES: labels for compositional bias annotations (as annotated by MobiDB via UniProt)
      • DISORDER_LABELS: for segments that overlap with a MobiDB disorder annotation (via UniProt), the label is the name of the other overlapping annotation with the highest IoU (intersection over union)
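
The highest-IoU rule mentioned for DISORDER_LABELS can be illustrated with a short sketch. This is not the authors' code: it assumes intervals are zero-based with an exclusive stop, and the names interval_iou and best_overlapping_label are hypothetical. The cited paper remains the reference for how annotations were actually transferred.

```python
from typing import Iterable, Optional, Tuple

# (start, stop) in zero-based coordinates; an exclusive stop is an assumption.
Interval = Tuple[int, int]


def interval_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two 1D intervals."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def best_overlapping_label(
    segment: Interval,
    annotations: Iterable[Tuple[str, Interval]],
) -> Optional[str]:
    """Return the name of the annotation with the highest IoU, or None
    if no annotation overlaps the segment at all."""
    best_name: Optional[str] = None
    best_iou = 0.0
    for name, span in annotations:
        iou = interval_iou(segment, span)
        if iou > best_iou:
            best_name, best_iou = name, iou
    return best_name
```

With this sketch, ties keep the first annotation encountered; whether the original pipeline breaks ties the same way is not stated in this record.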

 

uniprot_annotations_per_segment_multi-label.tsv

  • this is a parsed version of uniprotkb_Human.txt
  • this includes protein segment keys and labels used in multi-label evaluations

 

Protein segment keys are formatted as "UniProtID start-stop", where the start and stop positions reference the canonical protein sequence on UniProt and use zero-based indexing.
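
A segment key in this format can be split into its parts with a few lines of Python. The sketch below assumes a single space separates the UniProt ID from the "start-stop" span, as in the quoted format. Whether the stop position is inclusive or exclusive is not stated here, so it is exposed as a parameter rather than guessed; SegmentKey, parse_segment_key, and segment_sequence are hypothetical names.

```python
from typing import NamedTuple


class SegmentKey(NamedTuple):
    uniprot_id: str
    start: int  # zero-based
    stop: int   # zero-based; inclusive vs. exclusive is not stated in this record


def parse_segment_key(key: str) -> SegmentKey:
    """Parse a key of the form "UniProtID start-stop"."""
    uniprot_id, span = key.strip().rsplit(" ", 1)
    start, stop = span.split("-")
    return SegmentKey(uniprot_id, int(start), int(stop))


def segment_sequence(sequence: str, key: SegmentKey, stop_inclusive: bool = False) -> str:
    """Slice the canonical sequence for a segment.

    Set stop_inclusive=True if the stop position turns out to be the last
    residue of the segment rather than one past it.
    """
    end = key.stop + 1 if stop_inclusive else key.stop
    return sequence[key.start:end]
```

When comparing against UniProt feature coordinates, which are one-based, the zero-based start needs to be incremented by one; the stop conversion depends on the inclusive or exclusive convention noted above.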

*see "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) for how annotations were transferred to protein segments

 

Files

evaluation_data.zip

Files (1.7 GB)

Name Size
md5:73e091ed24712e378befd38a12590686 5.1 MB
md5:74cc3d6a68f6aa0683096cad713e4656 438.5 MB
md5:f884e572e031b70b447a6fff6c6327c4 13.7 MB
md5:24502e65fd03e39ff26c6515ecccc179 3.3 MB
md5:ebc8ae83bd39c1a10b09e4e1d3a99ff2 1.2 GB
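
The md5 checksums in the listing above can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library; md5_of_file and matches_listed_md5 are hypothetical helper names, and the pairing of each checksum with its file has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks so that
    large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as "md5:<hex>" or "<hex>"."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage: pass the checksum shown next to the file on the record page.
# ok = matches_listed_md5("evaluation_data.zip", "md5:<checksum from the listing>")
```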

Additional details

Dates

Created
2025-03-03