Published June 6, 2024 | Version v2
Dataset Open

Dataset used to train ProteinCLIP

  • 1. Stanford University

Description

This dataset contains embeddings for the UniProt records used to train ProteinCLIP (see https://www.biorxiv.org/content/10.1101/2024.05.14.594226v1). It includes both embeddings from protein language models and natural language embeddings of protein function. All records are stored in HDF5 files, with identifiers as keys (e.g., P38398) and embedding arrays as values. All embeddings are given as 1-dimensional vectors.
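As a minimal sketch of reading one record from this layout, the function below opens an HDF5 file with h5py and returns the vector stored under a single identifier. It assumes each key maps directly to a dataset (not a nested group); the function name `load_embedding` and any file path passed to it are hypothetical and not taken from the dataset itself.

```python
import h5py
import numpy as np


def load_embedding(h5_path, accession):
    """Return the embedding stored under one identifier as a NumPy array.

    Assumes (not confirmed by the dataset description) that every key
    maps directly to an HDF5 dataset holding a 1-dimensional vector.
    """
    with h5py.File(h5_path, "r") as f:
        if accession not in f:
            raise KeyError(f"{accession} not found in {h5_path}")
        # Indexing with [()] reads the whole dataset into memory.
        return np.asarray(f[accession][()])
```

For example, `load_embedding("path/to/embeddings.hdf5", "P38398")` would return the vector for that record, where the path is a placeholder for one of the downloaded files.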

The following protein language models have associated embeddings:

  • ESM2, 6-layer
  • ESM2, 12-layer
  • ESM2, 30-layer
  • ESM2, 33-layer
  • ESM2, 36-layer
  • ProtT5

Some of the above protein embeddings are split into sub-files, indicated by the suffix "_splitN". Concatenating these files yields the full set of proteins; they were created separately because the embeddings were computed in parallel.
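One way to treat the split files as a single collection is to read each one and merge the records into one mapping. The sketch below assumes the splits share the same key/value layout and that a glob pattern can select them; the function name `load_split_embeddings` and the pattern are hypothetical, since the exact file names are not given here.

```python
import glob

import h5py
import numpy as np


def load_split_embeddings(pattern):
    """Merge every HDF5 file matching `pattern` into one {identifier: vector} dict.

    Everything is loaded into memory, so this is only practical when the
    combined files fit in RAM.
    """
    merged = {}
    for path in sorted(glob.glob(pattern)):
        with h5py.File(path, "r") as f:
            for accession in f.keys():
                merged[accession] = np.asarray(f[accession][()])
    return merged
```

A call might look like `load_split_embeddings("downloads/*_split*.hdf5")`, with the directory and extension adjusted to match the downloaded files.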

Text embeddings are generated by OpenAI's "text-embedding-3-large" model.

We also include the raw files used to create these embeddings: the ".dat.gz" file is the archive of UniProt annotations, including the function fields we parse to create the function text embeddings, and the ".fasta.gz" file contains the corresponding sequences.
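To illustrate what parsing the function fields can look like, the sketch below streams a gzipped UniProt flat file and collects the text of each record's `CC   -!- FUNCTION:` comment. It follows the standard UniProtKB flat-file layout (two-letter line codes, `//` as record terminator) and is not the authors' parser; the function name `iter_function_text` is hypothetical, and evidence tags inside the text are left untouched.

```python
import gzip


def iter_function_text(dat_gz_path):
    """Yield (primary_accession, function_text) for records with a FUNCTION comment."""
    topic_prefix = "CC   -!- "
    function_prefix = "CC   -!- FUNCTION:"
    continuation_prefix = "CC       "
    accession, chunks, in_function = None, [], False
    with gzip.open(dat_gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("AC   ") and accession is None:
                # The first accession on the first AC line is the primary one.
                accession = line[5:].split(";")[0].strip()
            elif line.startswith(topic_prefix):
                # A new comment topic starts; track whether it is FUNCTION.
                in_function = line.startswith(function_prefix)
                if in_function:
                    chunks.append(line[len(function_prefix):].strip())
            elif line.startswith(continuation_prefix) and in_function:
                chunks.append(line[len(continuation_prefix):].strip())
            elif line.startswith("//"):
                # End of record: emit it if any function text was collected.
                if accession and chunks:
                    yield accession, " ".join(chunks)
                accession, chunks, in_function = None, [], False
            else:
                in_function = False
```

Because it is a generator, the archive is processed record by record without loading the whole file into memory.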

Data splits used to train the ProteinCLIP models described in our preprint are contained in data_splits.json. 
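The internal structure of data_splits.json is not described here, so the following is only a hedged sketch for inspecting it: it loads the file and reports the size of each top-level entry without assuming particular key names. The function name `summarize_splits` is hypothetical.

```python
import json


def summarize_splits(json_path):
    """Load a JSON file and report the size of each top-level entry.

    No key names are assumed; containers are summarized by their length
    and scalar values are returned as-is.
    """
    with open(json_path) as f:
        splits = json.load(f)
    if isinstance(splits, dict):
        return {
            key: (len(value) if isinstance(value, (list, dict)) else value)
            for key, value in splits.items()
        }
    # Fall back to a single count when the top level is a list.
    return {"<top level>": len(splits)}
```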

Files (29.3 GB)

Checksum                                Size
md5:2e4d8bf2d270db5523c79a73a8da3b41    8.4 MB
md5:ac10cc304a6f3c7e6bebed414dfa0dcc    324.9 MB
md5:62638a06958e0b9feda902b552a7eb59    324.9 MB
md5:d034b871f672b8935050c7f8810ea8a9    324.9 MB
md5:f951d2f395bc6adaac6a9946a5c2655e    324.9 MB
md5:8df4dc7027f6b21b364191e337318985    416.1 MB
md5:a3047c74815f0eea0fe6dfe680061ac7    416.1 MB
md5:538d63aea4d221d0e4154ce08b8ed5b7    416.1 MB
md5:f3efaa9d5fc38770dd2262a98c224099    416.1 MB
md5:d7d4e9d3155ec4a3198db5ccee9c12ca    781.0 MB
md5:7e56e2f48f381245c19a40a1ac3a3597    780.9 MB
md5:f3ebf4649816ebb2870c71e107be7317    780.9 MB
md5:ad39ff074a9055a0a264531764b1f8cf    780.9 MB
md5:715d791489a63ca332fdfdbe479e0195    6.0 GB
md5:b72ea7b51d0f3f424ad22282304f323f    233.7 MB
md5:98e3aeb98123ea9bbb6cbc9eb039b010    233.7 MB
md5:3c8ff04bf4c444caedc699c62806d343    233.7 MB
md5:63d79dfcfec2ea80a2a50bc8bd03c3a3    233.7 MB
md5:69b9dd061e962ddb0002f39838f7faa8    2.5 GB
md5:bad433edc6eb6afd317d2a4f42009a71    662.8 MB
md5:74f4a3aa3c797a031e52b6a661881821    95.6 MB
md5:f6b3e5b423cc761d844cc21bff18b53d    13.0 GB
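
The md5 values listed above can be used to check that a download completed intact. A minimal sketch using Python's standard `hashlib` follows; the helper names `md5sum` and `verify_download` are hypothetical, and the file is read in chunks so that multi-gigabyte files do not need to fit in memory.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, listed_checksum):
    """Compare a file against a listed value such as 'md5:2e4d8b...'."""
    # Accept the checksum with or without the leading "md5:" label.
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5sum(path) == expected
```

`verify_download(local_path, "md5:...")` returns `True` when the local file matches the listed checksum and `False` otherwise.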

Additional details