Dataset used to train ProteinCLIP
Description
This dataset contains embeddings for the UniProt records used to train ProteinCLIP (see https://www.biorxiv.org/content/10.1101/2024.05.14.594226v1). It includes both embeddings from protein language models and natural language embeddings of protein function. All records are stored in HDF5 files, with UniProt identifiers as keys (e.g., P38398) and embedding arrays as values. All embeddings are given as 1-dimensional vectors.
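As a minimal sketch of looking up one record, the snippet below assumes that the h5py and numpy packages are installed and that identifiers sit at the top level of each HDF5 file. The function names and the file path in the usage comment are hypothetical and are not part of the dataset or the ProteinCLIP code.

```python
import numpy as np


def load_embedding(store, accession):
    """Return the embedding stored under `accession` as a 1-D numpy array.

    `store` is any mapping from identifier to array-like values,
    for example an open h5py.File (assumed flat layout: one dataset per key).
    """
    if accession not in store:
        raise KeyError(f"no embedding stored for {accession!r}")
    vector = np.asarray(store[accession])
    if vector.ndim != 1:
        raise ValueError(f"expected a 1-D vector, got shape {vector.shape}")
    return vector


def read_embedding(path, accession):
    """Open an HDF5 file read-only and fetch a single embedding."""
    import h5py  # third-party; imported here so load_embedding also works on plain dicts

    with h5py.File(path, "r") as handle:
        return load_embedding(handle, accession)


# Hypothetical usage; replace the path with one of the downloaded HDF5 files:
# vector = read_embedding("path/to/embeddings.hdf5", "P38398")
# print(vector.shape)
```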
The following protein language models have associated embeddings:
- ESM2, 6-layer
- ESM2, 12-layer
- ESM2, 30-layer
- ESM2, 33-layer
- ESM2, 36-layer
- ProtT5
Some of the above protein embeddings are split into sub-files, indicated by the suffix "_splitN". Concatenating these files yields the full set of proteins; they were created separately because the embeddings were computed in parallel.
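One way to recombine the "_splitN" sub-files is sketched below. It assumes h5py and numpy are installed, that each sub-file has a flat key layout, and that no identifier appears in more than one sub-file (the sketch raises an error if one does). The helper names are hypothetical, and the merged result is held in memory, which may be impractical for the largest models.

```python
from contextlib import ExitStack

import numpy as np


def merge_splits(stores):
    """Merge several identifier -> embedding mappings into one dict.

    Assumption: split files are disjoint, so a repeated key is treated as an error.
    """
    merged = {}
    for store in stores:
        for key in store:
            if key in merged:
                raise ValueError(f"identifier {key!r} appears in more than one split")
            merged[key] = np.asarray(store[key])
    return merged


def merge_split_files(paths):
    """Open every split file read-only and merge their contents in memory."""
    import h5py  # third-party; imported lazily so merge_splits works on plain dicts

    with ExitStack() as stack:
        files = [stack.enter_context(h5py.File(p, "r")) for p in sorted(paths)]
        return merge_splits(files)


# Hypothetical usage; the glob pattern is a placeholder for the real file names:
# import glob
# embeddings = merge_split_files(glob.glob("path/to/*_split*.hdf5"))
# print(len(embeddings))
```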
Text embeddings are generated by OpenAI's "text-embedding-3-large" model.
We also include the raw files used to create these embeddings: the ".dat.gz" file is the archive of UniProt annotations, including the function fields we parse to create function text embeddings, and the ".fasta.gz" file contains the corresponding sequences.
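For illustration, function text can be pulled out of a UniProt flat file roughly as follows. This is a sketch based on the standard UniProtKB flat-file layout ("AC" accession lines, "CC   -!- FUNCTION:" comment blocks, "//" entry terminators), not the parsing code used for ProteinCLIP; the function names and the path in the usage comment are hypothetical.

```python
import gzip

FUNCTION_PREFIX = "CC   -!- FUNCTION:"
CONTINUATION_PREFIX = "CC       "


def iter_function_text(lines):
    """Yield (primary_accession, function_text) for entries with a FUNCTION comment."""
    accession, parts, in_function = None, [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("AC   ") and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("CC   -!- "):
            # A new comment topic starts; track whether it is FUNCTION.
            in_function = line.startswith(FUNCTION_PREFIX)
            if in_function:
                parts.append(line[len(FUNCTION_PREFIX):].strip())
        elif line.startswith(CONTINUATION_PREFIX) and in_function:
            parts.append(line[len(CONTINUATION_PREFIX):].strip())
        elif line.startswith("//"):
            # End of entry: emit it if any function text was collected.
            if accession is not None and parts:
                yield accession, " ".join(parts)
            accession, parts, in_function = None, [], False
        else:
            in_function = False


def read_function_text(path):
    """Stream (accession, function_text) pairs from a gzipped flat file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_function_text(handle)


# Hypothetical usage; replace the path with the downloaded ".dat.gz" file:
# for accession, text in read_function_text("path/to/annotations.dat.gz"):
#     print(accession, text[:80])
```

Evidence tags and other inline markup in the comment text are left untouched here; any cleanup applied before embedding is not specified in this description.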
Data splits used to train the ProteinCLIP models described in our preprint are contained in data_splits.json.
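The layout of data_splits.json is not described here, so the sketch below only inspects it: it assumes the file holds a JSON object at the top level and reports how many items sit under each key. The function name is hypothetical.

```python
import json


def summarize_splits(path):
    """Map each top-level key to the size of its value (None if not a list or dict).

    Assumption: the JSON file contains an object at the top level.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object at the top level")
    return {
        name: len(value) if isinstance(value, (list, dict)) else None
        for name, value in data.items()
    }


# Usage after downloading the file:
# print(summarize_splits("data_splits.json"))
```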
Files
(29.3 GB)
Checksum | Size |
---|---|
md5:2e4d8bf2d270db5523c79a73a8da3b41 | 8.4 MB |
md5:ac10cc304a6f3c7e6bebed414dfa0dcc | 324.9 MB |
md5:62638a06958e0b9feda902b552a7eb59 | 324.9 MB |
md5:d034b871f672b8935050c7f8810ea8a9 | 324.9 MB |
md5:f951d2f395bc6adaac6a9946a5c2655e | 324.9 MB |
md5:8df4dc7027f6b21b364191e337318985 | 416.1 MB |
md5:a3047c74815f0eea0fe6dfe680061ac7 | 416.1 MB |
md5:538d63aea4d221d0e4154ce08b8ed5b7 | 416.1 MB |
md5:f3efaa9d5fc38770dd2262a98c224099 | 416.1 MB |
md5:d7d4e9d3155ec4a3198db5ccee9c12ca | 781.0 MB |
md5:7e56e2f48f381245c19a40a1ac3a3597 | 780.9 MB |
md5:f3ebf4649816ebb2870c71e107be7317 | 780.9 MB |
md5:ad39ff074a9055a0a264531764b1f8cf | 780.9 MB |
md5:715d791489a63ca332fdfdbe479e0195 | 6.0 GB |
md5:b72ea7b51d0f3f424ad22282304f323f | 233.7 MB |
md5:98e3aeb98123ea9bbb6cbc9eb039b010 | 233.7 MB |
md5:3c8ff04bf4c444caedc699c62806d343 | 233.7 MB |
md5:63d79dfcfec2ea80a2a50bc8bd03c3a3 | 233.7 MB |
md5:69b9dd061e962ddb0002f39838f7faa8 | 2.5 GB |
md5:bad433edc6eb6afd317d2a4f42009a71 | 662.8 MB |
md5:74f4a3aa3c797a031e52b6a661881821 | 95.6 MB |
md5:f6b3e5b423cc761d844cc21bff18b53d | 13.0 GB |
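The md5 values in the file listing can be used to check a download for corruption. The following is a small sketch using only the Python standard library; the function names and the path in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum, with or without the 'md5:' prefix."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage; pair each downloaded file with its own listed checksum:
# ok = matches_checksum("path/to/downloaded_file", "md5:<checksum from the listing>")
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte files in this record.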
Additional details
Software
- Repository URL: https://github.com/wukevin/proteinclip