Published February 16, 2024 | Version v1
Dataset Open

Dataset of AlphaFold's internal representations of 4,581 proteins relevant for drug discovery

  • 1. Leiden University

Description

This dataset contains the outputs of the AlphaFold model for 4,581 proteins that are relevant targets in drug discovery.

More information on the dataset can be found in the repository listed under Software: https://github.com/andriusbern/foldedPapyrus

Dataset structure:

data/* -> main data directory

data/PID/* -> data of a single protein of length L

Filename            Description                                               Tensor shape    Lightweight
------------------  --------------------------------------------------------  --------------  -----------
single.npy          (s_i) evoformer single representation                     [L x 384]       ✔️
structure.npy       (a_i) output of the last layer of the structure module    [L x 384]       ✔️
msa.npy***          (m_si) processed MSA representation                       [N x L x 256]   -
pair.npy***         (z_ij) evoformer pair representation                      [L x L x 128]   -
PID.pdb             3D protein structure prediction                           -               ✔️
PID_unrelaxed.pdb   3D protein structure prediction w/o relaxation step (D)   -               ✔️
confidence.npy*     confidence in structure prediction (0-100)                1               ✔️
plldt.npy*          confidence in structure prediction per residue            [L]             ✔️
PID.fasta           protein amino acid sequence and metadata                  -               ✔️
timings.json        processing log                                            -               ✔️

data/PID2/* -> data of protein #2

...

*Note: L is the sequence length; N is the number of aligned sequences in the MSA.
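
As a rough illustration of how the per-protein layout described above might be read, the sketch below loads the lightweight .npy files from one data/PID directory with NumPy and checks that the per-residue arrays agree on the sequence length L. The function names (load_lightweight, check_shapes) are hypothetical and not part of the dataset or its repository; the file names and the feature width of 384 are taken from the table, and the code assumes the files are plain NumPy arrays with exactly those shapes.

```python
from pathlib import Path

import numpy as np

# File names as listed in the dataset structure table.
LIGHTWEIGHT_ARRAYS = ("single.npy", "structure.npy", "confidence.npy", "plldt.npy")


def load_lightweight(protein_dir):
    """Load whichever lightweight .npy files exist in one data/PID directory.

    Hypothetical helper: returns a dict keyed by file stem, e.g. "single".
    """
    protein_dir = Path(protein_dir)
    arrays = {}
    for name in LIGHTWEIGHT_ARRAYS:
        path = protein_dir / name
        if path.exists():  # skip missing files instead of assuming all are present
            arrays[path.stem] = np.load(path)
    return arrays


def check_shapes(arrays, feature_dim=384):
    """Return the sequence length L if the per-residue arrays are consistent.

    Assumes the shapes from the table: [L x 384] for single and structure,
    [L] for plldt. Raises ValueError on any mismatch.
    """
    per_residue = ("single", "structure", "plldt")
    lengths = {key: arrays[key].shape[0] for key in per_residue if key in arrays}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"inconsistent sequence lengths: {lengths}")
    for key in ("single", "structure"):
        if key in arrays and arrays[key].shape[-1] != feature_dim:
            raise ValueError(f"{key} has unexpected shape {arrays[key].shape}")
    return next(iter(lengths.values()), None)
```

A call such as check_shapes(load_lightweight("data/PID")), with PID replaced by an actual protein directory name, would then return L for that protein. If the stored arrays carry extra dimensions not shown in the table, the checks would need adjusting.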

Files (6.1 GB)

FoldedPapyrus_4581_v01.zip (6.1 GB)
md5:4bccb348b2a0dfed4f0e1b0f9d9253f4
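
Since the archive is several gigabytes, it may be worth checking the download against the listed MD5 checksum before unpacking. The following is a minimal sketch using only the Python standard library; the helper names (md5_of_file, verify_archive) are hypothetical, and the file is read in chunks so the whole archive does not need to fit in memory.

```python
import hashlib

# Checksum listed for FoldedPapyrus_4581_v01.zip on this record.
EXPECTED_MD5 = "4bccb348b2a0dfed4f0e1b0f9d9253f4"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, verify_archive("FoldedPapyrus_4581_v01.zip") should return True for an intact download, assuming the archive was saved under that name in the working directory.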

Additional details

Software

Repository URL
https://github.com/andriusbern/foldedPapyrus
Programming language
Python