Supplementary Data for NIPS Publication: Protein Interface Prediction using Graph Convolutional Networks.
Description
These data sets can be used to re-run the experiments from our paper, Protein Interface Prediction using Graph Convolutional Networks. The data are derived from protein complexes in the Docking Benchmark Dataset v. 5.0. Each file is a Python tuple that has been pickled with cPickle and compressed with gzip.
Links:
Paper: https://papers.nips.cc/paper/7231-protein-interface-prediction-using-graph-convolutional-networks
Poster: https://zenodo.org/record/1134154
Code: https://github.com/fouticus/pipgcn
File Descriptions:
train.cpkl.gz and test.cpkl.gz contain the data formatted for neighborhood-based graph convolutions. The diffc_ files contain the same data formatted for the diffusion-convolutional neural networks (DCNN) that we compare against.
train.cpkl.gz is a tuple of length 2:
- element 0 is a list of length 175 containing the PDB codes from the docking benchmark dataset
- element 1 is a list of length 175 containing the features for each complex. Each element is a dictionary with the following keys:
  - r_vertex: vertex (residue) features for the receptor. NumPy array of shape (x, 70), where x is the number of residues in the receptor and 70 is the number of features.
  - l_vertex: vertex (residue) features for the ligand. Analogous to r_vertex, with shape (y, 70), where y is the number of residues in the ligand.
  - complex_code: PDB code of the complex. Matches the corresponding entry in the list of codes described above.
  - l_edge: edge features for the neighborhood around each residue in the ligand. NumPy array of shape (y, 20, 2), where y is defined as above. The second dimension indexes the edges to the 20 nearest neighboring residues, ordered by decreasing distance. The third dimension holds two features per edge.
  - r_edge: edge features for the neighborhood around each residue in the receptor. NumPy array of shape (x, 20, 2), where x is defined as above.
  - l_hood_indices: the indices of the 20 closest residues to each ligand residue, ordered by decreasing distance. NumPy array of shape (y, 20, 1). "Index" means which row in l_vertex gives the vertex features for the closest neighbor, second closest neighbor, etc.
  - r_hood_indices: analogous to l_hood_indices, for the receptor. NumPy array of shape (x, 20, 1), with indices referring to rows in r_vertex.
  - label: 1 or -1 label for each residue pair. NumPy array of shape (x*y, 3). Each row has the form (i, j, k), where i is the index of the ligand residue, j is the index of the receptor residue, and k is either -1 (negative example) or 1 (positive example).
test.cpkl.gz matches the structure of train.cpkl.gz except it has the test set of 55 complexes.
Descriptions of the vertex and edge features can be found in Appendix A of the paper.
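The array layout described above can be illustrated with a small sketch. It builds a synthetic stand-in for one complex dictionary (random values, hypothetical residue counts x = 40 and y = 30) rather than reading the real files, then uses l_hood_indices to look up neighbor features and splits label into index pairs. The variable names outside the documented keys are made up for illustration, and this is not necessarily how the released code consumes the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = 40, 30  # hypothetical receptor / ligand residue counts

# Synthetic stand-in for one complex dictionary (random values, documented shapes).
pairs = np.indices((y, x)).reshape(2, -1).T      # every (ligand i, receptor j) pair
signs = rng.choice([-1, 1], size=(x * y, 1))     # random -1 / 1 labels
example = {
    "l_vertex": rng.normal(size=(y, 70)),
    "r_vertex": rng.normal(size=(x, 70)),
    "l_hood_indices": rng.integers(0, y, size=(y, 20, 1)),
    "label": np.hstack([pairs, signs]),          # shape (x*y, 3)
}

# Drop the trailing singleton axis, then use the indices as row lookups
# into l_vertex to collect the features of each residue's 20 neighbors.
hood = example["l_hood_indices"][:, :, 0]        # shape (y, 20)
neighbor_feats = example["l_vertex"][hood]       # shape (y, 20, 70)

# Each label row is (ligand index, receptor index, -1 or 1).
label = example["label"]
positives = label[label[:, 2] == 1, :2]          # (i, j) pairs labeled 1

# Concatenate ligand and receptor vertex features for every labeled pair.
pair_feats = np.hstack([
    example["l_vertex"][label[:, 0]],
    example["r_vertex"][label[:, 1]],
])                                               # shape (x*y, 140)
```

The receptor side works the same way with r_vertex and r_hood_indices.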
diffc_g2_p2_train.cpkl.gz is a tuple of length 2:
- element 0 is a list of the same 175 PDB codes as above.
- element 1 is a list of features for the 175 complexes. Each element is a dictionary of features with these keys:
  - r_vertex, l_vertex, complex_code, label: the same as described above.
  - r_power_series: stacked diffusion matrices, which are powers of the similarity matrix used in the DCNN method. NumPy array of shape (x, 2, x), where x is the number of receptor residues. The middle dimension, of size 2, indexes the number of "hops" used for that diffusion (1 vs. 2). In other words, element (i, 0, j) is the similarity between residues i and j after 1 hop, and element (i, 1, j) is the similarity after 2 hops. See the DCNN paper for details.
  - l_power_series: same as r_power_series, but for the ligand. Shape is (y, 2, y).
diffc_g2_p2_test.cpkl.gz is the same as diffc_g2_p2_train.cpkl.gz but for the 55 test complexes.
diffc_g2_p5_train.cpkl.gz and diffc_g2_p5_test.cpkl.gz are the same as the p2 versions above, except that the diffusion matrices have shape (x, 5, x) and (y, 5, y), because one of our comparisons against the DCNN model uses 5 hops instead of 2.
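To make the (x, hops, x) layout concrete, here is a minimal sketch that stacks successive powers of a small matrix. How the similarity matrix in these files was actually constructed is not described here, so the toy adjacency matrix and its row normalization are assumptions, and power_series is a hypothetical helper name.

```python
import numpy as np


def power_series(similarity, hops):
    """Stack the first `hops` powers of a square matrix into shape (n, hops, n)."""
    powers = []
    current = np.eye(similarity.shape[0])
    for _ in range(hops):
        current = current @ similarity   # next power of the matrix
        powers.append(current)
    # axis=1 places the hop index in the middle: result[i, h, j] = powers[h][i, j]
    return np.stack(powers, axis=1)


# Toy 3-residue chain; row normalization is an assumption for illustration.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
transition = adjacency / adjacency.sum(axis=1, keepdims=True)

series = power_series(transition, hops=2)   # shape (3, 2, 3)
one_hop = series[:, 0, :]                   # similarity after 1 hop
two_hops = series[:, 1, :]                  # similarity after 2 hops
```

Calling the same helper with hops=5 would give the (n, 5, n) layout of the p5 files.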
Note: these files were pickled with Python 2.7. If you unpickle them with Python 3.x, you may need to pass encoding='latin1' to pickle.load.
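A minimal sketch of loading one of these files under Python 3, assuming the file has already been downloaded to the working directory. load_cpkl_gz is a hypothetical helper name, and NumPy must be installed so that the pickled arrays can be reconstructed.

```python
import gzip
import pickle


def load_cpkl_gz(path):
    """Load a gzip-compressed pickle that was written by Python 2.7 cPickle."""
    with gzip.open(path, "rb") as handle:
        # encoding="latin1" lets Python 3 decode Python 2 str objects,
        # including the byte strings inside pickled NumPy arrays.
        return pickle.load(handle, encoding="latin1")


# Example usage (assumes train.cpkl.gz is in the current directory):
# codes, complexes = load_cpkl_gz("train.cpkl.gz")
# print(len(codes), complexes[0]["r_vertex"].shape)
```

The two-element unpacking follows the tuple layout described above: a list of PDB codes and a list of per-complex dictionaries.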
Please direct any questions to:
- Alex Fout (fout@colostate.edu)
- Jonathon Byrd (jonbyrd@colostate.edu)
- Basir Shariat (basir@cs.colostate.edu)
- Asa Ben-Hur (asa@cs.colostate.edu)
Files (4.2 GB)
MD5 checksum | Size |
---|---|
fbe377ff1db9cf5af35691d832be922b | 408.8 MB |
c32ea60923c84714fe709ad06e9ae9fb | 731.9 MB |
89008f14c330d60ef19a376e5c5a232d | 1.1 GB |
09ec82a0cdc855062669a72e1d4790f7 | 1.9 GB |
801bb8582145af116a0092d8d18a3759 | 47.1 MB |
c9d461dec1c4a71a2a788051fde5b85b | 51.2 MB |
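A downloaded file can be checked against the MD5 checksums listed above with a short sketch like the following. md5_of_file is a hypothetical helper name; the file is read in chunks because several of the archives are larger than 1 GB. Which checksum belongs to which file is not shown in the listing, so the comparison below tests membership in the full set.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file listing.
KNOWN_MD5 = {
    "fbe377ff1db9cf5af35691d832be922b",
    "c32ea60923c84714fe709ad06e9ae9fb",
    "89008f14c330d60ef19a376e5c5a232d",
    "09ec82a0cdc855062669a72e1d4790f7",
    "801bb8582145af116a0092d8d18a3759",
    "c9d461dec1c4a71a2a788051fde5b85b",
}

# Example usage (assumes train.cpkl.gz is in the current directory):
# print(md5_of_file("train.cpkl.gz") in KNOWN_MD5)
```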
Additional details
References
- Protein Interface Prediction using Graph Convolutional Networks