Published December 22, 2017 | Version 1.0
Dataset Open

Supplementary Data for NIPS Publication: Protein Interface Prediction using Graph Convolutional Networks.

  • 1. Colorado State University

Description

These data sets can be used to re-run the experiments from our paper, Protein Interface Prediction using Graph Convolutional Networks. The data are derived from protein complexes in the docking benchmark dataset v. 5.0. Each file is a Python tuple that has been saved using cPickle and compressed using gzip.

Links:

Paper: https://papers.nips.cc/paper/7231-protein-interface-prediction-using-graph-convolutional-networks

Poster: https://zenodo.org/record/1134154

Code: https://github.com/fouticus/pipgcn

 

File Descriptions:

train.cpkl.gz and test.cpkl.gz contain the data formatted for neighborhood-based graph convolutions. The diffc_ files contain the same data formatted for the diffusion convolutional neural networks (DCNN) that we compare against.

train.cpkl.gz is a tuple of length 2:

  • element 0 is a list of length 175 containing the PDB codes from the docking benchmark dataset
  • element 1 is a list of length 175 containing features for each protein. Each element is a dictionary containing the following keys:
    • r_vertex: vertex (residue) features for the receptor. NumPy array of shape (x, 70), where x is the number of residues in the receptor and 70 is the number of features.
    • l_vertex: vertex (residue) features for the ligand. Analogous to r_vertex, with shape (y, 70), where y is the number of residues in the ligand.
    • complex_code: PDB code of the complex. Matches the corresponding entry in the list of codes described above.
    • l_edge: edge features for the neighborhood around each residue in the ligand. NumPy array of shape (y, 20, 2), where y is defined as above. The second dimension indexes the edges to the 20 nearest neighboring residues, ordered by decreasing distance. The third dimension holds the two features per edge.
    • r_edge: edge features for the neighborhood around each residue in the receptor. NumPy array of shape (x, 20, 2), where x is defined as above.
    • l_hood_indices: the indices of the 20 closest residues to each ligand residue, ordered by decreasing distance. NumPy array of shape (y, 20, 1). "Index" means the row of l_vertex that holds the vertex features for the closest neighbor, second closest neighbor, etc.
    • r_hood_indices: analogous to l_hood_indices, for the receptor, with shape (x, 20, 1).
    • label: 1 or -1 label for each residue pair. NumPy array of shape (x*y, 3). Each row has the form (i, j, k), where i is the index of the ligand residue, j is the index of the receptor residue, and k is either -1 (negative example) or 1 (positive example).
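
As an illustration of how the hood indices and labels described above might be used, the following minimal sketch gathers the vertex features of each residue's neighbors and splits a label array into its three columns. The function names and the toy dictionary are assumptions made for this example (random values, 3 neighbors instead of 20); they are not taken from the released code.

```python
import numpy as np


def gather_neighbor_features(vertex, hood_indices):
    """Return an array of shape (n, k, f) holding each residue's neighbor features."""
    # hood_indices has shape (n, k, 1); drop the trailing axis and use the
    # result to pick rows of the (n, f) vertex feature array.
    idx = hood_indices[:, :, 0].astype(int)
    return vertex[idx]


def split_labels(label):
    """Split an (m, 3) label array into ligand indices, receptor indices, classes."""
    ligand_idx = label[:, 0].astype(int)
    receptor_idx = label[:, 1].astype(int)
    classes = label[:, 2].astype(int)
    return ligand_idx, receptor_idx, classes


# Toy stand-in for one complex dictionary (random values, not real data).
rng = np.random.default_rng(0)
y, k, f = 6, 3, 70  # the real files use k = 20 neighbors
toy = {
    "l_vertex": rng.normal(size=(y, f)),
    "l_hood_indices": rng.integers(0, y, size=(y, k, 1)),
    "label": np.array([[0, 2, 1], [1, 4, -1]]),
}

neighbors = gather_neighbor_features(toy["l_vertex"], toy["l_hood_indices"])
lig, rec, cls = split_labels(toy["label"])
print(neighbors.shape)  # (residues, neighbors, features)
```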

test.cpkl.gz has the same structure as train.cpkl.gz, except that it contains the test set of 55 complexes.
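
A quick way to confirm that a loaded entry follows the layout described above is to compare array shapes against the documented ones. The sketch below is only an assumption about how such a check could look: check_complex and make_toy_complex are hypothetical helpers, and the number of rows in label is deliberately not checked.

```python
import numpy as np


def check_complex(entry, n_neighbors=20, n_features=70, n_edge_features=2):
    """Return a list of mismatches between one complex dict and the documented shapes."""
    problems = []
    for side in ("l", "r"):
        n = entry[side + "_vertex"].shape[0]
        expected = {
            side + "_vertex": (n, n_features),
            side + "_edge": (n, n_neighbors, n_edge_features),
            side + "_hood_indices": (n, n_neighbors, 1),
        }
        for key, shape in expected.items():
            if entry[key].shape != shape:
                problems.append(
                    "%s: expected %s, found %s" % (key, shape, entry[key].shape)
                )
    label = entry["label"]
    if label.ndim != 2 or label.shape[1] != 3:
        problems.append("label: expected 3 columns, found shape %s" % (label.shape,))
    elif not np.isin(label[:, 2], (-1, 1)).all():
        problems.append("label: third column should contain only -1 or 1")
    return problems


def make_toy_complex(x=4, y=3):
    """Build a zero-filled stand-in with the documented keys (not real data)."""
    return {
        "complex_code": "TOY",
        "r_vertex": np.zeros((x, 70)),
        "l_vertex": np.zeros((y, 70)),
        "r_edge": np.zeros((x, 20, 2)),
        "l_edge": np.zeros((y, 20, 2)),
        "r_hood_indices": np.zeros((x, 20, 1), dtype=int),
        "l_hood_indices": np.zeros((y, 20, 1), dtype=int),
        "label": np.array([[0, 0, 1], [1, 2, -1]]),
    }


print(check_complex(make_toy_complex()))
```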

Descriptions of the vertex and edge features can be found in Appendix A of the paper.

diffc_g2_p2_train.cpkl.gz is a tuple of length 2:

  • element 0 is a list of the same 175 PDB codes as above. 
  • element 1 is a list of features for the 175 complexes. Each element is a dictionary of features with these keys:
    • r_vertex, l_vertex, complex_code, label: these are the same as described above. 
    • r_power_series: stacked diffusion matrices, which are powers of the similarity matrix used in the DCNN method. NumPy array of shape (x, 2, x), where x is the number of receptor residues. The middle dimension (2) indexes the number of "hops" used for the diffusion (1 or 2). In other words, element (i, 0, j) is the similarity between residues i and j after 1 hop, and element (i, 1, j) is the similarity after 2 hops. See the DCNN paper for details.
    • l_power_series: same as above but for the ligand, with shape (y, 2, y).
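
To make the (x, 2, x) layout concrete, the sketch below stacks successive powers of a small matrix so that the hop index sits in the middle dimension. The row normalization of the toy similarity matrix and the function name stack_power_series are assumptions for illustration; the exact construction used to produce these files is the one described in the DCNN paper.

```python
import numpy as np


def stack_power_series(transition, n_hops):
    """Stack transition^1 ... transition^n_hops into an array of shape (n, n_hops, n)."""
    powers = []
    current = np.eye(transition.shape[0])
    for _ in range(n_hops):
        current = current @ transition
        powers.append(current)
    # Stacking along axis 1 puts the hop index in the middle dimension,
    # so result[i, h, j] is entry (i, j) of transition^(h + 1).
    return np.stack(powers, axis=1)


# Toy symmetric similarity matrix for 4 residues (made-up values).
similarity = np.array(
    [
        [0.0, 1.0, 1.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [1.0, 1.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
)
# Assumption: normalize rows so that each row sums to 1.
transition = similarity / similarity.sum(axis=1, keepdims=True)

toy_power_series = stack_power_series(transition, n_hops=2)
print(toy_power_series.shape)
```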

diffc_g2_p2_test.cpkl.gz is the same as diffc_g2_p2_train.cpkl.gz but for the 55 test complexes.

diffc_g2_p5_train.cpkl.gz and diffc_g2_p5_test.cpkl.gz are the same as the p2 versions above, except that the diffusion matrices have shape (x, 5, x) and (y, 5, y), because one of our comparisons against the DCNN model uses 5 hops instead of 2.

 

Note: these files were pickled with Python 2.7. If you are unpickling with Python 3.x, you may need to pass encoding='latin1' to pickle.load.
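
A minimal sketch of loading one of these files under Python 3 is shown below. The helper name load_cpkl_gz is an assumption made for this example, and NumPy must be installed so that the pickled arrays can be reconstructed.

```python
import gzip
import pickle


def load_cpkl_gz(path):
    """Load a gzip-compressed pickle that was written by Python 2's cPickle."""
    with gzip.open(path, "rb") as handle:
        # encoding="latin1" lets Python 3 decode Python 2 byte strings,
        # including the raw buffers of pickled NumPy arrays.
        return pickle.load(handle, encoding="latin1")
```

For example, codes, complexes = load_cpkl_gz("train.cpkl.gz") should unpack the tuple of length 2 described above, assuming the file has been downloaded to the working directory.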

 

Please direct any questions to:

  • Alex Fout (fout@colostate.edu)
  • Jonathon Byrd (jonbyrd@colostate.edu)
  • Basir Shariat (basir@cs.colostate.edu)
  • Asa Ben-Hur (asa@cs.colostate.edu)

Notes

This work was supported by the National Science Foundation under grant no. DBI-1564840.

Files

Files (4.2 GB)

Checksum                                Size
md5:fbe377ff1db9cf5af35691d832be922b    408.8 MB
md5:c32ea60923c84714fe709ad06e9ae9fb    731.9 MB
md5:89008f14c330d60ef19a376e5c5a232d    1.1 GB
md5:09ec82a0cdc855062669a72e1d4790f7    1.9 GB
md5:801bb8582145af116a0092d8d18a3759    47.1 MB
md5:c9d461dec1c4a71a2a788051fde5b85b    51.2 MB
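
The checksums listed above can be used to verify a download. The sketch below is one possible way to do so: it computes a file's MD5 digest in chunks and checks whether it appears among the listed values. The function names are assumptions for this example, and because the listing does not pair checksums with file names, the check only tests membership in the set.

```python
import hashlib

# MD5 digests copied from the file listing above.
LISTED_MD5 = {
    "fbe377ff1db9cf5af35691d832be922b",
    "c32ea60923c84714fe709ad06e9ae9fb",
    "89008f14c330d60ef19a376e5c5a232d",
    "09ec82a0cdc855062669a72e1d4790f7",
    "801bb8582145af116a0092d8d18a3759",
    "c9d461dec1c4a71a2a788051fde5b85b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """Return True if the file's MD5 digest matches one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```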

Additional details

References

  • Protein Interface Prediction using Graph Convolutional Networks