Published June 7, 2021 | Version 1.1.0
Dataset Open

DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction

  • 1. University of Missouri
  • 2. Oak Ridge National Laboratory

Description

This dataset contains replication data for the paper titled "DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction". It consists of pickled Pandas DataFrame files, along with training, validation, and (for DB5-Plus) test filename lists for cross-validation, that can be used to develop and evaluate protein interface prediction models. For users' convenience, the dataset also contains the externally generated residue-level PSAIA and HH-suite3 features (e.g. raw MSAs and profile HMMs for each protein complex). Our GitHub repository, linked in the "Additional notes" metadata section below, gives more detail on how we parsed these files to create our cross-validation datasets. The repository also includes scripts that can be used to impute missing feature values and to convert the final "raw" complexes into DGL-compatible graph objects. Because the final DGL graph representation of each complex builds its residue embeddings from PyTorch tensors, that representation can easily be adapted to fit users' needs (e.g. feeding a complex's 2D residue feature tensors into a convolutional neural network).
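As a minimal sketch of how one of the pickled files might be inspected once it has been extracted, the snippet below reads a file with Python's standard `pickle` module and reports basic facts about the object it contains. The helper names `load_pickled_object` and `summarize` and the example path are hypothetical and are not part of DIPS-Plus; the actual file names, extensions, and directory layout are documented in the GitHub repository. Reconstructing a Pandas DataFrame requires pandas to be installed, and `pandas.read_pickle` is an equivalent convenience call. Unpickling can execute arbitrary code, so only load files obtained from a trusted source.

```python
import pickle


def load_pickled_object(path):
    """Read one pickled object from disk.

    For DIPS-Plus files this is expected to be a Pandas DataFrame,
    which requires pandas to be installed so the class can be rebuilt.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Return the object's type name plus its shape and column names.

    The shape and columns entries are None when the object does not
    expose those attributes (for example, when it is not a DataFrame).
    """
    columns = getattr(obj, "columns", None)
    return {
        "type": type(obj).__name__,
        "shape": getattr(obj, "shape", None),
        "columns": None if columns is None else list(columns),
    }


# Hypothetical usage; replace the path with a real extracted file:
# frame = load_pickled_object("path/to/extracted/complex_file")
# print(summarize(frame))
```

Inspecting the type and column names first is a cheap way to confirm that a file holds what is expected before passing it to the imputation or graph-conversion scripts in the repository.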

Notes

This dataset can be updated periodically using the instructions contained in our GitHub repository for DIPS-Plus (https://github.com/BioinfoMachineLearning/DIPS-Plus). For data provenance, the complexes curated for DIPS-Plus originate from the RCSB's bound protein complex repository (https://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/divided/).

Files

Files (44.6 GB)

MD5 checksum                          Size
md5:9cbc07672e705f9ba8549168b06e1e06  42.8 MB
md5:775576dbb27cbe0127419dbdaa6c36d1  7.7 GB
md5:a61aea4af023abd17b6ddd19863c0ffc  6.4 kB
md5:cb1283cf1fb91a586d786a8e2c53053c  2.3 MB
md5:893fa1d932bbb0738f093ba634155d09  291.8 MB
md5:04088a0afca2107c0418868bb4380fb0  4.3 GB
md5:f7f14525ea07aabbadc52af25917e82b  4.3 GB
md5:afe62360640af90b4fc52c4044c84b4c  4.3 GB
md5:f132e558ebebf2d2d2a0765022d4c3f3  4.3 GB
md5:259ceccd4e2397e17712606f5e43f3e0  4.3 GB
md5:a4d8493d22652781225a3af3ef2ae724  4.3 GB
md5:0547b5b72b3912c22f6036f843a05f2a  4.3 GB
md5:072be5754b4c27241e761878a42647dd  3.7 GB
md5:fd17825eafd0bee22daddf1475336929  15.4 MB
md5:2925bba15a1f04b70f437fde982e4717  2.8 GB
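Each file above is listed with an MD5 checksum, which can be used to confirm that a multi-gigabyte download completed without corruption. The following is a minimal sketch using only Python's standard library; the function names `md5_of_file` and `verify_download` and the example path are hypothetical, and the file is read in chunks so that it never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Compare a file's MD5 with a listed value such as 'md5:9cbc...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage; the local path is an assumption, the checksum is
# the one listed for the 42.8 MB file:
# ok = verify_download("path/to/download",
#                      "md5:9cbc07672e705f9ba8549168b06e1e06")
```

A mismatch usually indicates a truncated or corrupted transfer, in which case the file should be downloaded again before it is extracted or unpickled.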

Additional details

Funding

  • National Science Foundation, award 1763246: "III: Medium: Collaborative Research: Guiding Exploration of Protein Structure Spaces with Deep Learning"
  • National Science Foundation, award 1759934: "ABI Innovation: Deep learning methods for protein bioinformatics"