Data sets containing promiscuous PAINS (PROM_PAINS) and dark chemical matter (DCM) PAINS are provided in the tab-separated files PROM_PAINS.DAT and DCM_PAINS.DAT, which contain 5223 and 3059 unique compounds, respectively. These data sets were used for building global classification models. Both files list PubChem compound ids (cid) in the first column, the SMILES representation of the compound in the second column, the PAINS rules present in the compound in the third column (PAINS_rule) and an array of ECFP4 features for the compound in the fourth column. Where multiple PubChem cids correspond to one SMILES notation, the cids are separated by '_'.
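The column layout described above can be read with Python's standard csv module. The following is a minimal sketch, not the authors' code: the function name read_pains_rows, the has_header flag and the sample rows are hypothetical, and whether the files start with a header row is an assumption that should be checked against the actual files. The ECFP4 column is kept as a raw string because its exact serialization is not described here.

```python
import csv
import io


def read_pains_rows(handle, has_header=True):
    """Yield (cids, smiles, pains_rule, ecfp4_raw) from a tab-separated handle.

    has_header is an assumption; set it to False if the file has no header row.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or incomplete lines
        cid_field, smiles, pains_rule, ecfp4_raw = row[:4]
        # Several cids that share one SMILES are joined with '_'.
        cids = cid_field.split("_")
        yield cids, smiles, pains_rule, ecfp4_raw


# Made-up sample standing in for PROM_PAINS.DAT (not real data).
sample = io.StringIO(
    "cid\tSMILES\tPAINS_rule\tECFP4\n"
    "111_222\tCCO\trule_a\t[1, 5]\n"
    "333\tc1ccccc1\trule_b\t[2]\n"
)
rows = list(read_pains_rows(sample))
print(rows[0][0])
```

For the real files, the same function could be called on a handle opened with open("PROM_PAINS.DAT", newline="").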
In addition, balanced training and test data sets are provided in the tab-separated files Balanced_training_set.DAT and Balanced_test_set.DAT, respectively. Each file lists the PubChem compound id (cid) in the first column, a label of 1 for PROM_PAINS compounds and 0 for DCM_PAINS compounds in the second column and the SMILES representation in the third column. The balanced training and test sets consist of 1900 and 1822 unique compounds, respectively.
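As a quick sanity check on the balanced files, the labels in the second column can be tallied per class. This is a sketch under assumptions: count_labels and has_header are hypothetical names, a header row is assumed, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_labels(handle, has_header=True):
    """Count class labels (second column) in a tab-separated balanced set."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or incomplete lines
        counts[int(row[1])] += 1  # 1 = PROM_PAINS, 0 = DCM
    return counts


# Made-up sample standing in for Balanced_training_set.DAT (not real data).
sample = io.StringIO(
    "cid\tlabel\tSMILES\n"
    "111\t1\tCCO\n"
    "222\t0\tc1ccccc1\n"
    "333\t1\tCCN\n"
)
counts = count_labels(sample)
print(counts[1], counts[0])
```

On the real files, similar counts for labels 1 and 0 would be consistent with the sets being balanced.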
Support vector machine models built on the original and balanced data sets are provided in Global_svm_model.p and Balanced_svm_model.p, respectively. Both files are pickled models built with scikit-learn in Python. The models were built with scikit-learn version 0.19.1, numpy version 1.14.5 and pandas version 0.23.3 in Python 3.6.
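Pickled files of this kind are typically restored with Python's pickle module. The sketch below uses a hypothetical helper, load_pickled_model, and demonstrates it on a placeholder object written to a temporary file rather than on the actual model files. Unpickling can execute arbitrary code, so only files from a trusted source should be loaded, and scikit-learn pickles are not guaranteed to load under versions other than the one that created them.

```python
import os
import pickle
import tempfile


def load_pickled_model(path):
    """Load a pickled object; only use on files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Demonstration with a placeholder object, not an actual SVM model.
stand_in = {"kind": "placeholder"}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_model.p")
    with open(demo_path, "wb") as fh:
        pickle.dump(stand_in, fh)
    restored = load_pickled_model(demo_path)

print(restored)
```

With scikit-learn installed in a compatible environment (ideally the versions listed above), load_pickled_model("Global_svm_model.p") would be expected to return the fitted model object. The input format the models expect for prediction is not described here and should be confirmed before use.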
In addition, a table of ECFP4 features is provided in ECFP4_feature_table.DAT, containing the ids (idx) of 19668 ECFP4 features and their corresponding SMARTS patterns.