Published June 4, 2018 | Version v1.0.0
Dataset Open

Improving the drug discovery process by using multiple classifier systems

Authors/Creators

  • 1. University of Vigo

Description

High-quality dataset gathered from ChEMBL version 22 for the target with UniProt accession P34972. Regarding the activity data, potential duplicates were ignored, records carrying an activity comment or a data validity comment were excluded, and only data from binding assays with a pChEMBL value were kept. This led to a dataset composed of 3925 chemical compounds (instances) represented using 2132 features. The first 2048 features encode chemical structure fingerprints (in FCFP_6 notation), while the remaining 84 are physicochemical descriptors (such as Fractional Polar Surface Area, Rotatable Bonds or Molecular Weight). Finally, the set was transformed into a binary classification set, with the activity cut-off defined at a pChEMBL value > 7, and written to a tab-delimited text file. The final set contained 1977 active compounds and 1948 inactive compounds. Table 3 shows the codification of each feature grouped by type.
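The feature layout and the activity cut-off described above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that each compound is available as a flat sequence of 2132 feature values with the fingerprint bits first, and that a raw pChEMBL value is at hand for labelling. The names `split_features` and `label_activity` are hypothetical helpers introduced only for this illustration. The position of the class label in the published file is not stated in this description, so the sketch does not read the file itself.

```python
# Sizes taken from the dataset description.
N_FINGERPRINT = 2048   # FCFP_6 fingerprint features
N_DESCRIPTORS = 84     # physicochemical descriptors
ACTIVITY_CUTOFF = 7.0  # active if pChEMBL > 7


def split_features(features):
    """Split one compound's feature vector into (fingerprint, descriptors).

    Assumes the fingerprint block comes first, as the description states.
    """
    expected = N_FINGERPRINT + N_DESCRIPTORS
    if len(features) != expected:
        raise ValueError(f"expected {expected} features, got {len(features)}")
    return features[:N_FINGERPRINT], features[N_FINGERPRINT:]


def label_activity(pchembl):
    """Return 1 (active) if pChEMBL is strictly above the cut-off, else 0."""
    return 1 if pchembl > ACTIVITY_CUTOFF else 0
```

Because the cut-off is a strict inequality, a compound with a pChEMBL value of exactly 7 would be labelled inactive under this reading.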

Files

d4n_corpus_physchem.csv

Files (17.8 MB)

Name: d4n_corpus_physchem.csv
Size: 17.8 MB
Checksum: md5:9f5da5cd7ab2172624dad674dd4ee970