Improving the drug discovery process by using multiple classifier systems
Description
High-quality dataset gathered from ChEMBL version 22 based on UniProt accession P34972. Regarding activity data, potential duplicates were ignored, records carrying activity comments or data validity comments were excluded, and only data from binding assays with a pChEMBL value were kept. This led to a dataset composed of 3925 chemical compounds (instances) represented using 2132 features. The first 2048 features encode chemical structure fingerprints (represented using FCFP_6 notation), while the remaining 84 are physicochemical descriptors (such as Fractional Polar Surface Area, Rotatable Bonds or Molecular Weight). Finally, the set was converted into a binary classification set, with compounds labelled active at a pChEMBL value > 7, and written to a tab-delimited text file. The final set contained 1977 active compounds and 1948 inactive compounds. Table 3 shows the codification of each feature grouped by type.
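The labelling rule and feature layout described above can be illustrated with a minimal sketch. It assumes only what the description states (a strict pChEMBL > 7 cut-off, and 2048 fingerprint features followed by 84 physicochemical descriptors). The function names `label_activity` and `split_features` are hypothetical, the values are toy data, and the actual column names and any identifier or label columns in the file are not documented here, so they should be checked before applying this to the real file.

```python
# Sizes and cut-off taken from the dataset description.
N_FINGERPRINT = 2048  # FCFP_6 fingerprint features
N_PHYSCHEM = 84       # physicochemical descriptors
CUTOFF = 7.0          # a compound is active when pChEMBL > 7


def label_activity(pchembl, cutoff=CUTOFF):
    """Return 1 (active) if pChEMBL is strictly above the cut-off, else 0."""
    return 1 if pchembl > cutoff else 0


def split_features(row):
    """Split one 2132-value feature row into (fingerprint, physchem) parts.

    Assumes the row holds features only, fingerprints first, with no
    identifier or label column (hypothetical layout).
    """
    expected = N_FINGERPRINT + N_PHYSCHEM
    if len(row) != expected:
        raise ValueError(f"expected {expected} features, got {len(row)}")
    return row[:N_FINGERPRINT], row[N_FINGERPRINT:]


# Toy values, not taken from the dataset.
labels = [label_activity(v) for v in (6.2, 7.0, 7.1, 9.3)]
fingerprint, physchem = split_features([0] * N_FINGERPRINT + [1.5] * N_PHYSCHEM)
```

Because the cut-off is strict, a compound with a pChEMBL value of exactly 7 is labelled inactive in this sketch.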
Files
| Name | Size | MD5 |
|---|---|---|
| d4n_corpus_physchem.csv | 17.8 MB | 9f5da5cd7ab2172624dad674dd4ee970 |