
Published February 1, 2018 | Version 10008681
Journal article Open

Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms

Description

A bioassay measures the potency of a chemical substance by its effect on living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay predictions are calculated from these data to determine further advancement. This paper proposes a four-step preprocessing of datasets to improve bioassay predictions. The first step is instance selection, in which the dataset is divided into training, testing, and validation sets. The second step is discretization, which partitions the data in consideration of accuracy vs. precision. The third step is normalization, where data are scaled to between 0 and 1 for subsequent machine learning processing. The fourth step is feature selection, where key chemical properties and attributes are generated. The streamlined results are then analyzed for prediction effectiveness using machine learning algorithms in several tools, including Pipeline Pilot, R, Weka, and Excel. Experiments and evaluations reveal the effectiveness of various combinations of preprocessing steps and machine learning algorithms in producing more consistent and accurate predictions.
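The four preprocessing steps named in the description can be illustrated with a minimal Python sketch. The description does not say how each step is implemented, so everything specific here is an assumption for illustration only: the 60/20/20 split, equal-width binning for discretization, a variance threshold for feature selection, and the hypothetical column names `mol_weight` and `log_p`. It is not the authors' method.

```python
import random
from statistics import pvariance


def split_dataset(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Step 1 (instance selection): shuffle, then split into three sets.

    The 60/20/20 fractions are an assumption, not taken from the paper.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_test = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


def equal_width_bins(values, n_bins):
    """Step 2 (discretization): map each value to an equal-width bin index.

    Equal-width binning is an assumed stand-in for the paper's scheme.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min() keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def min_max_normalize(values):
    """Step 3 (normalization): rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - lo) / (hi - lo) for v in values]


def select_by_variance(columns, threshold):
    """Step 4 (feature selection): keep columns whose population variance
    exceeds the threshold (an assumed, deliberately simple criterion)."""
    return [name for name, vals in columns.items() if pvariance(vals) > threshold]


# Hypothetical descriptor columns, invented for this example.
columns = {
    "mol_weight": [180.0, 250.0, 320.0, 410.0],
    "log_p": [1.2, 1.2, 1.2, 1.2],
}
normalized = {name: min_max_normalize(vals) for name, vals in columns.items()}
kept = select_by_variance(normalized, threshold=0.01)
train, test, validation = split_dataset(range(10))
```

Normalizing before the variance filter puts all columns on the same 0 to 1 scale, so a single threshold can be applied across descriptors with very different units. In this toy data the constant `log_p` column is dropped and `mol_weight` is kept.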

Files

10008681.pdf (982.3 kB)
md5:41ce6027156b1bc884760fea38a2d085
