Handling missing data, censored values and measurement error in machine learning models using multiple imputation for early stage drug discovery
Description
Multiple imputation is a technique for handling missing data, censored values and measurement error. It is currently underused in machine learning owing to a lack of familiarity and experience with the technique, while other missing-data solutions, such as full Bayesian models, can be hard to set up. However, randomization-based evaluations of Bayesianly derived repeated imputations can provide approximately valid inference for the posterior distributions and allow the use of techniques that rely on complete data, such as SVMs and random forest models.
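As a rough illustration of that workflow (not the notebook's code; the data set, column count and missingness rate below are invented), the sketch uses scikit-learn's IterativeImputer with sample_posterior=True to draw several imputed copies of a toy data set, fits a random forest to each copy, and pools the predictions across imputations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data standing in for assay measurements, with ~15% of X missing at random.
coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(200, 5))
y = X @ coef + rng.normal(scale=0.5, size=200)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.15] = np.nan

X_test = rng.normal(size=(50, 5))  # complete held-out data

m = 10  # number of imputations
predictions = []
for i in range(m):
    # sample_posterior=True draws each imputation from the approximate
    # posterior predictive of a Bayesian ridge model, so the m data sets differ.
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X_miss)
    forest = RandomForestRegressor(n_estimators=200, random_state=i)
    forest.fit(X_imp, y)
    predictions.append(forest.predict(X_test))

# Pool across imputations: the mean is the point prediction, and the spread
# between imputations reflects the extra uncertainty due to the missing data.
pooled_prediction = np.mean(predictions, axis=0)
between_imputation_var = np.var(predictions, axis=0, ddof=1)
```

Any complete-data learner could take the place of the random forest; the point is that each model sees a different plausible completion of the data, and the variation between fits carries the missing-data uncertainty forward.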
This paper, using simulated data sets inspired by AstraZeneca drug data, shows how multiple imputation techniques can improve the analysis of data with missing values or with uncertainty. We pay close attention to the prediction of Bayesian posterior coverage due to its importance in industrial applications. Comparisons are made to other commonly used methods of handling missing data, such as single uniform imputation and data removal. Furthermore, we review several standard multiple imputation models and compare them on our simulated data sets, and provide recommendations on when to use each technique and where extra care is needed, based on the data distributions. Finally, using simulated data, we give examples of how correct use of multiple imputation can affect investment decisions in the early stages of drug discovery.
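Posterior coverage from repeated imputations depends on how the per-imputation estimates are pooled; Rubin's rules are the standard way to combine them so that intervals retain close to nominal coverage. A minimal sketch, with invented numbers, follows.

```python
import numpy as np
from scipy import stats

def pool_rubin(estimates, variances, level=0.95):
    """Pool m point estimates and their variances via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    u_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = u_bar + (1 + 1 / m) * b         # total variance
    # Degrees of freedom for the reference t-distribution (Rubin, 1987).
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    half_width = stats.t.ppf(0.5 + level / 2, df) * np.sqrt(t)
    return q_bar, (q_bar - half_width, q_bar + half_width)

# Example: pooled estimate and 95% interval from five imputed analyses.
est, ci = pool_rubin([1.02, 0.97, 1.10, 0.95, 1.05],
                     [0.04, 0.05, 0.04, 0.06, 0.05])
```

By contrast, a single imputation or deletion of incomplete rows discards the between-imputation term, which is one reason those approaches tend to understate uncertainty and undercover.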
Analysis was performed using both Python and Stan and is provided in a Jupyter notebook.
Files (1.2 MB)

| Name | Size |
|---|---|
| notebookRowan.pdf (md5:7da6c3259381731ef3fc2e32fae32b85) | 1.2 MB |