Handling missing data, censored values and measurement error in machine learning models using multiple imputation for early stage drug discovery
Description
Multiple imputation is a technique for handling missing data, censored values and measurement error. It is currently underused in machine learning owing to a lack of familiarity and experience with the technique, while other missing-data solutions, such as full Bayesian models, can be hard to set up. However, randomization-based evaluations of Bayesianly derived repeated imputations can provide approximately valid inference for the posterior distributions and allow the use of techniques that rely on complete data, such as SVMs and random forest models.
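As a rough illustration of that workflow (not the notebook's code; the data set, column count and missingness rate below are invented), the sketch uses scikit-learn's IterativeImputer with sample_posterior=True to draw several imputed copies of a toy data set, fits a random forest to each copy, and pools the predictions across imputations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data standing in for assay measurements, with ~15% of X missing at random.
coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(200, 5))
y = X @ coef + rng.normal(scale=0.5, size=200)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.15] = np.nan

X_test = rng.normal(size=(50, 5))  # complete held-out data

m = 10  # number of imputations
predictions = []
for i in range(m):
    # sample_posterior=True draws each imputation from the approximate
    # posterior predictive of a Bayesian ridge model, so the m data sets differ.
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X_miss)
    forest = RandomForestRegressor(n_estimators=200, random_state=i)
    forest.fit(X_imp, y)
    predictions.append(forest.predict(X_test))

# Pool across imputations: the mean is the point prediction, and the spread
# between imputations reflects the extra uncertainty due to the missing data.
pooled_prediction = np.mean(predictions, axis=0)
between_imputation_var = np.var(predictions, axis=0, ddof=1)
```

Any complete-data learner could take the place of the random forest; the point is that each model sees a different plausible completion of the data, and the variation between fits carries the missing-data uncertainty forward.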
This paper, using simulated data sets inspired by AstraZeneca drug data, shows how multiple imputation techniques can improve the analysis of data with missing values or with uncertainty. We pay close attention to the prediction of Bayesian posterior coverage due to its importance in industrial applications. Comparisons are made to other commonly used methods of handling missing data, such as single uniform imputation and data removal. Furthermore, we review several standard multiple imputation models and compare them on our simulated data sets, and provide recommendations on when to use each technique and where extra care is needed, based on the data distributions. Finally, using simulated data, we give examples of how correct use of multiple imputation can affect investment decisions in the early stages of drug discovery.
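Posterior coverage from repeated imputations depends on how the per-imputation estimates are pooled; Rubin's rules are the standard way to combine them so that intervals retain close to nominal coverage. A minimal sketch, with invented numbers, follows.

```python
import numpy as np
from scipy import stats

def pool_rubin(estimates, variances, level=0.95):
    """Pool m point estimates and their variances via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    u_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = u_bar + (1 + 1 / m) * b         # total variance
    # Degrees of freedom for the reference t-distribution (Rubin, 1987).
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    half_width = stats.t.ppf(0.5 + level / 2, df) * np.sqrt(t)
    return q_bar, (q_bar - half_width, q_bar + half_width)

# Example: pooled estimate and 95% interval from five imputed analyses.
est, ci = pool_rubin([1.02, 0.97, 1.10, 0.95, 1.05],
                     [0.04, 0.05, 0.04, 0.06, 0.05])
```

By contrast, a single imputation or deletion of incomplete rows discards the between-imputation term, which is one reason those approaches tend to understate uncertainty and undercover.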
Analysis was performed using both Python and Stan and is provided in a Jupyter notebook.
Files (1.2 MB)

| Name | Size |
|---|---|
| notebookRowan.pdf (md5:7da6c3259381731ef3fc2e32fae32b85) | 1.2 MB |