Exposing the limitations of molecular machine learning with activity cliffs.
- 1. Molecular Machine Learning group, Eindhoven University of Technology, Institute for Complex Molecular Systems and Dept. Biomedical Engineering, Groene Loper 7, 5612AZ Eindhoven, Netherlands
- 2. JetBrains Research, Saint Petersburg, Russia
Description
Machine learning has become a crucial tool in drug discovery and chemistry at large, e.g., to predict molecular properties, such as bioactivity, with high levels of accuracy. However, activity cliffs – pairs of molecules that are highly similar in their structure but exhibit large differences in potency – have been underinvestigated for their effect on model performance. Not only are these edge cases informative for molecule discovery and optimization, but models that accurately predict the potency of activity cliff compounds also have greater potential for prospective applications. Our work aims to fill the current knowledge gap on best-practice machine learning methods in the presence of activity cliffs. We benchmarked more than 20 machine and deep learning approaches on curated bioactivity data from 30 macromolecular targets for their performance on activity cliff compounds. While all methods struggled in the presence of activity cliffs, machine learning approaches based on molecular descriptors outperformed more complex deep learning methods. These results advocate for (a) the inclusion of dedicated “activity-cliff-centered” metrics during model development and evaluation, and (b) the development of novel algorithms to better predict the properties of activity cliffs. To this end, the methods, metrics, and results of this study have been encapsulated into an open-access benchmarking platform named MoleculeACE (Activity Cliff Estimation, available on GitHub at: https://github.com/molML/MoleculeACE). MoleculeACE is designed to steer the community towards addressing the pressing but overlooked limitation of molecular machine learning models posed by activity cliffs.
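To make the activity cliff definition and the idea of an "activity-cliff-centered" metric concrete, the following is a minimal sketch, not the MoleculeACE implementation. It assumes fingerprints are given as sets of "on" bit indices (in practice these would come from a cheminformatics toolkit), and the thresholds (Tanimoto similarity of at least 0.9, potency difference of at least 10-fold) as well as all function names are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def find_cliff_compounds(fingerprints, potencies_nm,
                         sim_threshold=0.9, fold_threshold=10.0):
    """Return indices of molecules that take part in at least one cliff pair.

    A pair counts as a cliff if it is structurally similar (Tanimoto >=
    sim_threshold) and differs in potency by >= fold_threshold.
    Both thresholds are assumptions; potencies must be positive (e.g. nM).
    """
    cliffs = set()
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) < sim_threshold:
            continue
        high = max(potencies_nm[i], potencies_nm[j])
        low = min(potencies_nm[i], potencies_nm[j])
        if high / low >= fold_threshold:
            cliffs.update((i, j))
    return cliffs


def rmse(y_true, y_pred, indices=None):
    """RMSE over all compounds, or only over the given indices."""
    idx = range(len(y_true)) if indices is None else sorted(indices)
    squared = [(y_true[k] - y_pred[k]) ** 2 for k in idx]
    return sqrt(sum(squared) / len(squared))


# Toy data (hypothetical): molecules 0 and 1 share 19 of 21 bits,
# molecule 2 is structurally unrelated.
fps = [
    frozenset(range(20)),
    frozenset(range(19)) | {25},
    frozenset(range(100, 120)),
]
potencies_nm = [5.0, 500.0, 50.0]   # 100-fold gap between molecules 0 and 1
y_true = [8.3, 6.3, 7.3]            # corresponding pKi-like values
y_pred = [7.3, 7.3, 7.3]            # a model that predicts the mean everywhere

cliff_idx = find_cliff_compounds(fps, potencies_nm)
overall_error = rmse(y_true, y_pred)
cliff_error = rmse(y_true, y_pred, cliff_idx)
```

In this toy case the error restricted to cliff compounds is larger than the overall error, which is the kind of gap a cliff-centered metric is meant to expose alongside the usual aggregate score.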
This data deposit contains all trained models and the data used to train them. All models can be loaded with MoleculeACE and used to predict bioactivity for new molecules. Because models are target-specific, models are provided for each of the 30 data sets. Every model is accompanied by a configuration file that describes its (optimized) hyperparameters.
Files
Name | Size | Checksum |
---|---|---|
Data.zip | 6.1 GB | md5:db0b1ea1f57faf767738e1179ea6f4b2 |
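Given the size of the archive, it may be worth checking the download against the MD5 checksum listed for Data.zip before unpacking. The sketch below uses only the Python standard library; the local file name `Data.zip` in the working directory and the helper names are assumptions.

```python
import hashlib
from pathlib import Path

# Checksum as listed in the file table of this record.
EXPECTED_MD5 = "db0b1ea1f57faf767738e1179ea6f4b2"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("Data.zip")  # assumed download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
```

Reading in chunks keeps memory use constant, which matters for a 6.1 GB file.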