Open source application for small molecule pKa predictions
Authors/Creators
- 1. TU Dortmund University, Dortmund/DE
- 2. California State University, Long Beach/USA
Description
The acid-base dissociation constant (pKa) of a drug has a far-reaching influence on its pharmacokinetics because it affects solubility, membrane permeability and protein binding affinity [1,2]. To the best of our knowledge, there is no publicly available, open source and license-free pKa prediction tool that matches the quality of commercial tools. Our goal is to develop a highly accurate pKa prediction tool that is trained on a mixture of freely accessible and commercial data yet remains free for everyone to use.
To this end, we identified multiple freely available experimental pKa datasets, including data from DataWarrior, ChEMBL and various scientific publications. Additionally, we have access to several commercial datasets, for example from Novartis [3] and OpenEye [4].
We started with monoprotic molecules obtained from ChEMBL and DataWarrior to evaluate the dataset quality and our modelling concepts. After preprocessing, 5994 unique structures with a pKa value between 2 and 12 were used for machine learning. We tested seven machine learning configurations, built on four basic regressors, in combination with six descriptor/fingerprint sets, resulting in a total of 42 trained and 5-fold cross-validated models. Additionally, we evaluated the models on two external test datasets. The results were published in March 2020 [5]. Furthermore, we investigated how Graph Convolutional Networks and QM-based approaches can be used to further improve prediction quality.
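The experimental layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the published pipeline: the configuration and feature-set labels are hypothetical placeholders (the abstract gives only their counts), the pKa range is assumed to be inclusive, and the fold splitter is a generic shuffled 5-fold scheme.

```python
from itertools import product
import random

# Hypothetical placeholder labels: the abstract states only the counts
# (7 machine learning configurations, 6 descriptor/fingerprint sets).
CONFIGS = [f"config_{i}" for i in range(1, 8)]
FEATURE_SETS = [f"features_{j}" for j in range(1, 7)]

# Every configuration is paired with every feature set: 7 * 6 = 42 models.
model_grid = list(product(CONFIGS, FEATURE_SETS))


def filter_pka_range(records, low=2.0, high=12.0):
    """Keep (structure, pKa) records with low <= pKa <= high.

    Inclusive bounds are an assumption; the abstract says "between 2 and 12".
    """
    return [(s, p) for s, p in records if low <= p <= high]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for shuffled k-fold CV."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

With 5994 structures, each of the five test folds in this sketch holds 1198 or 1199 molecules, and every molecule is used for testing exactly once.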
To predict the pKa values of multiprotic molecules, two major problems had to be solved: the localization of the titratable groups without licensed software, and the once-only assignment of the experimental values to the corresponding groups for all datasets. For the localization part, we evaluated the results of Marvin [6] and Dimorphite-DL [7] to compile a list of 24 SMARTS patterns that capture almost 90% of all groups in our combined dataset of over 17000 unique molecules. Finally, the Marvin [6] predictions were used to assign the experimental values to the corresponding groups while removing outliers. The resulting dataset can serve as a starting point for machine learning in a subsequent step.
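One way to picture the assignment step is a greedy nearest-value matching, sketched below. This is a hypothetical illustration and not necessarily the procedure used in this work: the function name `assign_once`, the group labels, and the 2.0 pKa-unit outlier threshold are all assumptions, and the predicted values stand in for per-group predictions from a tool such as Marvin.

```python
def assign_once(experimental, predicted, max_diff=2.0):
    """Greedily assign each experimental pKa to at most one titratable group.

    experimental: list of measured pKa values for one molecule
    predicted:    dict mapping a group label to its predicted pKa
    max_diff:     assumed outlier threshold in pKa units (hypothetical)

    Returns (assigned, outliers): a dict of group label -> experimental value,
    and the list of experimental values left unassigned.
    """
    # All candidate pairs, closest experimental/predicted match first.
    pairs = sorted(
        (abs(e - p), ei, group)
        for ei, e in enumerate(experimental)
        for group, p in predicted.items()
    )
    used, assigned = set(), {}
    for diff, ei, group in pairs:
        if diff > max_diff:
            break  # every remaining pair is even further apart
        if ei in used or group in assigned:
            continue  # each value and each group is used once only
        assigned[group] = experimental[ei]
        used.add(ei)
    outliers = [e for ei, e in enumerate(experimental) if ei not in used]
    return assigned, outliers


# Example with made-up numbers: two values match a group, one is dropped.
assigned, outliers = assign_once(
    [4.2, 9.8, 15.0], {"carboxylic_acid": 4.0, "amine": 9.5}
)
```

A greedy match is simple but not guaranteed to be globally optimal; an optimal one-to-one assignment (for example via the Hungarian algorithm) would be a natural alternative when several predicted values lie close together.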
All data and code can be found at https://github.com/czodrowskilab
[1] Charifson, P. S., & Walters, W. P. (2014). Acidic and Basic Drugs in Medicinal Chemistry: A Perspective. Journal of Medicinal Chemistry
[2] Manallack, D. T. (2007). The pKa Distribution of Drugs: Application to Drug Discovery. Perspectives in Medicinal Chemistry
[3] Richard A. Lewis, Stephane Rodde, Novartis Pharma AG, Basel, Switzerland
[4] pKa COMPLETE_DATABASE v1.13: OpenEye Scientific Software, Santa Fe, NM.
[5] Baltruschat M and Czodrowski P. Machine learning meets pKa [version 2; peer review: 2 approved]. F1000Research 2020, 9(Chem Inf Sci):113
[6] Marvin 20.1.0, 2020, ChemAxon Ltd, http://www.chemaxon.com
[7] Ropp PJ, Kaminsky JC, Yablonski S, Durrant JD (2019) Dimorphite-DL: An open-source program for enumerating the ionization states of drug-like small molecules. Journal of Cheminformatics
Files
GCC_2020_Baltruschat_pKa-predictions_slides.pdf
Total size: 83.1 MB
| MD5 checksum | Size |
|---|---|
| 3b119309fade9e49d03930ec0d7b612c | 2.9 MB |
| e86a35559147df8f6e47a12581275814 | 80.3 MB |
Additional details
Related works
- Continues
- Journal article: 10.12688/f1000research.22090.2 (DOI)