Dataset Open Access

Industry-scale Application and Evaluation of Deep Learning for Drug Target Prediction

Noé Sturm; Andreas Mayr; Thanh Le Van; Vladimir Chupakhin; Hugo Ceulemans; Joerg Wegner; Jose-Felipe Golib-Dzib; Nina Jeliazkova; Yves Vandriessche; Stanislav Bohm; Vojtech Cima; Jan Martinovic; Nigel Greene; Tom Vander Aa; Thomas J. Ashby; Sepp Hochreiter; Ola Engkvist; Günter Klambauer; Hongming Chen

Artificial intelligence (AI) is undergoing a revolution thanks to breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent work on publicly available pharmaceutical data showed that AI methods are highly promising for drug target prediction. However, the quality of public data might differ from that of industry data because of different labs reporting measurements, different measurement techniques, fewer samples, and less diverse and specialized assays. As part of a European-funded project (ExCAPE) that brought together expertise from the pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning models outperformed comparable models trained with other machine learning algorithms when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study to evaluate the potential of machine learning, and especially deep learning, directly in industry-scale settings, and to investigate the transferability of target prediction models learned on public data to industrial bioactivity prediction pipelines.

Dataset Format to reproduce manuscript results
Files (1.2 GB)
Name                            Size        MD5
activities.txt.gz               190.9 MB    8dc6e1b81c1ef73c577d7434d9cc0132
activities_levels.txt.gz        189.9 MB    19410b01e9c5da2fbb77cd6449f2f5d5
clustering.txt.gz               11.9 MB     4e428221f720322083c24048b6fa4c34
ecfp6_counts.txt.gz             405.7 MB    e26b290fbbd31d4ab27485ced96855e5
ecfp6_counts_var005.txt.gz      166.5 MB    b1f33641eda1a79aed6824c1fed0f6d3
ecfp6_folded.txt.gz             92.7 MB     0848288b6f097c7ffd87dc54cfb2bad4
excapedb_compound_info.txt.gz   81.0 MB     a82bfca83dbf239b77bda17a9c638e7f
excapeml_compound_info.txt.gz   78.1 MB     69264fecb0318e3643f0330c93cbf5d5
folds.txt.gz                    306.3 kB    4d9f73df78012ac6a61d7219d35d5a4b
protein_descriptors.txt.gz      361.1 kB    4b309ea5ba664a1c725cdbe8c91b10a9
README                          902 Bytes   da5357d64be5e4f78d407c8b5ad417f3
successful_samples.txt.gz       16.5 MB     1ee902d003e3856282172971043407f3
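
Because the listing above publishes an MD5 digest for every file, a download can be checked for corruption before use. The following is a minimal sketch of such a check, not a tool shipped with the dataset: the function names `md5_of_file` and `verify_downloads` are hypothetical, and it assumes the files were saved under their listed names in a single directory. The digests are copied from the listing.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "activities.txt.gz": "8dc6e1b81c1ef73c577d7434d9cc0132",
    "activities_levels.txt.gz": "19410b01e9c5da2fbb77cd6449f2f5d5",
    "clustering.txt.gz": "4e428221f720322083c24048b6fa4c34",
    "ecfp6_counts.txt.gz": "e26b290fbbd31d4ab27485ced96855e5",
    "ecfp6_counts_var005.txt.gz": "b1f33641eda1a79aed6824c1fed0f6d3",
    "ecfp6_folded.txt.gz": "0848288b6f097c7ffd87dc54cfb2bad4",
    "excapedb_compound_info.txt.gz": "a82bfca83dbf239b77bda17a9c638e7f",
    "excapeml_compound_info.txt.gz": "69264fecb0318e3643f0330c93cbf5d5",
    "folds.txt.gz": "4d9f73df78012ac6a61d7219d35d5a4b",
    "protein_descriptors.txt.gz": "4b309ea5ba664a1c725cdbe8c91b10a9",
    "README": "da5357d64be5e4f78d407c8b5ad417f3",
    "successful_samples.txt.gz": "1ee902d003e3856282172971043407f3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, wanted in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == wanted:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the downloaded files sit in the current working directory.
    for name, status in verify_downloads(".").items():
        print(f"{name}: {status}")
```

Reading in fixed-size chunks keeps memory use flat even for the largest archive (405.7 MB). A file reported as "mismatch" is most likely an incomplete download and should be fetched again.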
