Towards Foundational Models for Molecular Learning on Large Scale Multi-Task Datasets
Authors/Creators
- Beaini, Dominique
- Huang, Shenyang
- Cunha, Joao Alex
- Li, Zhiyi
- Moisescu-Pareja, Gabriela
- Dymov, Oleksandr
- Maddrell-Mander, Samuel
- McLean, Callum
- Parviz, Ali
- Müller, Luis
- Mohamud, Jama Hussein
- Wenkel, Frederik
- Craig, Michael
- Koziarski, Michał
- Lu, Jiarui
- Zhu, Zhaocheng
- Gabellini, Cristian
- Rabusseau, Guillaume
- Rabbany, Reihaneh
- Tang, Jian
- Morris, Christopher
- Ravanelli, Mirco
- Wolf, Guy
- Tossou, Prudencio
- Mary, Hadrien
- Banaszewski, Blazej
- Martin, Chad
- Masters, Dominic
Description
Pretraining foundation models that adapt to a wide range of molecular tasks has long been pursued by the drug discovery community. While self-supervised learning methods have been developed to leverage the sheer number of unlabeled molecules for pretraining, supervised learning remains largely underexplored due to the absence of suitable datasets and codebases. To facilitate the study of supervised learning on molecules, we curate 7 datasets with node- and graph-level supervision, and develop a library for studying multi-task learning models. The datasets are separated into 2 categories. First, the Toy-mix category contains 3 small datasets that are well known and well studied in the literature, with the additional constraint that they must be used in a multi-task setting. Second, the Large-mix category contains 4 large datasets that, together, contain tens of billions of graph-level data points and tens of billions of node-level data points associated with 100M unique molecules, representing orders of magnitude more data than other 2D-GNN datasets. Since molecular tasks are distributed across multiple levels, we design our library to explicitly consider multi-tasking alongside multi-level representations, backed by a large collection of models and features for different levels. With a library designed to accompany the datasets, we hope to accelerate the development of foundational models for molecules.
Files
| Name | Size | MD5 |
|---|---|---|
| tox21.zip | 96.5 kB | 5711017427ff253a45c589e10a6d4f37 |
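The listing gives an MD5 checksum for tox21.zip, which can be used to confirm that a downloaded copy is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_expected` and the local path `tox21.zip` are assumptions made for illustration, not part of the record or its accompanying library.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing of this record.
EXPECTED_MD5 = "5711017427ff253a45c589e10a6d4f37"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest against the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("tox21.zip")  # hypothetical local download location
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum mismatch")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.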