Published June 2, 2023 | Version v1
Preprint | Open Access

Towards Foundational Models for Molecular Learning on Large Scale Multi-Task Datasets

Description

Pretraining foundation models that adapt to a wide range of molecular tasks has long been pursued by the drug discovery community. While self-supervised learning methods have been developed to leverage the sheer number of unlabeled molecules for pretraining, the landscape of supervised learning remains largely underexplored due to the absence of suitable datasets and codebases. To facilitate the study of supervised learning on molecules, we curate 7 datasets with node- and graph-level supervision and develop a library for studying multi-task learning models. The datasets are separated into 2 categories. First, the Toy-mix category contains 3 small datasets that are well known and well studied in the literature, with the additional constraint that they must be used in a multi-task setting. Second, the Large-mix category contains 4 large datasets that together comprise tens of billions of graph-level data points and tens of billions of node-level data points associated with 100M unique molecules, representing orders of magnitude more data than other 2D-GNN datasets. Since molecular tasks are distributed across multiple levels, we design our library to explicitly support multi-tasking alongside multi-level representations, backed by a large collection of models and features for the different levels. With this library designed to accompany the datasets, we hope to accelerate the development of foundational models for molecules.
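The core idea of multi-level supervision can be illustrated with a minimal sketch: after message passing over a molecular graph, a node-level head predicts one target per atom while a graph-level head pools the node states into one target per molecule. This is a pure-Python toy with scalar features; the function names (`message_pass`, `node_head`, `graph_head`) are hypothetical and do not correspond to the library's actual API.

```python
def message_pass(features, edges):
    """One round of mean-aggregation message passing over an undirected graph."""
    n = len(features)
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    out = []
    for i in range(n):
        msgs = [features[j] for j in neighbors[i]] or [0.0]
        # Blend each node's own state with the mean of its neighbors' states.
        out.append(0.5 * features[i] + 0.5 * sum(msgs) / len(msgs))
    return out

def node_head(h):
    # Hypothetical node-level task head: one scalar prediction per atom.
    return [2.0 * x for x in h]

def graph_head(h):
    # Hypothetical graph-level task head: mean-pool node states, then predict
    # one scalar for the whole molecule.
    pooled = sum(h) / len(h)
    return 3.0 * pooled

# Toy "molecule": a 3-atom path graph with scalar node features.
feats = [1.0, 0.0, 1.0]
edges = [(0, 1), (1, 2)]

h = message_pass(feats, edges)
node_preds = node_head(h)   # node-level supervision: one value per atom
graph_pred = graph_head(h)  # graph-level supervision: one value per molecule
```

In a multi-task setting, several such heads (one per task, at each level) would share the message-passing trunk, so node- and graph-level labels jointly shape the learned representations.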

Files

tox21.zip (96.5 kB)
md5:5711017427ff253a45c589e10a6d4f37