Published June 2, 2023 | Version v1
Preprint | Open Access

Towards Foundational Models for Molecular Learning on Large Scale Multi-Task Datasets

Description

Pretraining foundation models that adapt to a wide range of molecular tasks has long been pursued by the drug discovery community. While self-supervised learning methods have been developed to leverage the sheer number of unlabeled molecules for pretraining, the landscape of supervised learning remains largely underexplored due to the absence of suitable datasets and codebases. To facilitate the study of supervised learning on molecules, we curate 7 datasets with node- and graph-level supervision and develop a library for studying multi-task learning models. The datasets are separated into 2 categories. First, the Toy-mix category contains 3 small datasets that are well known and well studied in the literature, with the additional constraint that they must be used in a multi-task setting. Second, the Large-mix category contains 4 large datasets that together comprise tens of billions of graph-level data points and tens of billions of node-level data points associated with 100M unique molecules, representing orders of magnitude more data than other 2D-GNN datasets. Since molecular tasks are distributed across multiple levels, we design our library to explicitly support multi-tasking alongside multi-level representations, backed by a large collection of models and features for the different levels. With this library designed to accompany the datasets, we hope to accelerate the development of foundational models for molecules.
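The core idea of multi-level supervision can be illustrated with a minimal sketch: after message passing over a molecular graph, a node-level head predicts one target per atom while a graph-level head pools the node states into one target per molecule. This is a pure-Python toy with scalar features; the function names (`message_pass`, `node_head`, `graph_head`) are hypothetical and do not correspond to the library's actual API.

```python
def message_pass(features, edges):
    """One round of mean-aggregation message passing over an undirected graph."""
    n = len(features)
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    out = []
    for i in range(n):
        msgs = [features[j] for j in neighbors[i]] or [0.0]
        # Blend each node's own state with the mean of its neighbors' states.
        out.append(0.5 * features[i] + 0.5 * sum(msgs) / len(msgs))
    return out

def node_head(h):
    # Hypothetical node-level task head: one scalar prediction per atom.
    return [2.0 * x for x in h]

def graph_head(h):
    # Hypothetical graph-level task head: mean-pool node states, then predict
    # one scalar for the whole molecule.
    pooled = sum(h) / len(h)
    return 3.0 * pooled

# Toy "molecule": a 3-atom path graph with scalar node features.
feats = [1.0, 0.0, 1.0]
edges = [(0, 1), (1, 2)]

h = message_pass(feats, edges)
node_preds = node_head(h)   # node-level supervision: one value per atom
graph_pred = graph_head(h)  # graph-level supervision: one value per molecule
```

In a multi-task setting, several such heads (one per task, at each level) would share the message-passing trunk, so node- and graph-level labels jointly shape the learned representations.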

Files

tox21.zip (96.5 kB)
md5:5711017427ff253a45c589e10a6d4f37