Published May 24, 2023 | Version 1.0.0
Dataset Open

MISATO - Machine learning dataset for structure-based drug discovery

  • 1. Helmholtz Munich, Molecular Targets and Therapeutics Center, Institute of Structural Biology, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany.
  • 2. Forschungszentrum Jülich, Jülich Supercomputing Centre, Jülich, Germany.
  • 3. Helmholtz AI, Helmholtz Zentrum München, Neuherberg, Germany.
  • 4. Helmholtz Munich, Computational Health Center, Institute of Computational Biology, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany.


Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of almost 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine the protonation states of proteins and small-molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.


Funding: BMWi ZIM, KK 5197901TS0; BMBF, SUPREME, 031L0268.



Files (193.2 GB)

Size
5.7 GB
132.8 GB
54.3 GB
343.1 MB
8.1 kB
68.8 kB
8.0 kB