Published June 24, 2022 | Version 1.0
Dataset Open

XTB2-MolData : Dataset of 12 Million Molecules

  • 1. Univ. Lyon, Université Claude Bernard Lyon 1, CNRS, Institut Lumière Matière, UMR5306, F-69622 Villeurbanne, France
  • 2. Univ. Lyon, Université Claude Bernard Lyon 1, CNRS, Institut Lumière Matière, UMR5306, F-69622 Villeurbanne, France

Description

This dataset is an open chemistry database containing optimized molecular geometries and electronic properties calculated with the GFN2-xTB method (C. Bannwarth et al.) for 12.6 million organic molecules containing C, H, O, and N atoms.

The initial geometries, before optimization with the GFN2-xTB method, were taken from the PubChem PM6 database (Shimazaki et al.).

We also include our Python code for managing a large molecule database. It includes scripts to generate input files for the Gaussian software, to read Gaussian output files, and to create a small reduced dataset based on a clustering algorithm, as well as many scripts to analyze the molecular properties included in the database.

The code is also available on GitHub: https://github.com/Castaneche/MolDataFW
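To illustrate what generating a Gaussian input file involves, here is a minimal sketch. It is not taken from the MolDataFW repository: the function name `gaussian_input`, the default route line `# PM6 Opt`, and the example water geometry are all assumptions chosen for illustration. It only shows the general layout of a Gaussian input (route section, title, charge and multiplicity, Cartesian coordinates, terminating blank line).

```python
from typing import Iterable, Tuple

# (element symbol, x, y, z) with coordinates in angstroms
Atom = Tuple[str, float, float, float]


def gaussian_input(
    atoms: Iterable[Atom],
    route: str = "# PM6 Opt",   # assumed route line; replace with the method you need
    title: str = "molecule",
    charge: int = 0,
    multiplicity: int = 1,
) -> str:
    """Return the text of a minimal Gaussian input file (hypothetical helper)."""
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2s} {x:14.8f} {y:14.8f} {z:14.8f}")
    lines.append("")  # blank line that terminates the geometry block
    return "\n".join(lines) + "\n"


# Illustrative geometry only (approximate water molecule), not data from the dataset
water = [
    ("O", 0.000, 0.000, 0.117),
    ("H", 0.000, 0.757, -0.469),
    ("H", 0.000, -0.757, -0.469),
]
example_text = gaussian_input(water, title="water")
```

Writing `example_text` to a file such as `water.gjf` would give an input with the usual section order. The scripts distributed with the dataset may use a different layout and different keywords.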

Files

MolData_XTB2_V1.zip (34.2 GB)
md5:9adfae897fc6c9c7869b2e7193124d15
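Given the size of the archive, it may be worth checking the download against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library. The function names `file_md5` and `verify_archive` are hypothetical, and the file is read in chunks so that the 34.2 GB archive never has to fit in memory.

```python
import hashlib

# Checksum published on this record for MolData_XTB2_V1.zip
EXPECTED_MD5 = "9adfae897fc6c9c7869b2e7193124d15"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected checksum."""
    return file_md5(path) == expected.lower()
```

Assuming the archive was saved in the current directory, `verify_archive("MolData_XTB2_V1.zip")` should return `True` for an intact download. A `False` result suggests the file is incomplete or corrupted and should be downloaded again.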

Additional details

References

  • Tomomi Shimazaki, Masatomo Hashimoto, and Toshiyuki Maeda, J. Chem. Inf. Model. 2020, 60, 12, 5891–5899. https://doi.org/10.1021/acs.jcim.0c00740
  • C. Bannwarth, E. Caldeweyher, S. Ehlert, A. Hansen, P. Pracht, J. Seibert, S. Spicher, and S. Grimme, WIREs Comput. Mol. Sci. 2020, 11, e01493. https://doi.org/10.1002/wcms.1493