Published 2025 | Version 1.0.0
Dataset Open

Data from: The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations

  • 1. Google (Switzerland)
  • 2. Google (Germany)
  • 3. Technische Universität Berlin
  • 4. Korea University
  • 5. Max Planck Institute for Informatics

Description

Machine learning (ML) methods enable prediction of the properties of chemical structures without computationally expensive ab initio calculations. The quality of such predictions depends on the reference data used to train the model. In this work, we introduce the QCML dataset: a comprehensive dataset for training ML models for quantum chemistry. The QCML dataset systematically covers chemical space with small molecules consisting of up to 8 heavy atoms and includes elements from a large fraction of the periodic table, as well as different electronic states. Starting from chemical graphs, conformer search and normal mode sampling are used to generate both equilibrium and off-equilibrium 3D structures, for which various properties are calculated with semi-empirical methods (14.7 billion entries) and density functional theory (33.5 million entries). The covered properties include energies, forces, multipole moments, and other quantities such as Kohn-Sham matrices. We provide a first demonstration of the utility of our dataset by training ML-based force fields on the data and applying them to run molecular dynamics simulations.

The data is available as a TensorFlow Datasets (TFDS) dataset and can be accessed from the public Google Cloud Storage bucket at gs://qcml-datasets/tfds/. (See "Directory structure" below.)

For information on the different access options (command-line tools, client libraries, etc.), please see https://cloud.google.com/storage/docs/access-public-data
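
As a rough illustration of the TFDS access path, the sketch below loads a single builder config directly from the bucket. It assumes the `tensorflow_datasets` package is installed and that the split is named `full`, which is inferred from the shard file prefix `qcml-full` rather than stated explicitly in this record. The helper names `tfds_name` and `load_config` are illustrative and not part of any library.

```python
DATA_DIR = "gs://qcml-datasets/tfds"  # TFDS data directory given in this record
DATASET = "qcml"                      # TFDS dataset name


def tfds_name(config: str) -> str:
    """Return the 'dataset/config' string TFDS uses to select a builder config."""
    return f"{DATASET}/{config}"


def load_config(config: str, split: str = "full"):
    """Load one builder config from the public bucket as a tf.data.Dataset.

    Assumption: the split is called 'full' (inferred from the shard file
    prefix 'qcml-full'). Requires tensorflow_datasets and network access.
    """
    import tensorflow_datasets as tfds  # imported lazily; only needed here

    return tfds.load(tfds_name(config), split=split, data_dir=DATA_DIR)


if __name__ == "__main__":
    # Start with a small config (a few GB) before touching the TB-scale ones.
    ds = load_config("dft_atomic_numbers")
    for example in ds.take(1):
        print(example)
```

Reading straight from the bucket avoids a local copy, which matters for the configs that are tens of terabytes in size.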

Directory structure

  • gs://qcml-datasets (GCS Bucket)
    • tfds (TFDS data directory)
      • qcml (TFDS dataset name)
        • dft_atomic_numbers (TFDS builder config name)
          • 1.0.0 (Current version)
            • dataset_info.json
            • features.json
            • qcml-full.tfrecord-X-of-Y (TFDS data shards, see below)
        • ...
        • dft_positions
        • xtb_all
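
The layout above can be turned into paths programmatically. The following is a minimal sketch, not an official tool: `config_dir` and `parse_shard_name` are hypothetical helpers, and the zero-padded shard name used in the usage lines is an assumed example of the `X-of-Y` pattern, not a file name taken from this record.

```python
import re

BUCKET = "gs://qcml-datasets"  # GCS bucket
VERSION = "1.0.0"              # current version listed in the directory structure


def config_dir(config: str, version: str = VERSION) -> str:
    """Build the directory holding dataset_info.json, features.json and the shards."""
    return f"{BUCKET}/tfds/qcml/{config}/{version}"


# Matches shard names of the form qcml-full.tfrecord-X-of-Y.
_SHARD_RE = re.compile(r"^qcml-full\.tfrecord-(\d+)-of-(\d+)$")


def parse_shard_name(filename: str) -> tuple[int, int]:
    """Return (shard index X, total shard count Y) parsed from a shard file name."""
    match = _SHARD_RE.match(filename)
    if match is None:
        raise ValueError(f"not a QCML shard file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(config_dir("dft_positions"))
    # Assumed example name; the actual zero padding may differ.
    print(parse_shard_name("qcml-full.tfrecord-00003-of-00011"))
```

A directory built this way can also be passed to `tfds.builder_from_directory` if loading by path is preferred over loading by dataset name.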

Builder configurations

Format: Builder config name: number of shards (rounded total size)

Semi-empirical calculations:

  • xtb_all: 85000 (69 TB)

DFT calculations:

  • dft_atomic_numbers: 11 (3 GB)
  • dft_d4_atomic_charges: 11 (4 GB)
  • dft_d4_c6_coefficients: 11 (4 GB)
  • dft_d4_correction: 11 (8 GB)
  • dft_d4_energy: 11 (2 GB)
  • dft_d4_forces: 11 (7 GB)
  • dft_d4_polarizabilities: 11 (4 GB)
  • dft_force_field: 11 (18 GB)
  • dft_force_field_d4: 110 (24 GB)
  • dft_force_field_mbd: 110 (24 GB)
  • dft_gfn0_dipole: 11 (3 GB)
  • dft_gfn0_eeq_charges: 11 (4 GB)
  • dft_gfn0_energy: 11 (2 GB)
  • dft_gfn0_forces: 11 (7 GB)
  • dft_gfn0_formation_energy: 11 (3 GB)
  • dft_gfn0_orbital_energies_a: 11 (8 GB)
  • dft_gfn0_orbital_occupations_a: 11 (8 GB)
  • dft_gfn0_wiberg_bond_orders: 110 (29 GB)
  • dft_gfn2_dipole: 11 (3 GB)
  • dft_gfn2_energy: 11 (2 GB)
  • dft_gfn2_forces: 11 (7 GB)
  • dft_gfn2_formation_energy: 11 (3 GB)
  • dft_gfn2_mulliken_charges: 11 (4 GB)
  • dft_gfn2_orbital_energies_a: 11 (7 GB)
  • dft_gfn2_orbital_occupations_a: 11 (7 GB)
  • dft_gfn2_wiberg_bond_orders: 110 (29 GB)
  • dft_is_outlier: 11 (2 GB)
  • dft_mbd_c6_coefficients: 11 (4 GB)
  • dft_mbd_correction: 11 (8 GB)
  • dft_mbd_energy: 11 (2 GB)
  • dft_mbd_forces: 11 (7 GB)
  • dft_mbd_polarizabilities: 11 (4 GB)
  • dft_metadata: 11 (11 GB)
  • dft_multipole_moments: 11 (8 GB)
  • dft_pbe0_core_hamiltonian_matrix: 110000 (30 TB)
  • dft_pbe0_density_matrix_a: 110000 (30 TB)
  • dft_pbe0_density_matrix_b: 110000 (3 TB)
  • dft_pbe0_dipole: 11 (3 GB)
  • dft_pbe0_electronic_free_energy: 11 (3 GB)
  • dft_pbe0_energy: 11 (2 GB)
  • dft_pbe0_forces: 11 (7 GB)
  • dft_pbe0_formation_energy: 11 (3 GB)
  • dft_pbe0_grid_density_a: 110000 (27 TB)
  • dft_pbe0_grid_density_b: 110000 (3 TB)
  • dft_pbe0_grid_density_gradient_a: 110000 (81 TB)
  • dft_pbe0_grid_density_gradient_b: 110000 (10 TB)
  • dft_pbe0_grid_density_laplacian_a: 110000 (27 TB)
  • dft_pbe0_grid_density_laplacian_b: 110000 (3 TB)
  • dft_pbe0_grid_kinetic_energy_density_a: 110000 (27 TB)
  • dft_pbe0_grid_kinetic_energy_density_b: 110000 (3 TB)
  • dft_pbe0_grid_points: 110000 (81 TB)
  • dft_pbe0_grid_weight: 110000 (27 TB)
  • dft_pbe0_guid: 11 (3 GB)
  • dft_pbe0_hamiltonian_matrix_a: 110000 (30 TB)
  • dft_pbe0_hamiltonian_matrix_b: 110000 (3 TB)
  • dft_pbe0_has_equal_a_b_electrons: 11 (3 GB)
  • dft_pbe0_hexadecapole: 11 (3 GB)
  • dft_pbe0_hirshfeld_charges: 11 (4 GB)
  • dft_pbe0_hirshfeld_dipoles: 11 (8 GB)
  • dft_pbe0_hirshfeld_quadrupoles: 11 (11 GB)
  • dft_pbe0_hirshfeld_spins: 11 (3 GB)
  • dft_pbe0_hirshfeld_volume_ratios: 11 (4 GB)
  • dft_pbe0_hirshfeld_volumes: 11 (4 GB)
  • dft_pbe0_loewdin_charges: 11 (4 GB)
  • dft_pbe0_loewdin_spins: 11 (3 GB)
  • dft_pbe0_mulliken_charges: 11 (4 GB)
  • dft_pbe0_mulliken_spins: 11 (3 GB)
  • dft_pbe0_num_scf_iterations: 11 (3 GB)
  • dft_pbe0_octupole: 11 (3 GB)
  • dft_pbe0_orbital_coefficients_a: 110000 (30 TB)
  • dft_pbe0_orbital_coefficients_b: 110000 (3 TB)
  • dft_pbe0_orbital_energies_a: 110 (44 GB)
  • dft_pbe0_orbital_energies_b: 11 (8 GB)
  • dft_pbe0_orbital_occupations_a: 110 (44 GB)
  • dft_pbe0_orbital_occupations_b: 11 (8 GB)
  • dft_pbe0_overlap_matrix: 110000 (30 TB)
  • dft_pbe0_quadrupole: 11 (3 GB)
  • dft_pbe0_zero_broadening_corrected_energy: 11 (3 GB)
  • dft_population_analysis: 11 (19 GB)
  • dft_positions: 11 (7 GB)
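
Because the configs range from a few gigabytes to tens of terabytes, it can help to estimate the footprint of a selection before downloading. The sketch below parses lines in the listing format above and sums their sizes. It assumes decimal units (1 TB = 1000 GB), and since the listed sizes are rounded, the result is only an estimate. The function names are illustrative.

```python
import re

# Parses "name: shards (size unit)", with an optional leading bullet.
_LINE_RE = re.compile(r"^\s*(?:•\s*)?(\w+):\s*(\d+)\s*\((\d+)\s*(GB|TB)\)\s*$")
_GB_PER_UNIT = {"GB": 1, "TB": 1000}  # assumption: decimal units


def parse_config_line(line: str) -> tuple[str, int, int]:
    """Return (config name, number of shards, rounded size in GB) for one line."""
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized listing line: {line!r}")
    name, shards, size, unit = match.groups()
    return name, int(shards), int(size) * _GB_PER_UNIT[unit]


def total_size_gb(lines: list[str]) -> int:
    """Sum the rounded sizes (in GB) of the given listing lines."""
    return sum(parse_config_line(line)[2] for line in lines)


if __name__ == "__main__":
    # A selection that might be used for force-field training.
    selection = [
        "dft_atomic_numbers: 11 (3 GB)",
        "dft_positions: 11 (7 GB)",
        "dft_pbe0_energy: 11 (2 GB)",
        "dft_pbe0_forces: 11 (7 GB)",
    ]
    print(f"approx. {total_size_gb(selection)} GB")
```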

Files

Total size: 10.1 kB

  • md5:2e89b6fb506841b19bf115c6f260f66e (3.4 kB)
  • md5:8456b798d14ecc0d4ef84c9101918a0b (6.7 kB)

Additional details

Dates

Created: 2024