
Published July 14, 2023 | Version 1.0.0
Dataset Open

choderalab/download-qca-datasets: QM datasets deposited in QCArchive to train espaloma-0.3

Description

This is a collection of QM datasets deposited in the QCArchive and used to train espaloma-0.3. A chemically diverse set of molecules, including small molecules, peptides, and nucleic acids, was downloaded, and their QM potential energies, forces, coordinates, atomic numbers, and canonical isomeric explicit-hydrogen mapped SMILES are stored as HDF5 files. All QM data were computed at the B3LYP-D3BJ/DZVP level of theory. Note that these datasets were downloaded from the legacy QCArchive server. The QM datasets are described in the following publication:

Kenichiro Takaba, Iván Pulido, Pavan Kumar Behara, Mike Henry, Hugo MacDermott Opeskin, John D. Chodera, Yuanqing Wang. "Machine-learned molecular mechanics force field for the simulation of protein-ligand systems and beyond" (arXiv:2307.07085)

Each HDF5 file is structured as follows.

  • There is one top-level group for each unique molecule, keyed by a molecular ID, an amino acid sequence, or a SMILES string.
  • Each group contains the following datasets. N is the number of atoms in the molecule and M is the number of conformations.
    • subset: The name of the data subset the molecule is from.
    • smiles: The canonical SMILES string for the molecule. It includes explicit hydrogens and atom indices.
    • atomic_numbers: Array of length N containing the atomic number of every atom. They are ordered following the indices in the SMILES string.
    • conformations: Array of shape (M, N, 3) containing the atomic coordinates for every conformation.
    • dft_total_energy: Array of length M containing the energy of each conformation, without the dispersion correction. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
    • dft_total_gradient: Array of shape (M, N, 3) containing the gradient of the energy with respect to the atomic coordinates, without the dispersion correction. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
    • dispersion_correction_energy: Array of length M containing the dispersion correction energy of each conformation. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
    • dispersion_correction_gradient: Array of shape (M, N, 3) containing the gradient of the dispersion correction energy with respect to the atomic coordinates. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
    • total_energy: Array of length M containing the energy of each conformation, including the dispersion correction. This is only included in the OPTIMIZATION and TORSIONDRIVE datasets.
    • total_gradient: Array of shape (M, N, 3) containing the gradient of the energy, including the dispersion correction, with respect to the atomic coordinates. This is only included in the OPTIMIZATION and TORSIONDRIVE datasets.
    • rna_type: Category name of the RNA molecule (trinucleotide, base pair, or base triple). This is only included in RNA-DIVERSE-OPENFF-DEFAULT.hdf5.
  • All values are in atomic units. Distances are in bohr and energies in hartree.
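
The layout above can be read with h5py roughly as follows. This is a minimal sketch, not code from the dataset authors. It assumes that each top-level key maps directly to a group holding the listed datasets, and that the dispersion-inclusive energy of the non-OPTIMIZATION/TORSIONDRIVE files can be recovered by adding dft_total_energy and dispersion_correction_energy (likewise for the gradients). The function names and the conversion constants (approximate CODATA 2018 values) are illustrative additions.

```python
import numpy as np

# Approximate CODATA 2018 conversion factors (not part of the dataset).
BOHR_TO_ANGSTROM = 0.529177210903
HARTREE_TO_KCAL_PER_MOL = 627.5094740631


def total_energy_and_gradient(group):
    """Return (energy, gradient) including dispersion, in atomic units.

    `group` is any mapping from dataset name to array, such as an
    h5py.Group or a plain dict of NumPy arrays.
    """
    if "total_energy" in group:
        # OPTIMIZATION / TORSIONDRIVE layout: dispersion already included.
        energy = np.asarray(group["total_energy"])
        gradient = np.asarray(group["total_gradient"])
    else:
        # Other files: assumed to be the sum of the DFT and dispersion parts.
        energy = (np.asarray(group["dft_total_energy"])
                  + np.asarray(group["dispersion_correction_energy"]))
        gradient = (np.asarray(group["dft_total_gradient"])
                    + np.asarray(group["dispersion_correction_gradient"]))
    return energy, gradient


def to_kcal_mol_angstrom(energy, gradient, conformations):
    """Convert hartree/bohr arrays to kcal/mol, kcal/mol/angstrom, angstrom."""
    energy_kcal = energy * HARTREE_TO_KCAL_PER_MOL
    # Forces are the negative gradient of the energy.
    forces = -gradient * (HARTREE_TO_KCAL_PER_MOL / BOHR_TO_ANGSTROM)
    coords = conformations * BOHR_TO_ANGSTROM
    return energy_kcal, forces, coords


def iter_molecules(path):
    """Yield (key, conformations, energy, gradient) for each top-level group."""
    import h5py  # imported here so the helpers above work without h5py

    with h5py.File(path, "r") as handle:
        for key, group in handle.items():
            energy, gradient = total_energy_and_gradient(group)
            yield key, np.asarray(group["conformations"]), energy, gradient
```

For example, looping over `iter_molecules("RNA-DIVERSE-OPENFF-DEFAULT.hdf5")` and passing each result through `to_kcal_mol_angstrom` would give per-conformation energies and forces in units more common in molecular mechanics codes. Each molecule is loaded fully into memory, which may be undesirable for the largest files.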

Files (2.6 GB)

  • md5:6c2001f2bd3ac7d30f272fa57d24566b (243.8 MB)
  • md5:8e5bef2ceaf0444dc1088f655f546ca5 (20.8 MB)
  • md5:9bcaec20a7523661ddd00d06219cc166 (1.0 GB)
  • md5:558e9c8dae9f0701376bcd3fe90274a8 (43.1 MB)
  • md5:4d938f9082adb39d5e4d334c0ae7067c (15.5 MB)
  • md5:2e5f8f316264a9ad5b178b1ec05d2ffe (10.5 MB)
  • md5:8ad94cee38820c1d3cc7d2c0f213703a (141.4 MB)
  • md5:eaec5e337bbbde1ee6918f322b5687dd (10.3 MB)
  • md5:ea4ee05567b6411c0f5341dfef3aeede (57.3 MB)
  • md5:54393fb866549ff6bb71112d51947104 (1.0 GB)
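
Since each file is listed with an MD5 checksum, a download can be checked against it before use. The sketch below uses only the Python standard library. The function names and the chunk size are illustrative choices, and the pairing of a checksum with a particular file name is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:6c20...' or a bare hex digest."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum(downloaded_path, "md5:8e5bef2ceaf0444dc1088f655f546ca5")`, where `downloaded_path` is a hypothetical local path, returns True only if the file content matches the listed digest.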