choderalab/download-qca-datasets: QM datasets deposited in QCArchive to train espaloma-0.3

Kenichiro Takaba

doi:10.5281/zenodo.8148817

Published July 14, 2023 | Version 1.0.0

Dataset Open

This is a collection of QM datasets deposited in the QCArchive and used to train espaloma-0.3. A chemically diverse set of molecules, including small molecules, peptides, and nucleic acids, was downloaded, and their QM potential energies, forces, coordinates, atomic numbers, and canonical, isomeric, explicit-hydrogen mapped SMILES strings are stored as HDF5 files. All QM data were computed at the B3LYP-D3BJ/DZVP level of theory. Note that these datasets were downloaded from the legacy QCArchive server. The QM datasets are described in the following publication:

Kenichiro Takaba, Iván Pulido, Pavan Kumar Behara, Mike Henry, Hugo MacDermott Opeskin, John D. Chodera, Yuanqing Wang. "Machine-learned molecular mechanics force field for the simulation of protein-ligand systems and beyond" (arXiv:2307.07085)

Each HDF5 file is structured as follows.

There is one top-level group for each unique molecule. The group key is a molecular ID, an amino acid sequence, or a SMILES string.
Each group contains the following datasets. N is the number of atoms in the molecule and M is the number of conformations.
- subset: The name of the data subset the molecule is from.
- smiles: The canonical SMILES string for the molecule. It includes explicit hydrogens and atom indices.
- atomic_numbers: Array of length N containing the atomic number of every atom. They are ordered following the indices in the SMILES string.
- conformations: Array of shape (M, N, 3) containing the atomic coordinates for every conformation.
- dft_total_energy: Array of length M containing the energy of each conformation, without the dispersion correction. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
- dft_total_gradient: Array of shape (M, N, 3) containing the gradient of the energy with respect to the atomic coordinates, without the dispersion correction. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
- dispersion_correction_energy: Array of length M containing the dispersion correction energy of each conformation. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
- dispersion_correction_gradient: Array of shape (M, N, 3) containing the gradient of the dispersion correction energy with respect to the atomic coordinates. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
- total_energy: Array of length M containing the energy of each conformation, including the dispersion correction. This is only included in the OPTIMIZATION and TORSIONDRIVE datasets.
- total_gradient: Array of shape (M, N, 3) containing the gradient of the energy, including the dispersion correction, with respect to the atomic coordinates. This is only included in the OPTIMIZATION and TORSIONDRIVE datasets.
- rna_type: Category name of the RNA molecule (trinucleotide, base pair, or base triple). This is only included in RNA-DIVERSE-OPENFF-DEFAULT.hdf5.
All values are in atomic units: distances are in bohr, energies in hartree, and gradients in hartree/bohr.
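
The layout above can be read with h5py and NumPy. The following is a minimal sketch, not the authors' loading code. It assumes that the dispersion-corrected energy and gradient of the files without total_energy can be obtained by summing the DFT and dispersion-correction datasets, and it uses standard conversion factors for hartree to kcal/mol and bohr to angstrom. The function names are hypothetical.

```python
import numpy as np

# Standard conversion factors (CODATA-based); not part of the dataset itself.
HARTREE_TO_KCAL_PER_MOL = 627.509474
BOHR_TO_ANGSTROM = 0.529177210903


def total_energy_and_gradient(group):
    """Return (energy, gradient) including dispersion, in atomic units.

    `group` is an h5py.Group or any mapping from dataset name to array.
    """
    if "total_energy" in group:
        # OPTIMIZATION and TORSIONDRIVE files store the corrected totals directly.
        energy = np.asarray(group["total_energy"])
        gradient = np.asarray(group["total_gradient"])
    else:
        # Assumption: corrected total = DFT part + dispersion correction.
        energy = (np.asarray(group["dft_total_energy"])
                  + np.asarray(group["dispersion_correction_energy"]))
        gradient = (np.asarray(group["dft_total_gradient"])
                    + np.asarray(group["dispersion_correction_gradient"]))
    return energy, gradient


def relative_energies_kcal(group):
    """Energies relative to the lowest-energy conformation, in kcal/mol."""
    energy, _ = total_energy_and_gradient(group)
    return (energy - energy.min()) * HARTREE_TO_KCAL_PER_MOL


def iter_molecules(path):
    """Yield (key, coordinates in angstrom, relative energies in kcal/mol)."""
    import h5py  # imported here so the helpers above work without h5py installed

    with h5py.File(path, "r") as f:
        for key, group in f.items():
            coords = np.asarray(group["conformations"]) * BOHR_TO_ANGSTROM
            yield key, coords, relative_energies_kcal(group)
```

For example, `for key, coords, rel in iter_molecules("RNA-NUCLEOSIDE-OPENFF-DEFAULT.hdf5"): print(key, coords.shape, rel.max())` would print, for each molecule, the (M, N, 3) coordinate shape and the energy spread across its conformations.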

Files (2.6 GB)

Name	MD5	Size
GEN2-OPTIMIZATION-DATASET-OPENFF-DEFAULT.hdf5	6c2001f2bd3ac7d30f272fa57d24566b	243.8 MB
GEN2-TORSIONDRIVE-OPENFF-DEFAULT.hdf5	8e5bef2ceaf0444dc1088f655f546ca5	20.8 MB
PEPCONF-DLC-OPTIMIZATION-DATASET-OPENFF-DEFAULT.hdf5	9bcaec20a7523661ddd00d06219cc166	1.0 GB
PROTEIN-TORSIONDRIVE-OPENFF-DEFAULT.hdf5	558e9c8dae9f0701376bcd3fe90274a8	43.1 MB
RNA-DIVERSE-OPENFF-DEFAULT.hdf5	4d938f9082adb39d5e4d334c0ae7067c	15.5 MB
RNA-NUCLEOSIDE-OPENFF-DEFAULT.hdf5	2e5f8f316264a9ad5b178b1ec05d2ffe	10.5 MB
RNA-TRINUCLEOTIDE-OPENFF-DEFAULT.hdf5	8ad94cee38820c1d3cc7d2c0f213703a	141.4 MB
SPICE-DES-MONOMERS-OPENFF-DEFAULT.hdf5	eaec5e337bbbde1ee6918f322b5687dd	10.3 MB
SPICE-DIPEPTIDE-OPENFF-DEFAULT.hdf5	ea4ee05567b6411c0f5341dfef3aeede	57.3 MB
SPICE-PUBCHEM-OPENFF-DEFAULT.hdf5	54393fb866549ff6bb71112d51947104	1.0 GB
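
Because two of the files are about 1 GB each, it may be worth checking downloads against the MD5 checksums listed above before use. The following is a minimal sketch using only the Python standard library. The function names and the directory argument are hypothetical; the checksums are copied from the file list.

```python
import hashlib
import os

# MD5 checksums as listed for each file in this record.
EXPECTED_MD5 = {
    "GEN2-OPTIMIZATION-DATASET-OPENFF-DEFAULT.hdf5": "6c2001f2bd3ac7d30f272fa57d24566b",
    "GEN2-TORSIONDRIVE-OPENFF-DEFAULT.hdf5": "8e5bef2ceaf0444dc1088f655f546ca5",
    "PEPCONF-DLC-OPTIMIZATION-DATASET-OPENFF-DEFAULT.hdf5": "9bcaec20a7523661ddd00d06219cc166",
    "PROTEIN-TORSIONDRIVE-OPENFF-DEFAULT.hdf5": "558e9c8dae9f0701376bcd3fe90274a8",
    "RNA-DIVERSE-OPENFF-DEFAULT.hdf5": "4d938f9082adb39d5e4d334c0ae7067c",
    "RNA-NUCLEOSIDE-OPENFF-DEFAULT.hdf5": "2e5f8f316264a9ad5b178b1ec05d2ffe",
    "RNA-TRINUCLEOTIDE-OPENFF-DEFAULT.hdf5": "8ad94cee38820c1d3cc7d2c0f213703a",
    "SPICE-DES-MONOMERS-OPENFF-DEFAULT.hdf5": "eaec5e337bbbde1ee6918f322b5687dd",
    "SPICE-DIPEPTIDE-OPENFF-DEFAULT.hdf5": "ea4ee05567b6411c0f5341dfef3aeede",
    "SPICE-PUBCHEM-OPENFF-DEFAULT.hdf5": "54393fb866549ff6bb71112d51947104",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory, expected=EXPECTED_MD5):
    """Map each file name to True/False (checksum match) or None (file missing)."""
    results = {}
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            results[name] = None
        else:
            results[name] = md5_of_file(path) == checksum
    return results
```

Calling `verify_files("downloads")` on a directory holding the downloaded files returns a dictionary that flags any file whose checksum does not match or that is missing.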
