choderalab/download-qca-datasets: QM datasets deposited in QCArchive to train espaloma-0.3
Description
This is a collection of QM datasets deposited in the QCArchive and used to train espaloma-0.3. A chemically diverse set of molecules, including small molecules, peptides, and nucleic acids, was downloaded, and their QM potential energies, forces, coordinates, atomic numbers, and canonical isomeric explicit-hydrogen mapped SMILES are stored as HDF5 files. All QM data were computed at the B3LYP-D3BJ/DZVP level of theory. Note that these datasets were downloaded from the legacy QCArchive server. The QM datasets are described in the following publication:
Kenichiro Takaba, Iván Pulido, Pavan Kumar Behara, Mike Henry, Hugo MacDermott Opeskin, John D. Chodera, Yuanqing Wang. "Machine-learned molecular mechanics force field for the simulation of protein-ligand systems and beyond" (arXiv:2307.07085)
Each HDF5 file is structured as follows.
- There is one top-level group for each unique molecule. The group key is a molecular ID, an amino acid sequence, or a SMILES string.
- Each group contains the following datasets. N is the number of atoms in the molecule and M is the number of conformations.
  - subset: The name of the data subset the molecule is from.
  - smiles: The canonical SMILES string for the molecule. It includes explicit hydrogens and atom indices.
  - atomic_numbers: Array of length N containing the atomic number of every atom. The atoms are ordered following the indices in the SMILES string.
  - conformations: Array of shape (M, N, 3) containing the atomic coordinates for every conformation.
  - dft_total_energy: Array of length M containing the energy of each conformation, without the dispersion correction. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
  - dft_total_gradient: Array of shape (M, N, 3) containing the gradient of the energy with respect to the atomic coordinates, without the dispersion correction. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
  - dispersion_correction_energy: Array of length M containing the dispersion correction energy of each conformation. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
  - dispersion_correction_gradient: Array of shape (M, N, 3) containing the gradient of the dispersion correction energy with respect to the atomic coordinates. This is NOT included in the OPTIMIZATION and TORSIONDRIVE datasets.
  - total_energy: Array of length M containing the energy of each conformation, including the dispersion correction. This is only included in the OPTIMIZATION and TORSIONDRIVE datasets.
  - total_gradient: Array of shape (M, N, 3) containing the gradient of the energy, including the dispersion correction, with respect to the atomic coordinates. This is only included in the OPTIMIZATION and TORSIONDRIVE datasets.
  - rna_type: Category name of the RNA molecule (trinucleotide, base pair, or base triple). This is only included in RNA-DIVERSE-OPENFF-DEFAULT.hdf5.
- All values are in atomic units. Distances are in bohr and energies in hartree.
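The layout above can be read with h5py and NumPy. The following is a minimal sketch, not the authors' own loading code. The helper names (`energies_with_dispersion`, `forces_with_dispersion`, `summarize_file`) are hypothetical. It assumes that adding `dispersion_correction_energy` to `dft_total_energy` gives a quantity comparable to `total_energy` in the OPTIMIZATION and TORSIONDRIVE datasets, and that forces are the negative of the stored gradients. The conversion factors are CODATA 2018 values.

```python
import numpy as np

# CODATA 2018 conversion factors from atomic units.
HARTREE_TO_KCAL_PER_MOL = 627.5094740631
BOHR_TO_ANGSTROM = 0.529177210903


def energies_with_dispersion(group):
    """Return dispersion-corrected energies in hartree, shape (M,)."""
    if "total_energy" in group:  # OPTIMIZATION / TORSIONDRIVE layout
        return np.asarray(group["total_energy"], dtype=float)
    # Other datasets store the DFT and dispersion terms separately
    # (assumption: their sum is the dispersion-corrected energy).
    return (np.asarray(group["dft_total_energy"], dtype=float)
            + np.asarray(group["dispersion_correction_energy"], dtype=float))


def forces_with_dispersion(group):
    """Return forces in hartree/bohr, shape (M, N, 3)."""
    if "total_gradient" in group:
        gradient = np.asarray(group["total_gradient"], dtype=float)
    else:
        gradient = (np.asarray(group["dft_total_gradient"], dtype=float)
                    + np.asarray(group["dispersion_correction_gradient"], dtype=float))
    return -gradient  # force is the negative gradient of the energy


def summarize_file(path):
    """Print atom count, conformation count, and lowest energy per molecule."""
    import h5py  # third-party; imported here so the helpers above work without it

    with h5py.File(path, "r") as h5_file:
        for key, group in h5_file.items():
            energies = energies_with_dispersion(group) * HARTREE_TO_KCAL_PER_MOL
            coords = np.asarray(group["conformations"], dtype=float) * BOHR_TO_ANGSTROM
            n_conf, n_atoms, _ = coords.shape
            print(f"{key}: {n_atoms} atoms, {n_conf} conformations, "
                  f"min energy {energies.min():.2f} kcal/mol")
```

Calling `summarize_file("RNA-DIVERSE-OPENFF-DEFAULT.hdf5")` on a downloaded file would print one line per top-level group. The two helpers accept any mapping from dataset names to arrays, so they can also be tried on plain dictionaries.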
Files
(2.6 GB)
MD5 checksum | Size |
---|---|
6c2001f2bd3ac7d30f272fa57d24566b | 243.8 MB |
8e5bef2ceaf0444dc1088f655f546ca5 | 20.8 MB |
9bcaec20a7523661ddd00d06219cc166 | 1.0 GB |
558e9c8dae9f0701376bcd3fe90274a8 | 43.1 MB |
4d938f9082adb39d5e4d334c0ae7067c | 15.5 MB |
2e5f8f316264a9ad5b178b1ec05d2ffe | 10.5 MB |
8ad94cee38820c1d3cc7d2c0f213703a | 141.4 MB |
eaec5e337bbbde1ee6918f322b5687dd | 10.3 MB |
ea4ee05567b6411c0f5341dfef3aeede | 57.3 MB |
54393fb866549ff6bb71112d51947104 | 1.0 GB |
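A downloaded file can be checked against the MD5 checksums listed above. The sketch below uses only the Python standard library. The function names and the example file name are hypothetical, since this listing does not say which checksum belongs to which file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with a listed value, with or without the 'md5:' prefix."""
    expected = expected_md5.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For example, `matches_checksum("downloaded_file.hdf5", "md5:6c2001f2bd3ac7d30f272fa57d24566b")` returns `True` only if the local copy is byte-for-byte identical to the file with that checksum. `str.removeprefix` requires Python 3.9 or later.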