There is a newer version of the record available.

Published September 28, 2022 | Version 1.1.1
Dataset Open

SPICE 1.1.1

  • 1. Stanford University
  • 2. University of California, Irvine
  • 3. The Open Force Field Initiative
  • 4. Acellera Labs
  • 5. University of Notre Dame
  • 6. Newcastle University
  • 7. Memorial Sloan Kettering Cancer Center
  • 8. Virginia Polytechnic Institute and State University
  • 9. Weill Cornell Graduate School of Medical Sciences
  • 10. Universitat Pompeu Fabra

Description

SPICE (Small-Molecule/Protein Interaction Chemical Energies) is a collection of quantum mechanical data for training potential functions. The emphasis is particularly on simulating drug-like small molecules interacting with proteins. It is described in this publication:

Peter Eastman, Pavan Kumar Behara, David L. Dotson, Raimondas Galvelis, John E. Herr, Josh T. Horton, Yuezhi Mao, John D. Chodera, Benjamin P. Pritchard, Yuanqing Wang, Gianni De Fabritiis, and Thomas E. Markland. "SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials." https://doi.org/10.48550/arXiv.2209.10702 (2022).

The HDF5 file is structured as follows.

  • There is one top level group for each unique molecule or cluster. The name of each group is either a PubChem Substance ID (for PubChem molecules), an amino acid sequence (for dipeptides and solvated amino acids), or a SMILES string (for everything else).
  • Each group contains the following datasets, where N is the number of atoms in the molecule and M is the number of conformations. (Some groups may lack some of these datasets, for example if the MBIS calculation failed to converge.)
    • subset: The name of the data subset the molecule is from.
    • smiles: The canonical SMILES string for the molecule. It includes explicit hydrogens and atom indices.
    • atomic_numbers: Array of length N containing the atomic number of every atom. They are ordered following the indices in the SMILES string.
    • conformations: Array of shape (M, N, 3) containing the atomic coordinates for every conformation.
    • formation_energy: Array of length M containing the total energy of each conformation, minus the reference energies of the individual atoms when infinitely separated. This is the most useful energy for most purposes, since it contains all energy components that vary with atom positions but removes the large constant part corresponding to the internal energies of individual atoms.
    • dft_total_energy: Array of length M containing the total energy of each conformation, without the atomic reference energies subtracted.
    • dft_total_gradient: Array of shape (M, N, 3) containing the gradient of the energy with respect to the atomic coordinates.
    • mbis_charges: Array of shape (M, N, 1) containing the MBIS charge of each atom.
    • mbis_dipoles: Array of shape (M, N, 3) containing the MBIS dipole of each atom.
    • mbis_quadrupoles: Array of shape (M, N, 3, 3) containing the MBIS quadrupole of each atom.
    • mbis_octupoles: Array of shape (M, N, 3, 3, 3) containing the MBIS octupole of each atom.
    • scf_dipole: Array of shape (M, 3) containing the dipole of each molecule.
    • scf_quadrupole: Array of shape (M, 3, 3) containing the quadrupole of each molecule.
    • mayer_indices: Array of shape (M, N, N) containing the Mayer bond indices.
    • wiberg_lowdin_indices: Array of shape (M, N, N) containing the Wiberg bond indices using orthogonal Löwdin orbitals.
  • All values are in atomic units. Distances are in bohr and energies in hartree.
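The layout above can be read with any HDF5 library. The following is a minimal sketch in Python, assuming the third-party packages numpy and h5py are installed. The function names, the output dictionary keys, and the choice of target units (angstrom and kJ/mol) are illustrative assumptions, not part of the dataset. Forces are derived here as the negative of the stored gradient, and optional datasets are checked for before use because some groups may not contain them.

```python
import numpy as np

# Conversion factors from atomic units (CODATA 2018 values).
BOHR_TO_ANGSTROM = 0.529177210903
HARTREE_TO_KJ_PER_MOL = 2625.4996

def convert_group(group):
    """Convert one molecule group from atomic units.

    `group` may be an h5py.Group or any mapping from dataset name to
    array-like, so the function can be tried without the real file.
    """
    result = {
        "atomic_numbers": np.asarray(group["atomic_numbers"]),            # (N,)
        "positions_angstrom":
            np.asarray(group["conformations"]) * BOHR_TO_ANGSTROM,        # (M, N, 3)
        "formation_energy_kj_mol":
            np.asarray(group["formation_energy"]) * HARTREE_TO_KJ_PER_MOL,  # (M,)
    }
    # Some groups may be missing datasets, so test before reading.
    if "dft_total_gradient" in group:
        gradient = np.asarray(group["dft_total_gradient"])                # hartree/bohr
        # Force is the negative gradient of the energy.
        result["forces_kj_mol_angstrom"] = (
            -gradient * HARTREE_TO_KJ_PER_MOL / BOHR_TO_ANGSTROM
        )
    return result

def iter_molecules(path):
    """Yield (group_name, converted_data) for every top-level group."""
    import h5py  # imported lazily; only needed when opening a real file
    with h5py.File(path, "r") as hdf:
        for name, group in hdf.items():
            yield name, convert_group(group)
```

Iterating lazily over `iter_molecules(path)` avoids loading the whole 10.4 GB file into memory at once; only the datasets of the current group are read.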

Files (10.4 GB)

md5:5411e7014c6d18ff07d108c9ad820b53 (10.4 GB)
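
The MD5 checksum listed for the file can be used to confirm that a download completed without corruption. A minimal sketch using only the Python standard library follows; the function names are illustrative, and the local file path is whatever name the download was saved under (the `SPICE.hdf5` path in the comment is a hypothetical placeholder).

```python
import hashlib

# Checksum as listed on this record.
EXPECTED_MD5 = "5411e7014c6d18ff07d108c9ad820b53"

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected

# Example call (hypothetical local path):
# verify_download("SPICE.hdf5")
```

Reading in chunks keeps memory use constant regardless of file size, which matters for a file of this size.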