Published November 5, 2020 | Version 1.0.0
Dataset Open

Molecular geometries and energies from quantum mechanical calculations and small molecule force field evaluations.

  • 1. Department of Chemistry, University of California, Irvine
  • 2. Computational Chemistry, Janssen Research & Development, Turnhoutseweg 30, Beerse B-2340, Belgium
  • 3. OpenEye Scientific, Santa Fe, NM 87507

Description

Force fields are used in a wide variety of contexts for classical molecular simulation, including studies on protein-ligand binding, membrane permeation, and thermophysical property prediction.
The quality of these studies depends on the quality of the force fields used to represent the systems.
Focusing on small molecules of fewer than 50 heavy atoms, this dataset compares nine force fields: GAFF, GAFF2, MMFF94, MMFF94S, OPLS3e, SMIRNOFF99Frosst, and the Open Force Field Parsley, versions 1.0, 1.1, and 1.2.
On a dataset comprising 22,675 molecular structures of 3,271 molecules, we compared force field-optimized geometries and conformer energies against reference quantum mechanical (QM) data.

The data were generated with scripts from the benchmarkff GitHub repository.

A corresponding manuscript has been submitted; a preprint is available on ChemRxiv:
Lim, Victoria T.; Hahn, David F.; Tresadern, Gary; Bayly, Christopher I.; Mobley, David (2020): Benchmark Assessment of Molecular Geometries and Energies from Small Molecule Force Fields. ChemRxiv. Preprint

See below, or the file README.md, for further information and a description of the contents:

# README

Version: 04 Nov 2020

For Python scripts that are NOT found in these directories, please check the 
[BenchmarkFF Github repo](https://github.com/MobleyLab/benchmarkff/tree/master/tools).

## Procedure

1. Prepare the OPLS3e file for analysis: standardize the format with OpenEye in case of differences,
and convert energies from kJ/mol to kcal/mol.
```
cd 00_prep
python convert_extension.py -i opls3e_minimized.sd -o opls3e.sdf
```

2. Remove molecules that could not be parameterized by every force field, so that all force fields are compared on the same set.
```
python get_by_tag.py -i opls3e.sdf -s "SMILES QCArchive" -list trim3.txt -o trim3_full_opls3e.sdf
```

3. Run analysis.
```
conda activate parsley
# calc ddE, RMSD, and TFD distributions
python compare_ffs.py -i match.in -t 'SMILES QCArchive' --plot > metrics.out
# match_minima, only in 01_analysis_all and 02_analysis_all_smaller_cutoff
python match_minima.py -i match.in --plot --cutoff 1.0 --readpickle
# look at specific subsets, only in 01_analysis_all
python color_by_moiety.py -i match.in -p metrics.pickle -s N-N.dat azetidine.dat octahydrotetracene.dat -o scatter_tfd_3_ 
# look at outliers, only in 01_analysis_all and 02_analysis_all_smaller_cutoff
python tailed_parameters.py -i refdata_trim_overlap_full_openff_unconstrained-1.2.0.sdf  -f <offxml file>  --metric 'TFD' --cutoff 0.12 --tag "TFD to trim_overlap_full_qcarchive.sdf" --tag_smiles "SMILES QCArchive" > output_tfd.dat

```
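
Steps 1 and 2 above can be illustrated with a minimal, standard-library-only sketch. It is not the code of `convert_extension.py` or `get_by_tag.py`. The function names `kj_to_kcal`, `read_sd_tags`, and `keep_by_tag` are hypothetical. The sketch also assumes that the list file passed to `get_by_tag.py` holds the tag values to keep, which this README does not state.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie: 1 kcal = 4.184 kJ


def kj_to_kcal(energy_kj_per_mol):
    """Convert an energy from kJ/mol to kcal/mol."""
    return energy_kj_per_mol / KJ_PER_KCAL


def read_sd_tags(record):
    """Return the {tag: value} data items of one SD record."""
    tags = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        # A data header looks like '> <TAG NAME>'; its value follows on the
        # next lines and ends at the first blank line.
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            value = []
            while i < len(lines) and lines[i].strip():
                value.append(lines[i])
                i += 1
            tags[name] = "\n".join(value)
        i += 1
    return tags


def keep_by_tag(sdf_text, tag, allowed_values):
    """Keep only the SD records whose `tag` value is in `allowed_values`.

    Assumption: records are separated by a line containing only '$$$$'.
    """
    kept = []
    for record in sdf_text.split("$$$$\n"):
        if not record.strip():
            continue  # skip the empty chunk after the last terminator
        if read_sd_tags(record).get(tag) in allowed_values:
            kept.append(record + "$$$$\n")
    return "".join(kept)
```

For example, `keep_by_tag(text, "SMILES QCArchive", allowed)` would return only the records whose `SMILES QCArchive` value appears in the set `allowed`. Real SD files are better handled with a cheminformatics toolkit; the plain-text parsing here only shows the idea.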

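As a rough guide to two of the metrics computed in step 3, the sketch below shows one common way to define ddE (the difference between force field and QM relative conformer energies) and a plain coordinate RMSD. It is an illustration under stated assumptions, not the implementation in `compare_ffs.py`. The function names are hypothetical, the choice of the first conformer as the energy reference is an assumption, and the RMSD shown performs no alignment or symmetry handling. TFD is omitted because it requires torsion perception from a cheminformatics toolkit.

```python
import math


def relative_energies(energies, ref_index=0):
    """Energies relative to the conformer at `ref_index` (same units as input)."""
    ref = energies[ref_index]
    return [e - ref for e in energies]


def dde(ff_energies, qm_energies, ref_index=0):
    """ddE per conformer: relative FF energy minus relative QM energy.

    Assumption: both lists describe the same conformers in the same order.
    """
    ff_rel = relative_energies(ff_energies, ref_index)
    qm_rel = relative_energies(qm_energies, ref_index)
    return [f - q for f, q in zip(ff_rel, qm_rel)]


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z), without alignment."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))
```

With energies in kcal/mol, a ddE near zero for a conformer means the force field reproduces the QM relative energy of that conformer with respect to the chosen reference.
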
## Brief description of contents

* High level:
```
.
├── 00_prep
│   ├── convert_extension.py
│   ├── opls3e_minimized.sd             OPLS3e minimized structures from Schrodinger Maestro
│   ├── opls3e.sdf                      standardized through OpenEye tools
│   └── opt_openff*.sdf                 OpenFF minimized conformations
├── 01_analysis_all                     compare all ffs (qm, GAFF(2), MMFF94(S), Smirnoff, OpenFF-X.X, OPLS3e)
├── 02_analysis_all_smaller_cutoff      compare all ffs (qm, GAFF(2), MMFF94(S), Smirnoff, OpenFF-X.X, OPLS3e) with a smaller cutoff of 0.3 for match_minima
├── 03_analysis_latest_ffs              compare only the latest versions of ffs (qm, GAFF2, MMFF94S, OpenFF-1.2, OPLS3e)
├── 04_analysis_openff_only             compare only OpenFF ffs (qm, Smirnoff, OpenFF-X.X)
└── README.md
```

* Inside an output directory:
```
YY_analysis_*               various output files of above mentioned scripts, some are listed and described below:
├── bar*.png                parameter coverage bar plots
├── ddE.dat                 relative energies data
├── fig_density_*.png       scatter plots of ddE vs (RMSD or TFD) for each force field
├── match.in                input file for compare_ffs.py
├── metrics.out             output file for compare_ffs.py
├── metrics.pickle          pickle file for compare_ffs.py -- you can read this into compare_ffs instead of rerunning the full analysis
├── refdata_*.sdf           output SDF files with stored RMSD / TFD scores with reference to QM for each structure
├── relene_*.dat            relative energies of matched conformers
├── ridge_dde.png           compared energies plot
├── ridge_rmsd.svg          compared rmsds plot
├── ridge_tfd.svg           compared tfds plot
├── fig_scatter_*.png       scatter plots of ddE vs (RMSD or TFD). these are noisy; I don't use these
├── trim3_*.sdf             input SDF files for compare_ffs.py listed in match.in file
└── violin*.*               violin plot showing ddE distributions
```
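
If you want to inspect `metrics.pickle` outside of `compare_ffs.py`, a minimal sketch using Python's standard `pickle` module is shown below. The layout of the stored object is not documented in this README, so the sketch only reports its top-level shape. The helper names `load_metrics` and `summarize` are hypothetical. Loading may also require the packages that were installed when the file was written, and pickle files should only be loaded from sources you trust.

```python
import pickle


def load_metrics(path):
    """Load a pickled object from `path` (only unpickle files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Describe the top-level shape of an object whose layout is unknown."""
    if isinstance(obj, dict):
        return f"dict with keys {sorted(map(str, obj))}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} of length {len(obj)}"
    return type(obj).__name__
```

A call such as `summarize(load_metrics("metrics.pickle"))` gives a first look at the contents before you decide how to process them.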

Files (893.8 MB)

md5:7c2f9446743ac24aeb59ae7ec5e37276

Additional details

Funding

National Institutes of Health
Alchemical free energy methods for efficient drug lead optimization 1R01GM108889-01
National Institutes of Health
Advancing predictive physical modeling through focused development of model systems to drive new modeling innovations 1R01GM124270-01A1