Published November 22, 2018 | Version 1.0
Dataset Open

Group contribution models for heat of formation (SUB 2018)

  • 1. University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry, 24 Tsar Assen St., Plovdiv 4000, Bulgaria

Description

We present a set of group contribution models for predicting the heat of formation of organic compounds. A dataset of 1004 molecular structures from the DIPPR database was split into a learning set and a test set, which were used for model training and validation. Model building was performed with the Ambit-GCM software (https://doi.org/10.5281/zenodo.1470793). A set of preliminary models was built according to various fragmentation schemes, with and without correction factors and external descriptors. Different orders of additive schemes were studied. Every model in the set was validated using the leave-one-out procedure and the Y-scrambling technique, and model performance was tested on the external dataset. Full data and the corresponding statistical characteristics for the five best models are available in models.zip. Model 2 is also available as a JSON file in the archive and can be used for theoretical prediction of the heat of formation.
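For readers unfamiliar with additive schemes, the following is a minimal Python sketch of a generic first-order group contribution estimate: the property is taken as an optional constant plus the sum of each group's count multiplied by its contribution. It is not the Ambit-GCM implementation. The function name, the group labels "A" and "B", and all numeric values are hypothetical placeholders and are not taken from models.zip.

```python
def predict_additive(group_counts, contributions, intercept=0.0):
    """Generic first-order additive estimate: intercept + sum(count * contribution)."""
    missing = [g for g in group_counts if g not in contributions]
    if missing:
        # A fragment without a fitted contribution cannot be estimated.
        raise KeyError(f"no contribution for groups: {missing}")
    return intercept + sum(n * contributions[g] for g, n in group_counts.items())


# Placeholder numbers for illustration only; they are NOT values from the published models.
contributions = {"A": -10.0, "B": -25.0}
counts = {"A": 2, "B": 1}  # hypothetical fragmentation of one molecule
value = predict_additive(counts, contributions)
```

Higher-order schemes, correction factors and external descriptors would add further terms to the same sum; how Ambit-GCM defines them is described in the software record, not here.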

To use model 2, please download gcm-predict.jar from https://doi.org/10.5281/zenodo.1470793. An example application of the gcm-predict (Ambit-GCM) module to a single structure is given below:

java -jar gcm-predict-v1.1.jar -s CC(C)OCC(C)O -c model_2.json
GCM value (Hf) for CC(C)OCC(C)O is -528.7163407123614

The gcm-predict (Ambit-GCM) module can also be applied to a set of structures. An example follows with 5 molecules supplied as a *.csv file:
 
java -jar gcm-predict-v1.1.jar -i Prediction_Examples.csv -c model_2.json
GCM calculateting property Hf for 5 molecules ...
Mol#,ModelValue(Hf),SMILES,CalcStatus
1,-126.23055043388635,CCCC,OK
2,-353.4883670381994,c1ccc(c(c1)O)O,OK
3,-524.1758445616103,CCC(CO)O,OK
4,-220.91676720730212,CC(Cc1ccccc1)O,OK
5,-728.1309488149211,C(C(Cl)(Cl)F)(F)F,OK

Each output line contains the molecule number, the predicted Hf, the SMILES string, and the calculation status.
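If the batch output is captured as text, it can be read back with a few lines of Python. The sketch below assumes the column header is exactly the one shown in the example above and that progress messages precede it. The function name parse_gcm_output is hypothetical. How gcm-predict reports a failed calculation is not documented here, so a non-numeric value is simply kept as None.

```python
import csv
import io


def parse_gcm_output(text):
    """Parse captured gcm-predict batch output into a list of dicts."""
    lines = text.splitlines()
    # Skip progress messages: the table starts at the header line.
    start = next((i for i, line in enumerate(lines) if line.startswith("Mol#,")), None)
    if start is None:
        raise ValueError("header line starting with 'Mol#,' not found")
    reader = csv.DictReader(io.StringIO("\n".join(lines[start:])))
    rows = []
    for row in reader:
        try:
            hf = float(row["ModelValue(Hf)"])
        except (TypeError, ValueError):
            hf = None  # assumption: non-numeric means no prediction was produced
        rows.append({
            "mol": int(row["Mol#"]),
            "hf": hf,
            "smiles": row["SMILES"],
            "status": row["CalcStatus"],
        })
    return rows


# First two result lines copied from the example output above.
sample = """GCM calculateting property Hf for 5 molecules ...
Mol#,ModelValue(Hf),SMILES,CalcStatus
1,-126.23055043388635,CCCC,OK
2,-353.4883670381994,c1ccc(c(c1)O)O,OK
"""
rows = parse_gcm_output(sample)
```

Each entry in rows then carries the four fields under the keys mol, hf, smiles and status.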

The full data for training and validation are available in learn-test-sets.zip. The data can be used for retraining or improving the models.

More examples of using the Ambit-GCM software for group contribution modeling and property prediction are available at https://doi.org/10.5281/zenodo.1471646.

Files (3.1 MB)

  • md5:734a66692762e9dc51d6909373aee92b (2.8 MB)
  • md5:272c495d6d6f526480962041646b45db (216.4 kB)
  • md5:fb04eabf857b774dc58f23d918357e38 (130 Bytes)
  • md5:5b1b6cab013d6c68cb2f9786b686d888 (58.2 kB)
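A downloaded file can be checked against its listed md5 value with the Python standard library. This is a minimal sketch; the function names are hypothetical, and the pairing of each checksum with a file name has to be taken from the record page, since it is not given in the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:734a...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listed_md5("models.zip", "md5:...") would be called with the checksum listed for that file; a False result suggests an incomplete or corrupted download.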