Calculated state-of-the-art results for solvation and ionization energies of thousands of organic molecules relevant to battery design
Description
This dataset presents molecular properties critical for battery electrolyte design, specifically solvation energies, ionization potentials, and electron affinities. The dataset is intended for use in machine learning model testing and algorithm validation. The properties calculated include solvation energies using the COSMO-RS method [1] and ionization potentials and electron affinities using various high-accuracy computational methods as implemented in MOLPRO [2]. Computational details can be found in Ref. [3]; most of the scripts used to generate the data are available in our GitHub repository [4].
Molecular Datasets Considered:
- QM9 Dataset: Contains small organic molecules broadly relevant for quantum chemistry. [5]
- Electrolyte Genome Project (EGP): Focuses on materials relevant to electrolytes. [6]
- GDB17 and ZINC databases: Offer a broad chemical diversity with potential application in battery technologies. [7, 8]
Data structure
How to Load the Data:
All files can be loaded with

    import json
    with open("file.json", "r") as f:
        data_dict = json.load(f)

and the file structure can be explored with

    data_dict.keys()

We have also added an example script in Python that shows how to extract all data from the JSON files following this link
Note that the file structure of the AMONS JSON files is slightly different, as explained below!
Solvation energies
The data is stored in two types of JSON archives: files for full molecules of GDB17 and ZINC and files for amons of GDB17 and ZINC. They are structured differently, as amon entries are sorted by the number of heavy atoms in the amon (e.g., all amons with 3 heavy atoms are stored in ni3). Because of the large number of amons with 6 or 7 heavy atoms, they are further split into ni6_1, ni6_2, and so on. A sub dictionary of an amon dictionary or a full molecule dictionary contains the following keys:
ECFP - ECFP4 representation vector
SMILES - SMILES string
SYMBOLS - atomic symbols
COORDS - atomic positions in Angstrom
ATOMIZATION - atomization energy in [kcal/mol]
DIPOLE - dipole moment in Debye
ENERGY - energy in Hartree
SOLVATION - solvation energy in [kcal/mol] for different solvents at 300 K.
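
Because amon files add one extra nesting level (ni3, ni6_1, ...) compared with full-molecule files, a loader that handles both has to be indifferent to depth. The following is a minimal sketch of one way to do that, not the authors' own script: it walks any nested dictionary and yields every sub dictionary that carries a SMILES key. The function name iter_entries and the entry names mol_a, mol_b, mol_c are hypothetical, and the synthetic dictionaries only imitate the layout described above; the exact nesting of the real files should be checked with data_dict.keys().

```python
from typing import Any, Iterator, Tuple


def iter_entries(
    node: Any, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], dict]]:
    """Yield (path, sub_dict) for every dictionary that has a "SMILES" key."""
    if isinstance(node, dict):
        if "SMILES" in node:
            # Reached a property dictionary; do not descend further.
            yield path, node
        else:
            for key, child in node.items():
                yield from iter_entries(child, path + (str(key),))


# Tiny synthetic stand-ins for the two layouts (all values are made up).
full_like = {"mol_a": {"SMILES": "CO", "ENERGY": -1.0}}
amon_like = {
    "ni3": {"mol_b": {"SMILES": "CCO", "ENERGY": -2.0}},
    "ni6_1": {"mol_c": {"SMILES": "CCCCCC", "ENERGY": -3.0}},
}

full_paths = [path for path, _ in iter_entries(full_like)]
amon_paths = [path for path, _ in iter_entries(amon_like)]
print(full_paths)
print(amon_paths)
```

With real data the same call would be iter_entries(data_dict); the returned path tells which heavy-atom group (if any) an entry came from.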
Files:
GDB17.json.zip (unpack first with unzip GDB17.json.zip) - subset of GDB17 random molecules
AMONS_ZINC.json - all amons of ZINC up to 7 heavy atoms
EGP.json - EGP molecules
AMONS_GDB17.json - all amons of GDB17 up to 7 heavy atoms
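
As an alternative to unpacking GDB17.json.zip on the command line, the archive can also be read from Python. This is a sketch using only the standard library; the helper name load_json_from_zip is hypothetical, and it assumes the archive holds a single .json member (the member name is not stated on this page, so the code falls back to the first .json entry it finds).

```python
import json
import zipfile


def load_json_from_zip(zip_path, member=None):
    """Load one JSON member from a zip archive without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        if member is None:
            # Assumption: use the first .json member found in the archive.
            member = next(n for n in archive.namelist() if n.endswith(".json"))
        with archive.open(member) as handle:
            # json.load accepts the binary file object returned by archive.open.
            return json.load(handle)


# Example call (requires the downloaded archive):
# data_dict = load_json_from_zip("GDB17.json.zip")
```

The whole JSON document is still parsed into memory at once, so the memory footprint is the same as for the unzip-then-load route.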
File Name | Description | Molecules |
AMONS_GDB17.json | GDB17 amons | 37860 |
AMONS_ZINC.json | ZINC amons | 88771 |
GDB17.json | Subset of GDB17 | 309468 |
EGP.json | EGP molecules | 18362 |
Atomic energies $E_{at}$ at BP and def2-TZVPD level in Hartree [Ha]
Element | H | C | N | O | F | Br | Cl | S | P | B | Si |
Eat [Ha] | -0.5 | -37.85 | -54.60 | -75.09 | -99.77 | -2574.40 | -460.20 | -398.16 | -341.30 | -24.65 | -289.40 |
We follow the convention that atomization energies are negative when the molecule is stable relative to the isolated atoms:
$E_{atomization} = E_{mol} - \sum_{i} E_{at,i}$
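
The convention above can be illustrated with a short sketch that combines a total ENERGY (Hartree) and the SYMBOLS list with the tabulated atomic energies, then converts to kcal/mol. The function name and the example energy are made up for illustration, and the conversion factor 627.5095 kcal/mol per Hartree is supplied here, not taken from the dataset. Because the atomic energies in the table are rounded to two decimals, the result will only approximate the stored ATOMIZATION values.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # conversion factor (assumption of this sketch)

# Atomic energies in Hartree, copied from the table above (rounded values).
ATOMIC_ENERGIES_HA = {
    "H": -0.5, "C": -37.85, "N": -54.60, "O": -75.09, "F": -99.77,
    "Br": -2574.40, "Cl": -460.20, "S": -398.16, "P": -341.30,
    "B": -24.65, "Si": -289.40,
}


def atomization_energy_kcal(e_mol_ha, symbols):
    """E_atomization = E_mol - sum_i E_at,i, returned in kcal/mol."""
    e_atoms_ha = sum(ATOMIC_ENERGIES_HA[symbol] for symbol in symbols)
    return (e_mol_ha - e_atoms_ha) * HARTREE_TO_KCAL_PER_MOL


# Made-up total energy for a three-atom molecule (O, H, H), illustration only.
example = atomization_energy_kcal(-76.45, ["O", "H", "H"])
print(f"{example:.1f} kcal/mol")  # negative: bound relative to isolated atoms
```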
Free energy of solvation at 300 K in [kcal/mol]:
Ionization potentials and electron affinities
The upload contains two JSON files, QM9IPEA.json and QM9IPEA_atom_ens.json. QM9IPEA.json summarizes the MOLPRO calculation data, grouped under the following dictionary keys:
COORDS - atom coordinates in Angstroms.
SYMBOLS - atom element symbols.
ENERGY - total energies for each charge (0, -1, 1) and method considered.
CPU_TIME - CPU times (in seconds) spent at each step of each part of the calculation.
DISK_USAGE - highest total disk usage in GB.
ATOMIZATION_ENERGY - atomization energy at charge 0.
QM9_ID - ID of the molecule in the QM9 dataset.
All energies are given in Hartrees with NaN indicating the calculation failed to converge. Ionization potentials and electron affinities can be recovered as energy differences between neutral and charged (+1 for ionization potentials, -1 for electron affinities) species.
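
The energy differences described above can be written out as a small sketch. It takes the total energies as plain numbers, since the exact nesting of the ENERGY entries by charge and method is not spelled out here and should be inspected in the file. The function names, the example energies, and the eV conversion factor are assumptions of this sketch; the sign convention (electron affinity positive when the anion is lower in energy than the neutral species) is the common one and may differ from the one used in Ref. [3].

```python
import math

HARTREE_TO_EV = 27.211386  # conversion factor (assumption of this sketch)


def ionization_potential(e_neutral, e_cation):
    """IP = E(charge +1) - E(charge 0), in the units of the inputs."""
    return e_cation - e_neutral


def electron_affinity(e_neutral, e_anion):
    """EA = E(charge 0) - E(charge -1), in the units of the inputs."""
    return e_neutral - e_anion


# Made-up total energies in Hartree for one method, illustration only.
e0, e_plus, e_minus = -40.50, -40.05, -40.52

ip_ev = ionization_potential(e0, e_plus) * HARTREE_TO_EV
ea_ev = electron_affinity(e0, e_minus) * HARTREE_TO_EV

# A NaN energy (unconverged calculation) propagates to the derived quantity.
failed = ionization_potential(e0, float("nan"))
print(ip_ev, ea_ev, math.isnan(failed))
```

Letting NaN propagate keeps failed calculations visible downstream; they can then be filtered with math.isnan before statistics are computed.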
"CPU_TIME" entries contain steps corresponding to individual method calculations, as well as steps corresponding to program operation: "INT" (calculating integrals over basis functions relevant for the calculation), "FILE" (dumping intermediate data to the restart file), and "RESTART" (importing restart data). The latter two steps appear because we reused relevant integrals calculated for the neutral species in the calculations for the charged species; we also used the restart functionality to take the HF density matrix obtained for the neutral species as the initial density matrix guess for the SCF-HF calculation of the charged species. A NaN CPU time value means that the step was not present or that the calculation is invalid. Note that the CPU times were measured while parallelizing on 12 cores and were not adjusted to single-core values.
QM9IPEA_atom_ens.json contains the atomic energies used to calculate the atomization energies in QM9IPEA.json. Its dictionary keys are:
SPINS - the spin assigned to elements during calculations of atomic energies.
ENERGY - energies of atoms using different methods.
(Note that H has only one electron and thus does not require a level of theory beyond Hartree-Fock.)
NOTE: Additional calculations were performed between publication of arXiv:2308.11196 and creation of this upload. For the version of the dataset used in the manuscript, please refer to DOI:10.5281/zenodo.8252498.
Acknowledgement
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 957189 (BIG-MAP) and No. 957213 (BATTERY 2030+). O.A.v.L. has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 772834). O.A.v.L. has received support as the Ed Clark Chair of Advanced Materials and as a Canada CIFAR AI Chair. O.A.v.L. acknowledges that this research is part of the University of Toronto’s Acceleration Consortium, which receives funding from the Canada First Research Excellence Fund (CFREF). Obtaining the presented computational results has been facilitated using the queueing system implemented at https://leruli.com. The project has been supported by the Swedish Research Council (Vetenskapsrådet), and the Swedish National Strategic e-Science program eSSENCE as well as by computing resources from the Swedish National Infrastructure for Computing (SNIC/NAISS).
References
[1] Klamt, A.; Eckert, F. COSMO-RS: a novel and efficient method for the a priori prediction of thermophysical data of liquids. Fluid Phase Equilibria 2000, 172, 43–72
[2] Werner, H.-J.; Knowles, P. J.; Knizia, G.; Manby, F. R.; Schütz, M. Molpro: a general-purpose quantum chemistry program package. WIREs Comput. Mol. Sci. 2012, 2, 242–253
[3] arxiv link of draft
[4] https://github.com/chemspacelab/ViennaUppDa
[5] Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1, 140022
[6] Qu, X.; Jain, A.; Rajput, N. N.; Cheng, L.; Zhang, Y.; Ong, S. P.; Brafman, M.; Maginn, E.; Curtiss, L. A.; Persson, K. A. The Electrolyte Genome Project: A big data approach in battery materials discovery. Comput. Mater. Sci. 2015, 103, 56–67
[7] Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. Journal of Chemical Information and Modeling 2012, 52, 2864–2875
[8] Irwin, J. J.; Shoichet, B. K. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. Journal of Chemical Information and Modeling 2005, 45, 177–182
Files (5.3 GB)
MD5 checksum | Size |
md5:91bcf59bebbeace16d75edcc6d387e46 | 757.2 MB |
md5:78adb8de46d36751e04275a432c6150a | 2.1 GB |
md5:7d5f186e09a1e3e278d4e92bc15f88ae | 470.7 MB |
md5:8e597ee94d4d0b95f46d599dc5d15b11 | 2.0 GB |
md5:411cb481741e9053995531a9c427278f | 10.3 MB |
md5:04974652ac61563d399b3f593e8b9660 | 905 Bytes |
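
Given the multi-gigabyte file sizes, it can be worth checking a download against the MD5 checksums listed above. The following sketch uses only the standard library and reads the file in chunks so that it does not need to fit in memory; the function name md5_of_file is hypothetical, and which checksum belongs to which file has to be taken from the record's file listing, since that pairing is not reproduced here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (requires a downloaded file); compare the result with the
# hex string that follows "md5:" in the listing:
# print(md5_of_file("EGP.json"))
```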
Additional details
Software
- Repository URL: https://github.com/chemspacelab/ViennaUppDa
- Programming language: Python