There is a newer version of the record available.

Published October 30, 2024 | Version v2
Dataset Open

Calculated state-of-the-art results for solvation and ionization energies of thousands of organic molecules relevant to battery design

  • 1. École Polytechnique Fédérale de Lausanne
  • 2. University of Vienna
  • 3. Uppsala University
  • 4. University of Toronto

Description

This dataset presents molecular properties critical for battery electrolyte design, specifically solvation energies, ionization potentials, and electron affinities. The dataset is intended for use in machine learning model testing and algorithm validation. The calculated properties include solvation energies obtained with the COSMO-RS method [1], and ionization potentials and electron affinities obtained with various high-accuracy computational methods as implemented in MOLPRO [2]. Computational details can be found in Ref. [3]; most of the scripts used to generate the data are available in our GitHub repository [4].

Molecular Datasets Considered:

  • QM9 Dataset: Contains small organic molecules broadly relevant for quantum chemistry [5].

  • Electrolyte Genome Project (EGP): Focuses on materials relevant to electrolytes [6].

  • GDB17 and ZINC databases: Offer a broad chemical diversity with potential application in battery technologies [7, 8].

Data structure

How to Load the Data:

All files can be loaded with


import json

with open("file.json", "r") as f:
    data_dict = json.load(f)


and the file structure can be explored with

data_dict.keys()

We have also added an example script in Python that shows how to extract all data from the JSON files, available via this link:

How to extract the data

Note that the file structure of the AMONS JSON files is slightly different, as explained below!

Solvation energies

The data is stored in two types of JSON archives: files for full molecules of GDB17 and ZINC and files for amons of GDB17 and ZINC. They are structured differently, as amon entries are sorted by the number of heavy atoms in the amon (e.g., all amons with 3 heavy atoms are stored in ni3). Because of the large number of amons with 6 or 7 heavy atoms, they are further split into ni6_1, ni6_2, and so on. A sub-dictionary of an amon dictionary or a full molecule dictionary contains the following keys:

ECFP - ECFP4 representation vector

SMILES - SMILES string

SYMBOLS - atomic symbols

COORDS - atomic positions in Angstrom

ATOMIZATION - atomization energy in [kcal/mol]

DIPOLE - dipole moment in Debye

ENERGY - energy in Hartree

SOLVATION - solvation energy in [kcal/mol] for different solvents at 300 K.
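The amon layout described above can be walked with a short helper. The following is a minimal sketch, assuming that each top-level key such as ni3 or ni6_1 maps to a dictionary of entries carrying the keys listed above. The helper names, the key pattern, and the toy dictionary are hypothetical illustrations and are not taken from the dataset or its repository.

```python
import re

# Matches group names such as "ni3" or "ni6_2". The pattern is inferred from
# the description above (assumption, not checked against the actual files).
_GROUP_RE = re.compile(r"^ni(\d+)(?:_(\d+))?$")


def heavy_atom_count(group_key):
    """Return the number of heavy atoms encoded in a group key, or None."""
    match = _GROUP_RE.match(group_key)
    return int(match.group(1)) if match else None


def iter_amon_entries(amon_dict):
    """Yield (heavy_atoms, entry_id, entry) for every amon in every group."""
    for group_key, group in amon_dict.items():
        n_heavy = heavy_atom_count(group_key)
        if n_heavy is None:
            continue  # skip keys that do not follow the ni<N> pattern
        for entry_id, entry in group.items():
            yield n_heavy, entry_id, entry


# Toy stand-in for a loaded AMONS_*.json file (all values are made up).
toy = {
    "ni3": {"a": {"SMILES": "CCO"}},
    "ni6_1": {"b": {"SMILES": "c1ccccc1"}},
    "ni6_2": {"c": {"SMILES": "CCCCCC"}},
}
smiles_by_size = {}
for n_heavy, _, entry in iter_amon_entries(toy):
    smiles_by_size.setdefault(n_heavy, []).append(entry["SMILES"])
print(smiles_by_size)
```

Merging the split groups (ni6_1, ni6_2, ...) by heavy-atom count in this way lets amon files be processed with the same per-entry code as the full-molecule files, provided the assumed nesting holds.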

 

Files:

 

GDB17.json.zip (unpack first with unzip GDB17.json.zip) - random subset of GDB17 molecules

AMONS_ZINC.json - all amons of ZINC up to 7 heavy atoms

EGP.json - EGP molecules

AMONS_GDB17.json - all amons of GDB17 up to 7 heavy atoms

 

File name          Description       Molecules
AMONS_GDB17.json   GDB17 amons       37860
AMONS_ZINC.json    ZINC amons        88771
GDB17.json         Subset of GDB17   309468
EGP.json           EGP molecules     18362
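As an alternative to unpacking GDB17.json.zip on disk, the archive can be read directly from Python. The following is a minimal sketch using only the standard library; the function name is hypothetical, and it assumes the archive contains a single .json member (the member name inside the archive is not stated here, so it is detected rather than hard-coded).

```python
import json
import zipfile


def load_json_from_zip(zip_path, member=None):
    """Load one JSON member of a zip archive without unpacking it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        if member is None:
            # Assumption: the archive holds exactly one .json file.
            json_members = [n for n in archive.namelist() if n.endswith(".json")]
            if len(json_members) != 1:
                raise ValueError(f"expected one .json member, found {json_members}")
            member = json_members[0]
        with archive.open(member) as f:
            return json.load(f)


# Usage (requires the downloaded archive in the working directory):
# data_dict = load_json_from_zip("GDB17.json.zip")
```

The whole JSON document is still parsed into memory, so this saves disk space but not RAM.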

Atomic energies $E_{at}$ at BP and def2-TZVPD level in Hartree [Ha]

Element    H      C        N        O        F        Br         Cl        S         P         B        Si
Eat [Ha]   -0.5   -37.85   -54.60   -75.09   -99.77   -2574.40   -460.20   -398.16   -341.30   -24.65   -289.40

We follow the convention that a negative atomization energy indicates stability relative to the isolated atoms:

$E_{atomization} = E_{mol} - \sum_{i} E_{at,i}$
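The formula above can be applied to a molecule's ENERGY and SYMBOLS entries as follows. This is a minimal sketch: the atomic energies are copied from the table above, the function name is hypothetical, and the molecular energy in the example is made up. Because the tabulated atomic energies are rounded to two decimals, results computed this way can deviate by several kcal/mol from the stored ATOMIZATION values.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # standard conversion factor

# Atomic energies in Hartree, copied from the table above.
E_AT = {
    "H": -0.5, "C": -37.85, "N": -54.60, "O": -75.09, "F": -99.77,
    "Br": -2574.40, "Cl": -460.20, "S": -398.16, "P": -341.30,
    "B": -24.65, "Si": -289.40,
}


def atomization_energy(e_mol_hartree, symbols):
    """E_atomization = E_mol - sum_i E_at,i, returned in kcal/mol."""
    e_atoms = sum(E_AT[s] for s in symbols)
    return (e_mol_hartree - e_atoms) * HARTREE_TO_KCAL_PER_MOL


# Hypothetical three-atom example; the molecular energy is made up.
symbols = ["O", "H", "H"]
e_mol = -76.45
e_atomization = atomization_energy(e_mol, symbols)
print(e_atomization)  # negative, i.e. bound relative to the isolated atoms
```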


Free energy of solvation at 300 K in [kcal/mol]:

Ionization potentials and electron affinities

The upload contains two JSON files, QM9IPEA.json and QM9IPEA_atom_ens.json. QM9IPEA.json summarizes the MOLPRO calculation data, grouped under the following dictionary keys:

COORDS - atom coordinates in Angstroms.

SYMBOLS - atom element symbols.

ENERGY - total energies for each charge (0, -1, 1) and method considered.

CPU_TIME - CPU times (in seconds) spent at each step of each part of the calculation.

DISK_USAGE - highest total disk usage in GB.

ATOMIZATION_ENERGY - atomization energy at charge 0.

QM9_ID - ID of the molecule in the QM9 dataset.

 

All energies are given in Hartrees with NaN indicating the calculation failed to converge. Ionization potentials and electron affinities can be recovered as energy differences between neutral and charged (+1 for ionization potentials, -1 for electron affinities) species.
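The energy differences mentioned above can be evaluated as in the following minimal sketch. It deliberately takes three plain numbers rather than indexing into the ENERGY entry, since the exact nesting by charge and method is not spelled out here. The function name, the sign convention for the electron affinity, and the example energies are assumptions made for illustration.

```python
import math

HARTREE_TO_EV = 27.211386  # standard conversion factor


def ip_and_ea(e_neutral, e_cation, e_anion):
    """Return (IP, EA) in eV from total energies in Hartree.

    Assumed sign convention: IP = E(+1) - E(0), EA = E(0) - E(-1).
    NaN inputs (non-converged calculations) propagate to NaN outputs.
    """
    ip = (e_cation - e_neutral) * HARTREE_TO_EV
    ea = (e_neutral - e_anion) * HARTREE_TO_EV
    return ip, ea


# Made-up energies for illustration; the anion calculation is marked as failed.
ip, ea = ip_and_ea(e_neutral=-40.40, e_cation=-39.95, e_anion=float("nan"))
print(ip, math.isnan(ea))
```

Letting NaN propagate keeps failed calculations visible in downstream arrays, where they can be filtered with math.isnan or an equivalent array operation.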

"CPU_TIME" entries contain steps corresponding to individual method calculations, as well as steps corresponding to program operation: "INT" (calculating integrals over basis functions relevant for the calculation), "FILE" (dumping intermediate data to the restart file), and "RESTART" (importing restart data). The latter two steps appear because we reused relevant integrals calculated for the neutral species in the charged species' calculations; we also used the restart functionality to take the HF density matrix obtained for the neutral species as the initial density matrix guess for the SCF-HF calculation of the charged species. A NaN CPU time value means that the step was not present or that the calculation is invalid. Note that the CPU times were measured while parallelizing on 12 cores and were not adjusted to single-core values.

 

QM9IPEA_atom_ens.json contains the atomic energies used to calculate the atomization energies in QM9IPEA.json. The dictionary keys are:

SPINS - the spin assigned to elements during calculations of atomic energies.

ENERGY - energies of atoms using different methods.

 

(Note that H has only one electron and thus does not require a level of theory beyond Hartree-Fock.)

NOTE: Additional calculations were performed between publication of arXiv:2308.11196 and creation of this upload. For the version of the dataset used in the manuscript, please refer to DOI:10.5281/zenodo.8252498.

Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 957189 (BIG-MAP) and No. 957213 (BATTERY 2030+). O.A.v.L. has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 772834). O.A.v.L. has received support as the Ed Clark Chair of Advanced Materials and as a Canada CIFAR AI Chair. O.A.v.L. acknowledges that this research is part of the University of Toronto’s Acceleration Consortium, which receives funding from the Canada First Research Excellence Fund (CFREF). Obtaining the presented computational results has been facilitated using the queueing system implemented at https://leruli.com. The project has been supported by the Swedish Research Council (Vetenskapsrådet), and the Swedish National Strategic e-Science program eSSENCE as well as by computing resources from the Swedish National Infrastructure for Computing (SNIC/NAISS).

 

References

[1] Klamt, A.; Eckert, F. COSMO-RS: a novel and efficient method for the a priori prediction of thermophysical data of liquids. Fluid Phase Equilibria 2000, 172, 43–72

[2] Werner, H.-J.; Knowles, P. J.; Knizia, G.; Manby, F. R.; Schütz, M. Molpro: a general-purpose quantum chemistry program package. WIREs Comput. Mol. Sci. 2012, 2, 242–253

[3] arxiv link of draft

[4] https://github.com/chemspacelab/ViennaUppDa

[5] Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1, 140022

[6] Qu, X.; Jain, A.; Rajput, N. N.; Cheng, L.; Zhang, Y.; Ong, S. P.; Brafman, M.; Maginn, E.; Curtiss, L. A.; Persson, K. A. The Electrolyte Genome Project: A big data approach in battery materials discovery. Comput. Mater. Sci. 2015, 103, 56–67

[7] Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. Journal of Chemical Information and Modeling 2012, 52, 2864–2875

[8] Irwin, J. J.; Shoichet, B. K. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. Journal of Chemical Information and Modeling 2005, 45, 177–182

Files (5.3 GB)

Size        Checksum
757.2 MB    md5:91bcf59bebbeace16d75edcc6d387e46
2.1 GB      md5:78adb8de46d36751e04275a432c6150a
470.7 MB    md5:7d5f186e09a1e3e278d4e92bc15f88ae
2.0 GB      md5:8e597ee94d4d0b95f46d599dc5d15b11
10.3 MB     md5:411cb481741e9053995531a9c427278f
905 Bytes   md5:04974652ac61563d399b3f593e8b9660

Additional details

Funding

European Commission
BIG MAP 957189
European Commission
BATTERY 2030+ 957213

Software

Repository URL
https://github.com/chemspacelab/ViennaUppDa
Programming language
Python