ATMOMACCS: An interpretable molecular descriptor for machine learning predictions in atmospheric science
Authors/Creators
- 1. Aalto University, Department of Applied Physics
Description
ATMOMACCS-KRR
ATMOMACCS-KRR is a research project exploring molecular descriptor construction and Kernel Ridge Regression (KRR) modeling for predicting thermodynamic and physicochemical properties of atmospheric organic molecules.
The project introduces the ATMOMACCS descriptors (versions 1–5), which extend the traditional MACCS fingerprint to better represent structural features relevant to atmospheric chemistry.
Project outline
Main Script:
- main.py: End-to-end pipeline that generates descriptors, trains a KRR model, and produces predictions.
Descriptor Construction:
- ATMOMACCS v1–v4 (ATMOMACCS.py): Implements versions v1–v4 of the ATMOMACCS descriptors.
- ATMOMACCS v5 (ATMOMACCS_no_binary.py): Implements version v5 of the ATMOMACCS descriptors.
Descriptor Generation Scripts:
- generate_ATMOMACCS.py: Generates ATMOMACCS descriptors for a dataset.
- generate_MACCS.py: Generates standard MACCS keys as baseline descriptors.
- generate_optimal_topfp.py: Generates optimized topological fingerprints.
- generate_topological.py: Generates topological fingerprints.
Results Processing:
- process_results.py: Aggregates results across multiple random seeds and produces mean and standard deviation CSV summaries for each descriptor/target combination.
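The aggregation step described for process_results.py can be illustrated with a minimal sketch. This is not the project's actual implementation: the record layout, the `mae` metric name, the function names, and all numbers are made up for illustration, and the sample standard deviation is an assumption (the script may use a different estimator).

```python
import csv
import io
import statistics
from collections import defaultdict


def aggregate_over_seeds(records):
    """Group per-seed scores by (descriptor, target) and summarise them."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["descriptor"], rec["target"])].append(rec["mae"])

    summary = {}
    for key, scores in grouped.items():
        summary[key] = {
            "mean": statistics.mean(scores),
            # Sample standard deviation (assumption); needs at least two seeds.
            "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "n_seeds": len(scores),
        }
    return summary


def to_csv(summary):
    """Render the summary as CSV text, one row per descriptor/target pair."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["descriptor", "target", "mean", "std", "n_seeds"])
    for (descriptor, target), stats in sorted(summary.items()):
        writer.writerow([descriptor, target,
                         f"{stats['mean']:.4f}", f"{stats['std']:.4f}",
                         stats["n_seeds"]])
    return buf.getvalue()


# Hypothetical per-seed results; the values are illustrative only.
records = [
    {"descriptor": "ATMOMACCS_v4", "target": "log_p_sat", "seed": 1, "mae": 0.50},
    {"descriptor": "ATMOMACCS_v4", "target": "log_p_sat", "seed": 2, "mae": 0.60},
    {"descriptor": "ATMOMACCS_v4", "target": "log_p_sat", "seed": 3, "mae": 0.70},
    {"descriptor": "MACCS", "target": "log_p_sat", "seed": 1, "mae": 0.80},
    {"descriptor": "MACCS", "target": "log_p_sat", "seed": 2, "mae": 1.00},
]

summary = aggregate_over_seeds(records)
print(to_csv(summary))
```

Keying the summary on the (descriptor, target) pair mirrors the "one summary per descriptor/target combination" behaviour described above.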
Getting Started
Example run with main.py:
python src/main.py -v 4 -d data/Wang -t log_p_sat.txt -ds Wang -s 2435
This will:
- Generate descriptors (ATMOMACCS v4 in this case)
- Train a KRR model on the chosen dataset and target
- Save predictions and performance metrics to the results folder
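To make the training step concrete, the following is a dependency-free sketch of Kernel Ridge Regression in its closed form: the dual coefficients solve (K + alpha·I)·c = y, and a prediction is a kernel-weighted sum over the training points. It is not the code in main.py. The Gaussian kernel, the hyperparameter values, the function names, and the toy fingerprints and targets are all assumptions chosen for illustration; in practice a library implementation such as scikit-learn's `KernelRidge` would normally be used.

```python
import math


def gaussian_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def krr_fit(X, y, alpha, gamma):
    """Return dual coefficients c solving (K + alpha * I) c = y."""
    n = len(X)
    K_reg = [[gaussian_kernel(X[i], X[j], gamma) + (alpha if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    return solve_linear_system(K_reg, y)


def krr_predict(X_train, coeffs, x_new, gamma):
    """Prediction is the kernel-weighted sum over training points."""
    return sum(c * gaussian_kernel(x_t, x_new, gamma)
               for c, x_t in zip(coeffs, X_train))


# Toy fingerprint-like vectors and targets (made up, not real data).
X_train = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_train = [-1.0, -2.0, -3.0, -0.5]
ALPHA, GAMMA = 1e-3, 0.5  # hypothetical hyperparameters

coeffs = krr_fit(X_train, y_train, ALPHA, GAMMA)
prediction = krr_predict(X_train, coeffs, [1, 0, 1], GAMMA)
print(prediction)
```

The regularisation strength `alpha` trades off fit against smoothness: as it approaches zero the model interpolates the training targets, and as it grows the predictions shrink toward zero.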
Requirements
This project was developed and tested with Python 3.12.0
Dependencies:
matplotlib==3.8.0
pandas==2.1.1
rdkit==2023.9.1
scikit-learn==1.3.2
scipy==1.11.3
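As a convenience, the pinned versions above can be compared against the active environment. The sketch below is not part of the repository; the helper name `find_mismatches` is hypothetical, and it assumes each package registers installation metadata under the name listed above (which may not hold for every installation method).

```python
from importlib import metadata

# Versions copied from the dependency list above.
PINNED = {
    "matplotlib": "3.8.0",
    "pandas": "2.1.1",
    "rdkit": "2023.9.1",
    "scikit-learn": "1.3.2",
    "scipy": "1.11.3",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means the environment matches the pinned versions exactly; any printed line identifies a package to install or re-pin.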
Licensing and Data Sources
The ATMOMACCS-KRR code (scripts, descriptor generation, and processing workflows) is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license unless otherwise specified.
This repository also redistributes and processes several benchmark datasets (Ferraz-Caetano, GeckoQ, Wang, Li, Kruger-confined, Kruger-broad), which remain under their original licenses. Modifications and preprocessing applied to the datasets are documented in this repository. The original data licenses must be respected if the datasets are reused, redistributed, or adapted. Full attribution and licensing information for each dataset is given below.
Ferraz-Caetano:
- License: MIT License https://github.com/jfcaetano/VOC-EnthVapML/blob/main/LICENSE
- Original data file: VOC-Database.csv
- Source: https://github.com/jfcaetano/VOC-EnthVapML/tree/main/Database
- Publication: José Ferraz-Caetano, Filipe Teixeira, M. Natália D.S. Cordeiro, Data-driven, explainable machine learning model for predicting volatile organic compounds’ standard vaporization enthalpy, Chemosphere, Volume 359, 142257 (2024)
DOI: https://doi.org/10.1016/j.chemosphere.2024.142257
GeckoQ:
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
- Original data file: Dataframe.csv
- Source: https://doi.org/10.23729/022475cc-e527-41a9-bbc0-0113923cf04c
- Publication: Vitus Besel, Milica Todorović, Theo Kurtén, Patrick Rinke & Hanna Vehkamäki, Atomic structures, conformers and thermodynamic properties of 32k atmospheric molecules, Scientific Data 10, 450 (2023)
DOI: https://doi.org/10.1038/s41597-023-02366-x
Wang:
- License: Creative Commons Attribution 3.0 Unported (CC BY 3.0)
- Original data file: Supplementary data (acp-17-7529-2017-supplement.zip)
- Source: https://acp.copernicus.org/articles/17/7529/2017/acp-17-7529-2017-supplement.zip
- Publication: Chen Wang, Tiange Yuan, Stephen A. Wood, Kai-Uwe Goss, Jingyi Li, Qi Ying, and Frank Wania, Uncertain Henry's law constants compromise equilibrium partitioning calculations of atmospheric oxidation products, Atmos. Chem. Phys., 17, 7529–7540 (2017)
DOI: https://doi.org/10.5194/acp-17-7529-2017
- Modifications: Documented in data_preprocessing.ipynb (formatting and restructuring of files)
Li:
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
- Original data file: Li et al. OA viscosity_Table S2.xls
- Publication: Ying Li, Douglas A. Day, Harald Stark, Jose L. Jimenez, and Manabu Shiraiwa, Predictions of the glass transition temperature and viscosity of organic aerosols from volatility distributions, Atmos. Chem. Phys., 20, 8103–8122 (2020)
DOI: https://doi.org/10.5194/acp-20-8103-2020
- Modifications: Documented in CAS_to_smiles.ipynb (formatting and restructuring of files)
Kruger-confined and Kruger-broad:
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
- Original data files: new_confined_data_training_info_noH.pickle, new_broad_data_training_info_noH.pickle
- Source: https://doi.org/10.17617/3.GIKHJL
- Publication: Matteo Krüger, Tommaso Galeazzo, Ivan Eremets, Bertil Schmidt, Ulrich Pöschl, Manabu Shiraiwa, and Thomas Berkemeier, Improved vapor pressure predictions using group contribution-assisted graph convolutional neural networks (GC2NN), Geosci. Model Dev., 18, 7357–7371 (2025)
DOI: https://doi.org/10.5194/gmd-18-7357-2025
Files
ATMOMACCS.zip
Total size: 141.8 MB
| Checksum | Size |
|---|---|
| md5:da954268e37f0d5b7d7831791a65e837 | 1.6 MB |
| md5:e8533469a9a0fdad3b8e55501b8d6c8b | 140.2 MB |
Additional details
Related works
- Is variant form of: Poster, 10.5194/egusphere-egu25-5719 (DOI)
Funding
- Research Council of Finland: Virtual laboratory for molecular level atmospheric transformations / Consortium: VILMA 346376
Dates
- Available: 2025-10-21 (Preprint)
Software
- Programming language: Python
References
- Ferraz-Caetano, J., Teixeira, F., Cordeiro, M.N.D.S., Data-driven, explainable machine learning model for predicting volatile organic compounds' standard vaporization enthalpy, Chemosphere, 359, 142257 (2024), https://doi.org/10.1016/j.chemosphere.2024.142257
- Besel, V., Todorović, M., Kurtén, T., Rinke, P., Vehkamäki, H., Atomic structures, conformers and thermodynamic properties of 32k atmospheric molecules, Sci. Data, 10, 450 (2023), https://doi.org/10.1038/s41597-023-02366-x
- Wang, C., Yuan, T., Wood, S.A., Goss, K.-U., Li, J., Ying, Q., Wania, F., Uncertain Henry's law constants compromise equilibrium partitioning calculations of atmospheric oxidation products, Atmos. Chem. Phys., 17, 7529–7540 (2017), https://doi.org/10.5194/acp-17-7529-2017
- Li, Y., Day, D.A., Stark, H., Jimenez, J.L., Shiraiwa, M., Predictions of the glass transition temperature and viscosity of organic aerosols from volatility distributions, Atmos. Chem. Phys., 20, 8103–8122 (2020), https://doi.org/10.5194/acp-20-8103-2020
- Krüger, M., Galeazzo, T., Eremets, I., Schmidt, B., Pöschl, U., Shiraiwa, M., Berkemeier, T., Improved vapor pressure predictions using group contribution-assisted graph convolutional neural networks (GC2NN), Geosci. Model Dev., 18, 7357–7371 (2025), https://doi.org/10.5194/gmd-18-7357-2025
- RDKit: Open-source cheminformatics. https://www.rdkit.org