Published February 28, 2026 | Version v3
Software Open

ATMOMACCS: An interpretable molecular descriptor for machine learning predictions in atmospheric science

  • 1. Aalto University, Department of Applied Physics

Contributors

  • 1. Technical University of Munich
  • 2. Munich Data Science Institute, Technical University of Munich, Atomistic Modelling Center
  • 3. Technical University of Munich, TUM School of Natural Sciences, Physics Department

Description

ATMOMACCS-KRR

ATMOMACCS-KRR is a research project exploring molecular descriptor construction and Kernel Ridge Regression (KRR) modeling for predicting thermodynamic and physicochemical properties of atmospheric organic molecules.

The project introduces the ATMOMACCS descriptors (versions 1–5), which extend the traditional MACCS fingerprint to better represent structural features relevant in atmospheric chemistry.
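For readers unfamiliar with KRR, the following is a minimal pure-Python sketch of the technique, not the project's implementation. The project lists scikit-learn as a dependency, but which estimator, kernel, and hyperparameters it uses are not stated here; the Gaussian kernel, the function names, and the toy binary feature vectors below are all assumptions chosen for illustration.

```python
import math


def gaussian_kernel(a, b, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for two equal-length feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def krr_fit(X, y, alpha, gamma):
    """Return dual weights w solving (K + alpha * I) w = y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += alpha  # ridge regularization on the diagonal
    return solve_linear_system(K, y)


def krr_predict(X_train, weights, x_new, gamma):
    """Prediction is a kernel-weighted sum over the training points."""
    return sum(w * gaussian_kernel(x_i, x_new, gamma)
               for w, x_i in zip(weights, X_train))


# Toy data: hypothetical 2-bit descriptors and made-up target values.
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1]]
y_toy = [0.0, 1.0, 1.0, 2.0]
w_toy = krr_fit(X_toy, y_toy, alpha=1e-8, gamma=1.0)
print([round(krr_predict(X_toy, w_toy, x, gamma=1.0), 4) for x in X_toy])
```

With a very small regularization strength the model nearly interpolates the training targets; increasing `alpha` trades that fit for smoother predictions.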

Project outline

Main Script:

  • main.py
    End-to-end pipeline: generates descriptors, trains a KRR model, and produces predictions.

Descriptor Construction:

  • ATMOMACCS v1–v4:
    ATMOMACCS.py
    Implements versions v1–v4 of ATMOMACCS descriptors.

  • ATMOMACCS v5:
    ATMOMACCS_no_binary.py
    Implements version v5 of ATMOMACCS.

Descriptor Generation Scripts:

  • generate_ATMOMACCS.py
    Generate ATMOMACCS descriptors for a dataset.

  • generate_MACCS.py
    Generate standard MACCS keys as baseline descriptors.

  • generate_optimal_topfp.py
    Generate optimized topological fingerprints.

  • generate_topological.py
    Generate topological fingerprints.

Results Processing:

  • process_results.py
    Aggregates results across multiple random seeds and produces mean and standard deviation CSV summaries for each descriptor/target combination.
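The aggregation step described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of process_results.py: the column names (`descriptor`, `target`, `seed`, `mae`), the function names, and the use of the sample standard deviation are hypothetical choices.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_runs(rows, metric="mae"):
    """Group per-seed rows by (descriptor, target) and compute mean and std."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["descriptor"], row["target"])].append(float(row[metric]))
    summary = []
    for (descriptor, target), values in sorted(grouped.items()):
        summary.append({
            "descriptor": descriptor,
            "target": target,
            "n_seeds": len(values),
            "mean": statistics.mean(values),
            # Sample standard deviation; undefined for one seed, so report 0.0.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        })
    return summary


def write_summary_csv(summary, handle):
    """Write the summary rows as CSV to any text file-like object."""
    fields = ["descriptor", "target", "n_seeds", "mean", "std"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(summary)


# Made-up per-seed results, purely for demonstration.
example_rows = [
    {"descriptor": "MACCS", "target": "log_p_sat", "seed": "1", "mae": "1.0"},
    {"descriptor": "MACCS", "target": "log_p_sat", "seed": "2", "mae": "2.0"},
    {"descriptor": "MACCS", "target": "log_p_sat", "seed": "3", "mae": "3.0"},
    {"descriptor": "ATMOMACCS", "target": "log_p_sat", "seed": "1", "mae": "0.5"},
]
buffer = io.StringIO()
write_summary_csv(summarize_runs(example_rows), buffer)
print(buffer.getvalue())
```

Reporting the number of seeds next to the mean and standard deviation makes it easy to spot combinations where some runs are missing.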

Getting Started

Example run with main.py:

python src/main.py -v 4 -d data/Wang -t log_p_sat.txt -ds Wang -s 2435

This will:

  • Generate descriptors (ATMOMACCS v4 in this case)

  • Train a KRR model on the chosen dataset and target

  • Save predictions and performance metrics to the results folder
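The short flags in the example command are taken from this README, but their meanings are not documented here. The sketch below shows one plausible way such an interface could be declared with argparse; the long option names, help texts, and the interpretation of each flag (descriptor version, data directory, target file, dataset name, random seed) are assumptions inferred from the example, not the actual main.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the short flags of the example command."""
    parser = argparse.ArgumentParser(
        description="Generate descriptors, train a KRR model, save predictions."
    )
    parser.add_argument("-v", "--descriptor-version", type=int,
                        choices=range(1, 6), required=True,
                        help="ATMOMACCS version (1-5); assumed meaning of -v")
    parser.add_argument("-d", "--data-dir", required=True,
                        help="dataset directory, e.g. data/Wang; assumed meaning of -d")
    parser.add_argument("-t", "--target-file", required=True,
                        help="file with target values; assumed meaning of -t")
    parser.add_argument("-ds", "--dataset", required=True,
                        help="dataset name; assumed meaning of -ds")
    parser.add_argument("-s", "--seed", type=int, default=0,
                        help="random seed; assumed meaning of -s")
    return parser


# Parse the same argument list as the example command above.
example_argv = "-v 4 -d data/Wang -t log_p_sat.txt -ds Wang -s 2435".split()
args = build_parser().parse_args(example_argv)
print(args.descriptor_version, args.data_dir, args.target_file,
      args.dataset, args.seed)
```

Restricting the version to the documented range 1 to 5 makes a mistyped version fail early rather than partway through descriptor generation.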

Requirements

This project was developed and tested with Python 3.12.0.

Dependencies:

  • matplotlib==3.8.0

  • pandas==2.1.1

  • rdkit==2023.9.1

  • scikit-learn==1.3.2

  • scipy==1.11.3
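Because the dependencies are pinned to exact versions, it can help to check an environment against the pins before running anything. The helper below is a small sketch using importlib.metadata from the standard library; the function names are hypothetical and the script is not part of the repository.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "matplotlib": "3.8.0",
    "pandas": "2.1.1",
    "rdkit": "2023.9.1",
    "scikit-learn": "1.3.2",
    "scipy": "1.11.3",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for package, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.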

Licensing and Data Sources

The ATMOMACCS-KRR code (scripts, descriptor generation, and processing workflows) is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) unless otherwise specified. This repository also redistributes and processes several benchmark datasets (Ferraz-Caetano, GeckoQ, Wang, Li, Kruger-confined, Kruger-broad), which remain under their original licenses. Modifications and preprocessing applied to the datasets are documented in this repository. The original data licenses must be respected whenever the datasets are reused, redistributed, or adapted. Attribution and licensing information for each dataset is listed below.

Ferraz-Caetano:

GeckoQ:

Wang:

  • License: Creative Commons Attribution 3.0 Unported (CC BY 3.0)

  • Original data file: Supplementary data (acp-17-7529-2017-supplement.zip)

  • Source: https://acp.copernicus.org/articles/17/7529/2017/acp-17-7529-2017-supplement.zip

  • Publication: Chen Wang, Tiange Yuan, Stephen A. Wood, Kai-Uwe Goss, Jingyi Li, Qi Ying, and Frank Wania, Uncertain Henry's law constants compromise equilibrium partitioning calculations of atmospheric oxidation products, Atmos. Chem. Phys., 17, 7529–7540 (2017)
    DOI: https://doi.org/10.5194/acp-17-7529-2017

  • Modifications: Documented in data_preprocessing.ipynb (formatting and restructuring of files)

Li:

  • License: Creative Commons Attribution 4.0 International (CC BY 4.0)

  • Original data file: Li et al. OA viscosity_Table S2.xls

  • Source: https://doi.org/10.5194/acp-20-8103-2020-supplement

  • Publication: Ying Li, Douglas A. Day, Harald Stark, Jose L. Jimenez, and Manabu Shiraiwa, Predictions of the glass transition temperature and viscosity of organic aerosols from volatility distributions, Atmos. Chem. Phys., 20, 8103–8122 (2020)
    DOI: https://doi.org/10.5194/acp-20-8103-2020

  • Modifications: Documented in CAS_to_smiles.ipynb (formatting and restructuring of files)

Kruger-confined and Kruger-broad:

  • License: Creative Commons Attribution 4.0 International (CC BY 4.0)

  • Original data files:
    new_confined_data_training_info_noH.pickle,
    new_broad_data_training_info_noH.pickle

  • Source: https://doi.org/10.17617/3.GIKHJL

  • Publication: Matteo Krüger, Tommaso Galeazzo, Ivan Eremets, Bertil Schmidt, Ulrich Pöschl, Manabu Shiraiwa, and Thomas Berkemeier, Improved vapor pressure predictions using group contribution-assisted graph convolutional neural networks (GC2NN), Geosci. Model Dev., 18, 7357–7371 (2025)
    DOI: https://doi.org/10.5194/gmd-18-7357-2025

Files

ATMOMACCS.zip

Files (141.8 MB)

  • md5:da954268e37f0d5b7d7831791a65e837 (1.6 MB)
  • md5:e8533469a9a0fdad3b8e55501b8d6c8b (140.2 MB)

Additional details

Related works

Is variant form of
Poster: 10.5194/egusphere-egu25-5719 (DOI)

Funding

Research Council of Finland
Virtual laboratory for molecular level atmospheric transformations / Consortium: VILMA 346376

Dates

Available
2025-10-21
Preprint

Software

Programming language
Python

References

  • Ferraz-Caetano, J., Teixeira, F., Cordeiro, M.N.D.S., Data-driven, explainable machine learning model for predicting volatile organic compounds' standard vaporization enthalpy, Chemosphere, 359, 142257 (2024), https://doi.org/10.1016/j.chemosphere.2024.142257
  • Besel, V., Todorović, M., Kurtén, T., Rinke, P., Vehkamäki, H., Atomic structures, conformers and thermodynamic properties of 32k atmospheric molecules, Sci Data 10, 450 (2023), https://doi.org/10.1038/s41597-023-02366-x
  • Wang, C., Yuan, T., Wood, S.A., Goss, K.-U., Li, J., Ying, Q., Wania, F., Uncertain Henry's law constants compromise equilibrium partitioning calculations of atmospheric oxidation products, Atmos. Chem. Phys., 17, 7529–7540 (2017), https://doi.org/10.5194/acp-17-7529-2017
  • Li, Y., Day, D.A., Stark, H., Jimenez, J.L., Shiraiwa, M., Predictions of the glass transition temperature and viscosity of organic aerosols from volatility distributions, Atmos. Chem. Phys., 20, 8103–8122 (2020), https://doi.org/10.5194/acp-20-8103-2020
  • Krüger, M., Galeazzo, T., Eremets, I., Schmidt, B., Pöschl, U., Shiraiwa, M., Berkemeier, T., Improved vapor pressure predictions using group contribution-assisted graph convolutional neural networks (GC2NN), Geosci. Model Dev., 18, 7357–7371 (2025), https://doi.org/10.5194/gmd-18-7357-2025
  • RDKit: Open-source cheminformatics. https://www.rdkit.org