Published January 18, 2025 | Version v.1
Computational notebook Open

Boosting Predictability: Towards Rapid Estimation of Organic Molecule Solubility

  • 1. Aalto University

Contributors

Contact person:

Work package leader:

  • 1. Aalto University
  • 2. Aalto University
  • 3. University of Turku
  • 4. Loughborough University

Description

Machine Learning for Predicting Solubility

The water solubility of organic molecules is critical for optimizing the performance and stability of aqueous flow batteries, as well as for various other applications. While solubility measurements are relatively straightforward in some cases, theoretical prediction remains a significant challenge. Over the past decade, machine learning algorithms have become invaluable tools for addressing it. High-quality data and effective descriptors are essential for building reliable, data-driven estimation models. This repository systematically investigates how enhanced structure-based descriptors and an outlier detection procedure affect the predictability of aqueous solubility.

 

Installation

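  1. Clone the repository over SSH:
    git clone git@github.com:sahashemip/ML4OrganicMoleculeSolubility.git
     or, without a GitHub SSH key, over HTTPS:
    git clone https://github.com/sahashemip/ML4OrganicMoleculeSolubility.git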
  2. Navigate to the project directory:
    cd ML4OrganicMoleculeSolubility
     
  3. Install the required dependencies:
    pip install -r requirements.txt

How to Use

  1. Navigate to the notebooks directory:

    • Open and run the Jupyter notebooks sequentially, following their numbering:
      1. analysis
      2. descriptors
      3. ml-models
      4. outlier-detection

  2. Outlier detection:

    • To perform outlier detection, modify the parameters in the outlier_detector.py script. Refer to TABLE I of the associated manuscript for parameter details.
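The notebook order in step 1 can be recovered programmatically. The following is a minimal sketch, assuming the notebook filenames begin with their sequence number (for example, a hypothetical `1-analysis.ipynb`); the actual filenames in the repository may differ, and `ordered_notebooks` is an illustrative helper, not part of the project.

```python
import re
from pathlib import Path


def ordered_notebooks(directory):
    """Return the .ipynb files in `directory`, sorted by leading number.

    Assumption: each notebook name starts with an integer prefix.
    Files without such a prefix are skipped.
    """
    prefix = re.compile(r"^(\d+)")
    numbered = []
    for path in Path(directory).glob("*.ipynb"):
        match = prefix.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    # Sort numerically so that 10 comes after 2, not before it.
    numbered.sort(key=lambda item: (item[0], item[1].name))
    return [path for _, path in numbered]
```

Calling `ordered_notebooks("notebooks")` from the project root would then list the notebooks in the order they are meant to be run. Sorting on the integer value, rather than on the filename string, keeps the order correct if the numbering ever reaches two digits.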

Project Structure

  • notebooks/: Contains step-by-step Jupyter notebooks for analysis, feature engineering, and model development.
  • scripts/: Includes Python scripts for outlier detection and custom preprocessing utilities.
  • datasets/: Holds the datasets generated with the different descriptors.
  • outliers/: Stores outputs related to the detected outliers.

Files

ML4OrganicMoleculeSolubility-main.zip

Files (11.8 MB)

Name: ML4OrganicMoleculeSolubility-main.zip
Size: 11.8 MB
md5:c31df67ca61fefc38c620b04c93d5011
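The MD5 checksum listed above can be used to confirm that the archive downloaded intact. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the local path of the downloaded file is an assumption.

```python
import hashlib

# Checksum published on this record for ML4OrganicMoleculeSolubility-main.zip
EXPECTED_MD5 = "c31df67ca61fefc38c620b04c93d5011"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_published_checksum("ML4OrganicMoleculeSolubility-main.zip")` should return `True` for an intact download placed in the current directory. MD5 is suitable here only as an integrity check against transfer errors, not as a security guarantee.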

Additional details

Related works

Is supplement to
Publication: 10.26434/chemrxiv-2025-4111w (DOI)

Dates

Submitted
2025-01-18

Software

Repository URL
https://github.com/sahashemip/ML4OrganicMoleculeSolubility
Programming language
Python