Published October 7, 2025 | Version v2
Dataset Open

Datasets for MultiModalSpectralTransformer project

  • 1. AstraZeneca (Sweden)

Description

Data Repository for "Advancing Structure Elucidation with a Flexible Multi-Spectral AI Model"

IMPORTANT NOTE: This is part of a multi-part data repository. Due to the large size of the files, the complete dataset, models, and experimental results are split across three separate Zenodo uploads. To fully reproduce the findings of our publication, you must download all files from all three of the following links:

Together with the other two uploads, this repository provides the complete dataset, pre-trained models, and experimental results required to reproduce the findings presented in our publication, "Advancing Structure Elucidation with a Flexible Multi-Spectral AI Model." The source code is available separately on GitHub.

Contents

The downloaded files are split archives that, once combined and extracted, will create the following folders:

  • models/: Pre-trained model weights for MultiModalSpectralTransformer (MMST), SGNN, Mol2Mol, and ChemProp-IR networks.

  • data/: Complete training and validation datasets, including:

    • ZINC dataset (4M molecules with simulated spectra)

    • PubChem dataset (1.5M molecules with simulated spectra)

    • IBM Alberts dataset (650k molecules with simulated spectra)

  • past_experiments/: Complete experimental validation results, benchmarks, and reproducibility data.


Note: To reproduce the reviewer-requested global impact analysis, you must also download the Reviewer_Experiment_Global_Impact.zip file separately and extract its contents into the past_experiments/ChemXriv/ directory after completing the main setup.
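
As a rough illustration of the note above, the following Python sketch unpacks the reviewer archive into past_experiments/ChemXriv/ under the cloned repository. The function name and the repo_root argument are assumptions made for this example, and the internal layout of the zip file is not described here, so check the extracted contents afterwards.

```python
import zipfile
from pathlib import Path


def extract_reviewer_experiment(zip_path: Path, repo_root: Path) -> Path:
    """Unpack the reviewer archive into past_experiments/ChemXriv/ (hypothetical helper)."""
    target = repo_root / "past_experiments" / "ChemXriv"
    # Create the directory if the main setup has not produced it yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

A call such as extract_reviewer_experiment(Path("Reviewer_Experiment_Global_Impact.zip"), Path("MultiModalSpectralTransformer")) would then place the files where the note asks for them, assuming the repository was cloned under its default directory name.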

Setup Instructions

  1. Clone the GitHub repository:
    git clone https://github.com/mpriessner/MultiModalSpectralTransformer.git

  2. Download all compressed files: Make sure to download every .partXX file from all three Zenodo links provided above. Place all of them together in the same directory.

  3. Combine and Extract: Once all parts are downloaded, extract each split .tar.xz archive (for example with tar on Linux/macOS or 7-Zip on Windows). Tools with split-archive support only need to be pointed at the first part of each archive (e.g., data.tar.xz.partaa) and will find and combine the remaining parts automatically. If your tool does not (plain tar, for instance, reads only the file it is given), concatenate the parts in order first, e.g. cat data.tar.xz.part* | tar -xJf -

  4. Organize Folders: Move the extracted folders (models, data, and past_experiments) directly into the main repository directory you cloned in step 1.
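
Where a tool does not join the parts on its own, step 3 can also be done by hand. The Python sketch below concatenates the parts in name order and extracts the result using only the standard library. It assumes the parts share the naming pattern of the example (data.tar.xz.partaa, data.tar.xz.partab, ...); the helper names are hypothetical, and the combined file temporarily needs as much free disk space as the parts themselves.

```python
import shutil
import tarfile
from pathlib import Path


def combine_parts(first_part: Path, combined: Path) -> list:
    """Concatenate all sibling parts of `first_part` into `combined`, in name order."""
    stem = first_part.name.rsplit(".part", 1)[0]  # e.g. "data.tar.xz"
    parts = sorted(first_part.parent.glob(stem + ".part*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {first_part}")
    with combined.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return parts


def extract_archive(combined: Path, dest: Path) -> None:
    """Extract a combined .tar.xz archive into `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(combined, mode="r:xz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
```

For example, combine_parts(Path("data.tar.xz.partaa"), Path("data.tar.xz")) followed by extract_archive(Path("data.tar.xz"), Path(".")) would unpack the data archive in the current directory; the same two calls apply to each of the other archives.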

After extraction, your repository structure should contain the necessary models/, data/, and past_experiments/ folders with all the files required for running the notebooks and reproducing the results.
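
A quick way to confirm this layout before opening the notebooks is sketched below. It only checks that the three top-level folders exist in the cloned repository and does not validate their contents; the function name is an assumption made for this example.

```python
from pathlib import Path

# Top-level folders named in the setup instructions.
EXPECTED_FOLDERS = ("models", "data", "past_experiments")


def missing_folders(repo_root: Path) -> list:
    """Return the expected folders that are not present under `repo_root`."""
    return [name for name in EXPECTED_FOLDERS if not (repo_root / name).is_dir()]
```

An empty list from missing_folders(Path("MultiModalSpectralTransformer")) suggests that step 4 was completed; any returned names point to folders that still need to be moved into place.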

For implementation details, usage instructions, and the complete source code, please refer to our GitHub repository: https://github.com/mpriessner/MultiModalSpectralTransformer


Files

Reviewer_Experiment_Global_Impact.zip

Files (16.3 GB)

Checksum  Size
md5:daf2620b2cbbc09660e9475adddd5c3a  1.0 GB
md5:6e15844777237e9e50dec7d5f7364a15  1.0 GB
md5:0324a44a07bd18042e86fdd6487aa96a  1.0 GB
md5:221a096ed5ea0f04a08bb0f5d4a0b237  1.0 GB
md5:cc01a9b3cb16807ae5c13b4352b2f5dc  1.0 GB
md5:30bd08655011e7b3e7524bdb6eb4fbf6  1.0 GB
md5:e20b62bd2b071dbe726ea365b2228b27  1.0 GB
md5:8462bdedbef509665704763271b338c3  1.0 GB
md5:5eb512b83d6950fc1f97f9039f5d6947  1.0 GB
md5:48ba98a007c92807db0a0ad1d3fc5888  1.0 GB
md5:73a8b72dad5fec5bfde99039c9ae8f7c  1.0 GB
md5:c028e824bc6f0ee93963ed6cf414175d  1.0 GB
md5:cbf805bcb6a3e965231605712e8e463c  1.0 GB
md5:267696118651cc3953c161618fbc65a2  883.5 MB
md5:db7c5af08a0076df53c6541859d77ee7  1.8 GB
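
Because each file is listed with an MD5 checksum, a download can be checked before extraction. The Python sketch below hashes a local file in chunks and compares the result with a listed value. The function names are hypothetical, and since the listing does not show file names, you need to match each checksum to the corresponding file yourself.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: Path, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:<32 hex digits>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A mismatch usually indicates an incomplete or corrupted download, in which case the affected file should be downloaded again before the parts are combined.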

Additional details

Dates

Available
2025-07-21