Published July 23, 2025 | Version v3
Dataset Open

IR-NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules

  • 1. IBM Research Europe - Zurich

Description

# NMR and IR Spectra Dataset for Organic Molecules

This dataset contains computed NMR and IR spectroscopic data for patent-extracted organic molecules: IR spectra for 177,461 molecules and NMR data for 1,255 molecules. It is intended to support research in computational chemistry, spectroscopy, and machine learning.

## Contents

- `NMR_data.parquet`: Nuclear Magnetic Resonance (NMR) data for 1,255 unique organic molecules.
- `IR_data_chunkXXX_of_009.parquet`: Infrared (IR) spectra for 177,461 unique molecules, split into 9 files for easier handling.

---

## NMR Dataset

**File**: `NMR_data.parquet`

This file contains dynamic NMR properties computed from molecular dynamics simulations. For each of the 1,255 molecules, NMR shifts were calculated over several MD frames to capture thermal effects and conformational flexibility.

Each record includes:

- `id`: Unique molecule identifier
- `smiles`: SMILES string representing the molecule
- `atoms`: List of atomic symbols
- `xyz`: Cartesian coordinates of the atoms
- `frames`: Dictionary of frame-specific data including:
  - `c_nmr_peaks_ave`, `h_nmr_peaks_ave`: Averaged chemical shifts
  - `c_nmr_peaks_unsorted`, `h_nmr_peaks_unsorted`: Raw per-atom shifts
  - `nmr_cpmd_text`: CPMD-calculated shielding data with TMS referencing
  - Frame-specific atomic positions
- `averaged_frames`: Averaged shifts over all frames
- `h_nmr_std`, `c_nmr_std`: Standard deviations across frames
- Additional metadata like group indices for chemically equivalent atoms

This dataset enables training and validation of machine learning models that aim to predict NMR shifts based on 3D structure and thermal motion.
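As an illustration of how per-frame shifts relate to the averaged values and standard deviations listed above, here is a minimal sketch on plain Python lists. The function name `summarize_shifts`, the input layout (one list of per-atom shifts per frame), and the use of a population standard deviation are assumptions made for this example; the exact nesting of the `frames` field and the averaging procedure used for the dataset are described in the preprint, not here.

```python
from statistics import mean, pstdev


def summarize_shifts(per_frame_shifts, groups=None):
    """Average per-atom shifts (ppm) over MD frames.

    per_frame_shifts: list of frames, each a list of per-atom shifts.
    groups: optional list of atom-index lists marking chemically
            equivalent atoms (hypothetical layout for this sketch).
    Returns (means, stds), one entry per atom or per group.
    """
    n_atoms = len(per_frame_shifts[0])
    if any(len(frame) != n_atoms for frame in per_frame_shifts):
        raise ValueError("every frame must list the same number of atoms")
    if groups is None:
        groups = [[i] for i in range(n_atoms)]

    means, stds = [], []
    for members in groups:
        # One value per frame: the mean shift over the equivalent atoms.
        per_frame = [mean(frame[i] for i in members) for frame in per_frame_shifts]
        means.append(mean(per_frame))
        # Assumption: population standard deviation across frames.
        stds.append(pstdev(per_frame))
    return means, stds
```

Calling `summarize_shifts(frames)` without `groups` gives one mean and one standard deviation per atom; passing group indices merges chemically equivalent atoms into a single peak.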

---

## IR Spectra Dataset

**Files**:  
`IR_data_chunk001_of_009.parquet` to `IR_data_chunk009_of_009.parquet`

Because of the dataset's size, the IR spectra are split across 9 Parquet files. The first 8 files contain 20,000 molecules each; the final chunk contains the remaining 17,461.
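The chunk layout can be written down programmatically, which is handy for sanity-checking a download. The sketch below only restates the numbers given in this description; the helper names `ir_chunk_names` and `expected_chunk_sizes` are made up for illustration.

```python
N_CHUNKS = 9
TOTAL_MOLECULES = 177_461
FULL_CHUNK_SIZE = 20_000


def ir_chunk_names(n_chunks=N_CHUNKS):
    """File names following the IR_data_chunkXXX_of_009.parquet pattern."""
    return [
        f"IR_data_chunk{i:03d}_of_{n_chunks:03d}.parquet"
        for i in range(1, n_chunks + 1)
    ]


def expected_chunk_sizes(total=TOTAL_MOLECULES, full=FULL_CHUNK_SIZE, n_chunks=N_CHUNKS):
    """All chunks but the last are full; the last holds the remainder."""
    sizes = [full] * (n_chunks - 1)
    sizes.append(total - sum(sizes))
    return sizes
```

Comparing `len(pd.read_parquet(name))` against the matching entry of `expected_chunk_sizes()` is one way to confirm that each file was downloaded completely.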

Each molecule includes:

- `id`: Unique molecule identifier
- `smiles`: SMILES string representation
- `type`: Atom types (as integer codes: B=0, Br=1, C=2, Cl=3, F=4, H=5, I=6, N=7, O=8, P=9, S=10, Si=11)
- `Frequency(cm^-1)`: IR frequency axis (up to 4003.17 cm⁻¹, 0.3336 cm⁻¹ resolution)
- `ir_spectra`: Quantum-corrected IR intensity values
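
The integer codes in `type` can be mapped back to element symbols with a small lookup table. The mapping below is copied from the list above; the names `ELEMENT_BY_CODE` and `decode_atom_types` are illustrative only.

```python
# Integer code -> element symbol, as documented for the `type` column.
ELEMENT_BY_CODE = {
    0: "B", 1: "Br", 2: "C", 3: "Cl", 4: "F", 5: "H",
    6: "I", 7: "N", 8: "O", 9: "P", 10: "S", 11: "Si",
}


def decode_atom_types(codes):
    """Translate a sequence of integer atom-type codes into symbols."""
    return [ELEMENT_BY_CODE[int(code)] for code in codes]
```

For a loaded row, `decode_atom_types(row["type"])` would return the element symbols in the stored atom order, assuming `type` holds one integer per atom.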

### IR Calculation Details

- Dipole moment trajectories were sampled every 2.5 fs over 100 ps (40,001 time steps).
- The IR spectra were computed via Fourier transform of the autocorrelation of the dipole signal, using a Blackman window to reduce spectral leakage.
- Quantum correction factors were applied to approximate spectra at 300 K, following Gaigeot and Sprik.

These spectra are suitable for spectral analysis, machine learning applications, and comparison to experimental data.
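
To make the processing steps above concrete, the following is a rough NumPy sketch of a dipole-autocorrelation spectrum with a Blackman window. It is not the authors' code: the function names, the unbiased autocorrelation estimate, and the use of the decaying half of the window are assumptions, and both the frequency-dependent prefactor and the quantum correction are deliberately left out. It does reproduce the stated frequency spacing: 40,001 samples at 2.5 fs give roughly 0.3336 cm⁻¹ per point.

```python
import numpy as np

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def wavenumber_axis(n_steps, dt_fs):
    """Wavenumber grid (cm^-1) for n_steps samples taken every dt_fs."""
    return np.fft.rfftfreq(n_steps, d=dt_fs * 1e-15) / C_CM_PER_S


def ir_spectrum_sketch(dipole, dt_fs):
    """Unnormalised line shape from a dipole trajectory of shape (n_steps, 3)."""
    dipole = np.asarray(dipole, dtype=float)
    n = dipole.shape[0]
    fluct = dipole - dipole.mean(axis=0)

    # Autocorrelation via FFT (Wiener-Khinchin), zero-padded to length 2n
    # so that the circular correlation does not wrap around.
    power = np.abs(np.fft.rfft(fluct, n=2 * n, axis=0)) ** 2
    acf = np.fft.irfft(power, n=2 * n, axis=0)[:n].sum(axis=1)
    acf /= np.arange(n, 0, -1)  # number of overlapping samples per lag

    # Decaying half of a Blackman window: weight 1 at lag 0, ~0 at the end.
    window = np.blackman(2 * n - 1)[n - 1:]
    windowed = acf * window

    # Transform of the symmetric (even) extension of the windowed ACF.
    spectrum = 2.0 * np.fft.rfft(windowed).real - windowed[0]
    return wavenumber_axis(n, dt_fs), spectrum
```

Feeding in a synthetic dipole that oscillates at a single frequency should give one peak at the corresponding wavenumber, which is a quick way to check the unit conversion.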

---

## Format & Usage

All files are provided in Apache Parquet format for efficient access with Python (`pandas`, `pyarrow`) or other data processing tools.

Example loading in Python:

```python
import pandas as pd

df_nmr = pd.read_parquet("NMR_data.parquet")
df_ir = pd.read_parquet("IR_data_chunk001_of_009.parquet")
```

To load all IR data chunk files into a single dataframe:

```python
import pandas as pd

# List of files to read
files = [
    "./IR_data_chunk001_of_009.parquet",
    "./IR_data_chunk002_of_009.parquet",
    "./IR_data_chunk003_of_009.parquet",
    "./IR_data_chunk004_of_009.parquet",
    "./IR_data_chunk005_of_009.parquet",
    "./IR_data_chunk006_of_009.parquet",
    "./IR_data_chunk007_of_009.parquet",
    "./IR_data_chunk008_of_009.parquet",
    "./IR_data_chunk009_of_009.parquet",
]

# Read every chunk and stack them into one dataframe with a fresh index
df_ir = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
```

## Citation and Further Information

For a detailed description of the dataset, including computational methods, dataset generation protocols, and benchmark analyses, please refer to our accompanying preprint:

Title: IR–NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules

ChemRxiv link: https://chemrxiv.org/engage/chemrxiv/article-details/684f1f86c1cb1ecda0230ceb

ChemRxiv DOI: 10.26434/chemrxiv-2025-l0zqz


The publication includes complete details about the contents of both the NMR and IR datasets, their intended use cases, and guidance for applying them in computational spectroscopy and machine learning research.

Please cite this preprint if you use the dataset in your work.

Files

ERROR_PRED_TRUE_DIPOLE.zip

Files (8.1 GB)

Checksum Size
md5:5b228a7268ce05e876e38e7e15e1fc02 59.4 MB
md5:1b9982e8901f37b4fd1b6b6b4ffba0b9 901.0 MB
md5:e484e367f087edf5251fe6afa9f9187f 899.5 MB
md5:ba7aad7aa33c79e629bae90822a1b78f 899.5 MB
md5:962961978c9c41bb4395c85081e8f308 899.5 MB
md5:7b72d5de9ab6ba75266fb75f10b9f5f5 899.5 MB
md5:10e69c8b0986cd2c95d01b137045654d 899.5 MB
md5:a71d9deb634f2249df4611f1c67d7ecf 899.5 MB
md5:748785e049db8a03d4b4df5f260cc1c6 899.6 MB
md5:193faad55b945885b0e7740a9b2ab257 785.5 MB
md5:5ba7fa1cfa1ea03a8c46dab8ed3c3501 23.4 MB
md5:a8c3d090bdecfb18c0b28782b179cb30 3.1 kB
md5:0e7eef2c9f439f5afee8704c00df7ebd 8.6 MB
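
A downloaded file can be checked against its listed MD5 checksum with Python's standard library. The sketch below is generic; the helper names `md5_of_file` and `matches_checksum` are illustrative, and which checksum belongs to which file has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, block_size=1 << 20):
    """Hex MD5 digest of a file, read in blocks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex
```

A call such as `matches_checksum("NMR_data.parquet", "md5:<checksum from the list>")` returns `True` only if the local copy is byte-identical to the published file.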

Additional details

Identifiers

Other
https://chemrxiv.org/engage/chemrxiv/article-details/684f1f86c1cb1ecda0230ceb

Related works

Is described by
Publication: 10.26434/chemrxiv-2025-l0zqz (DOI)

Funding

Swiss National Science Foundation
NCCR Catalysis (phase I) 180544
Swiss National Science Foundation
NCCR Catalysis (phase II) 225147

Dates

Updated
2025-07-01
Adding ERROR_PRED_TRUE_DIPOLE.zip to compare predicted dipoles