IR-NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules

Zipoli, Federico; Alberts, Marvin; Laino, Teodoro

doi:10.5281/zenodo.16417648

Published July 23, 2025 | Version v3

Dataset Open

Affiliation: IBM Research Europe - Zurich

# NMR and IR Spectra Dataset for Organic Molecules

This dataset contains computed NMR and IR spectroscopic data for a wide range of organic molecules. It is intended to support research in computational chemistry, spectroscopy, and machine learning.

## Contents

- `NMR_data.parquet`: Nuclear Magnetic Resonance (NMR) data for 1,255 unique organic molecules.
- `IR_data_chunkXXX_of_009.parquet`: Infrared (IR) spectra for 177,461 unique molecules, split into 9 files for easier handling.

---

## NMR Dataset

**File**: `NMR_data.parquet`

This file contains dynamic NMR properties computed from molecular dynamics simulations. For each of the 1,255 molecules, NMR shifts were calculated over several MD frames to capture thermal effects and conformational flexibility.

Each record includes:

- `id`: Unique molecule identifier
- `smiles`: SMILES string representing the molecule
- `atoms`: List of atomic symbols
- `xyz`: Cartesian coordinates of the atoms
- `frames`: Dictionary of frame-specific data including:
  - `c_nmr_peaks_ave`, `h_nmr_peaks_ave`: Averaged chemical shifts
  - `c_nmr_peaks_unsorted`, `h_nmr_peaks_unsorted`: Raw per-atom shifts
  - `nmr_cpmd_text`: CPMD-calculated shielding data with TMS referencing
  - Frame-specific atomic positions
- `averaged_frames`: Averaged shifts over all frames
- `h_nmr_std`, `c_nmr_std`: Standard deviations across frames
- Additional metadata like group indices for chemically equivalent atoms

This dataset enables training and validation of machine learning models that aim to predict NMR shifts based on 3D structure and thermal motion.
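The relationship between per-frame shifts, the averaged shifts, and their standard deviations can be illustrated with a minimal sketch. The input values are toy numbers rather than dataset entries, the function name `average_over_frames` is hypothetical, and the choice of a population standard deviation is an assumption, since the README does not state which estimator was used.

```python
from statistics import fmean, pstdev


def average_over_frames(per_frame_shifts):
    """Return per-atom mean and standard deviation across MD frames.

    per_frame_shifts: list of frames, each a list of per-atom shifts (ppm).
    """
    per_atom = list(zip(*per_frame_shifts))  # transpose to (atom, frame)
    means = [fmean(values) for values in per_atom]
    # Population standard deviation: an assumption, not confirmed by the README.
    stds = [pstdev(values) for values in per_atom]
    return means, stds


# Toy 1H shifts for 3 frames and 2 atoms (hypothetical values).
toy_frames = [[1.0, 7.2], [1.2, 7.0], [1.1, 7.1]]
means, stds = average_over_frames(toy_frames)
print(means, stds)
```

Comparing such recomputed values against the stored `averaged_frames`, `h_nmr_std`, and `c_nmr_std` fields is one way to check how the stored statistics were defined.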

---

## IR Spectra Dataset

**Files**:
`IR_data_chunk001_of_009.parquet` to `IR_data_chunk009_of_009.parquet`

Because the dataset is large, the IR spectra are split across 9 Parquet files. The first 8 files contain 20,000 molecules each; the final chunk contains the remaining 17,461.

Each molecule includes:

- `id`: Unique molecule identifier
- `smiles`: SMILES string representation
- `type`: Atom types (as integer codes: B=0, Br=1, C=2, Cl=3, F=4, H=5, I=6, N=7, O=8, P=9, S=10, Si=11)
- `Frequency(cm^-1)`: IR frequency axis (up to 4003.17 cm⁻¹, 0.3336 cm⁻¹ resolution)
- `ir_spectra`: Quantum-corrected IR intensity values

### IR Calculation Details

- Dipole moment trajectories were sampled every 2.5 fs over 100 ps (40,001 time steps).
- The IR spectra were computed via Fourier transform of the autocorrelation of the dipole signal, using a Blackman window to reduce spectral leakage.
- Quantum correction factors were applied to approximate spectra at 300 K, following Gaigeot and Sprik.
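The quoted frequency resolution is consistent with the sampling parameters above. As a short worked calculation, assuming the spacing of the frequency axis is the inverse of the total trajectory length, converted to wavenumbers:

```python
SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10  # exact SI value expressed in cm/s

timestep_s = 2.5e-15  # 2.5 fs sampling interval
n_steps = 40_001      # number of sampled time steps

# 40,001 samples span 40,000 intervals, i.e. 100 ps.
trajectory_s = (n_steps - 1) * timestep_s

# Frequency spacing 1/T, converted from Hz to cm^-1.
resolution_cm1 = 1.0 / (SPEED_OF_LIGHT_CM_PER_S * trajectory_s)

# Highest frequency representable at this sampling rate (Nyquist limit).
nyquist_cm1 = 1.0 / (2.0 * SPEED_OF_LIGHT_CM_PER_S * timestep_s)

print(f"resolution: {resolution_cm1:.4f} cm^-1")
print(f"Nyquist limit: {nyquist_cm1:.1f} cm^-1")
```

The resolution rounds to the 0.3336 cm⁻¹ stated for the `Frequency(cm^-1)` axis, and the Nyquist limit lies well above the 4003.17 cm⁻¹ upper bound of the stored axis.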

These spectra are suitable for spectral analysis, machine learning applications, and comparison to experimental data.
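The general idea of obtaining a spectrum from a dipole time series (autocorrelation, window, Fourier transform) can be sketched in pure Python on a toy signal. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the window is applied to the autocorrelation function, the signal is a scalar rather than a dipole vector, and the quantum correction and all physical prefactors are omitted.

```python
import math


def autocorrelation(signal, max_lag):
    """Unbiased autocorrelation estimate for lags 0 .. max_lag - 1."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + k] for i in range(n - k)) / (n - k)
        for k in range(max_lag)
    ]


def half_blackman(length):
    """Decaying half of a symmetric Blackman window (1 at lag 0, 0 at the last lag)."""
    return [
        0.42
        + 0.5 * math.cos(math.pi * k / (length - 1))
        + 0.08 * math.cos(2.0 * math.pi * k / (length - 1))
        for k in range(length)
    ]


def spectrum_from_signal(signal, max_lag):
    """Cosine transform of the windowed autocorrelation function.

    Bin j corresponds to a frequency of j / (2 * max_lag) cycles per sample.
    """
    acf = autocorrelation(signal, max_lag)
    window = half_blackman(max_lag)
    tapered = [a * w for a, w in zip(acf, window)]
    spectrum = []
    for j in range(max_lag + 1):
        omega = math.pi * j / max_lag  # angular frequency per sample
        value = tapered[0] + 2.0 * sum(
            tapered[k] * math.cos(omega * k) for k in range(1, max_lag)
        )
        spectrum.append(value)
    return spectrum


# Toy "dipole" signal: a single oscillation with a period of 16 samples.
toy_signal = [math.cos(2.0 * math.pi * i / 16) for i in range(256)]
spectrum = spectrum_from_signal(toy_signal, max_lag=128)
peak_bin = spectrum.index(max(spectrum))
print(peak_bin)  # bin 16 corresponds to 16 / 256 = 1/16 cycles per sample
```

For trajectories of the size described here (40,001 steps), an FFT-based implementation would be far more practical than this direct summation.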

---

## Format & Usage

All files are provided in Apache Parquet format for efficient access with Python (`pandas`, `pyarrow`) or other data processing tools.

Example loading in Python:

    import pandas as pd

    df_nmr = pd.read_parquet("NMR_data.parquet")
    df_ir = pd.read_parquet("IR_data_chunk001_of_009.parquet")

To load all IR data chunk files into a single dataframe:

    import pandas as pd

    # List of files to read
    files = [
        "./IR_data_chunk001_of_009.parquet",
        "./IR_data_chunk002_of_009.parquet",
        "./IR_data_chunk003_of_009.parquet",
        "./IR_data_chunk004_of_009.parquet",
        "./IR_data_chunk005_of_009.parquet",
        "./IR_data_chunk006_of_009.parquet",
        "./IR_data_chunk007_of_009.parquet",
        "./IR_data_chunk008_of_009.parquet",
        "./IR_data_chunk009_of_009.parquet",
    ]

    # Read every chunk and stack them into one dataframe with a fresh index
    df_ir = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
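The chunk file names follow a regular pattern, so the list can also be generated rather than typed out. The sketch below additionally records the row counts implied by the chunking described above, assuming one row per molecule (the README does not state the row layout explicitly); the variable names are illustrative.

```python
N_CHUNKS = 9
MOLECULES_TOTAL = 177_461
MOLECULES_PER_FULL_CHUNK = 20_000

# File names such as "IR_data_chunk001_of_009.parquet".
chunk_files = [
    f"IR_data_chunk{i:03d}_of_{N_CHUNKS:03d}.parquet"
    for i in range(1, N_CHUNKS + 1)
]

# The first 8 chunks are full; the last one holds the remainder.
expected_rows = [MOLECULES_PER_FULL_CHUNK] * (N_CHUNKS - 1)
expected_rows.append(MOLECULES_TOTAL - sum(expected_rows))

for name, rows in zip(chunk_files, expected_rows):
    print(name, rows)
```

The resulting `chunk_files` list can be passed to the same `pd.concat` call, and `expected_rows` gives a simple sanity check on the length of each loaded chunk.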

## Citation and Further Information

For a detailed description of the dataset, including computational methods, dataset generation protocols, and benchmark analyses, please refer to our accompanying preprint:

Title: IR–NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules

Preprint (ChemRxiv): https://chemrxiv.org/engage/chemrxiv/article-details/684f1f86c1cb1ecda0230ceb

ChemRxiv DOI: 10.26434/chemrxiv-2025-l0zqz

The publication includes complete details about the contents of both the NMR and IR datasets, their intended use cases, and guidance for applying them in computational spectroscopy and machine learning research.

Please cite this preprint if you use the dataset in your work.

Files (8.1 GB)

| Name | MD5 | Size |
|---|---|---|
| ERROR_PRED_TRUE_DIPOLE.zip | 5b228a7268ce05e876e38e7e15e1fc02 | 59.4 MB |
| IR_data_chunk001_of_009.parquet | 1b9982e8901f37b4fd1b6b6b4ffba0b9 | 901.0 MB |
| IR_data_chunk002_of_009.parquet | e484e367f087edf5251fe6afa9f9187f | 899.5 MB |
| IR_data_chunk003_of_009.parquet | ba7aad7aa33c79e629bae90822a1b78f | 899.5 MB |
| IR_data_chunk004_of_009.parquet | 962961978c9c41bb4395c85081e8f308 | 899.5 MB |
| IR_data_chunk005_of_009.parquet | 7b72d5de9ab6ba75266fb75f10b9f5f5 | 899.5 MB |
| IR_data_chunk006_of_009.parquet | 10e69c8b0986cd2c95d01b137045654d | 899.5 MB |
| IR_data_chunk007_of_009.parquet | a71d9deb634f2249df4611f1c67d7ecf | 899.5 MB |
| IR_data_chunk008_of_009.parquet | 748785e049db8a03d4b4df5f260cc1c6 | 899.6 MB |
| IR_data_chunk009_of_009.parquet | 193faad55b945885b0e7740a9b2ab257 | 785.5 MB |
| NMR_data.parquet | 5ba7fa1cfa1ea03a8c46dab8ed3c3501 | 23.4 MB |
| README.md | a8c3d090bdecfb18c0b28782b179cb30 | 3.1 kB |
| scripts_ir_nmr_multimodal_comp_spectra_dataset.tar.gz | 0e7eef2c9f439f5afee8704c00df7ebd | 8.6 MB |
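The MD5 checksums listed above can be used to confirm that a download completed intact. A minimal sketch using Python's standard `hashlib` module follows; the helper names `md5_of_file` and `verify` are illustrative, and the file is read in blocks so that the roughly 900 MB chunks do not need to fit in memory at once.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (requires the downloaded file to be present locally):
# verify("NMR_data.parquet", "5ba7fa1cfa1ea03a8c46dab8ed3c3501")
```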

Additional details

Other: https://chemrxiv.org/engage/chemrxiv/article-details/684f1f86c1cb1ecda0230ceb

Is described by: Publication: 10.26434/chemrxiv-2025-l0zqz (DOI)

Funding:
Swiss National Science Foundation: NCCR Catalysis (phase I) 180544
Swiss National Science Foundation: NCCR Catalysis (phase II) 225147

Updated: 2025-07-01

Adding ERROR_PRED_TRUE_DIPOLE.zip to compare predicted dipoles

Resource type: Dataset

Publisher: Zenodo

License: Community Data License Agreement Permissive 2.0

Community Data License Agreement – Permissive – Version 2.0

This is the Community Data License Agreement – Permissive, Version 2.0 (the “agreement”). Data Provider(s) and Data Recipient(s) agree as follows:

1. Provision of the Data

1.1. A Data Recipient may use, modify, and share the Data made available by Data Provider(s) under this agreement if that Data Recipient follows the terms of this agreement.

1.2. This agreement does not impose any restriction on a Data Recipient’s use, modification, or sharing of any portions of the Data that are in the public domain or that may be used, modified, or shared under any other legal exception or limitation.

2. Conditions for Sharing Data

2.1. A Data Recipient may share Data, with or without modifications, so long as the Data Recipient makes available the text of this agreement with the shared Data.

3. No Restrictions on Results

3.1. This agreement does not impose any restriction or obligations with respect to the use, modification, or sharing of Results.

4. No Warranty; Limitation of Liability

4.1. All Data Recipients receive the Data subject to the following terms:

THE DATA IS PROVIDED ON AN “AS IS” BASIS, WITHOUT REPRESENTATIONS, WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

NO DATA PROVIDER SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING WITHOUT LIMITATION LOST PROFITS), HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE DATA OR RESULTS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

5. Definitions

5.1. “Data” means the material received by a Data Recipient under this agreement.

5.2. “Data Provider” means any person who is the source of Data provided under this agreement and in reliance on a Data Recipient’s agreement to its terms.

5.3. “Data Recipient” means any person who receives Data directly or indirectly from a Data Provider and agrees to the terms of this agreement.

5.4. “Results” means any outcome obtained by computational analysis of Data, including for example machine learning models and models’ insights.

Link: https://cdla.dev/permissive-2-0/

Technical metadata

Created: July 25, 2025
Modified: July 25, 2025
