IR-NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules
Description
# NMR and IR Spectra Dataset for Organic Molecules
This dataset contains computed NMR and IR spectroscopic data for a wide range of organic molecules. It is intended to support research in computational chemistry, spectroscopy, and machine learning.
## Contents
- `NMR_data.parquet`: Nuclear Magnetic Resonance (NMR) data for 1,255 unique organic molecules.
- `IR_data_chunkXXX_of_009.parquet`: Infrared (IR) spectra for 177,461 unique molecules, split into 9 files for easier handling.
---
## NMR Dataset
**File**: `NMR_data.parquet`
This file contains dynamic NMR properties computed from molecular dynamics simulations. For each of the 1,255 molecules, NMR shifts were calculated over several MD frames to capture thermal effects and conformational flexibility.
Each record includes:
- `id`: Unique molecule identifier
- `smiles`: SMILES string representing the molecule
- `atoms`: List of atomic symbols
- `xyz`: Cartesian coordinates of the atoms
- `frames`: Dictionary of frame-specific data including:
  - `c_nmr_peaks_ave`, `h_nmr_peaks_ave`: Averaged chemical shifts
  - `c_nmr_peaks_unsorted`, `h_nmr_peaks_unsorted`: Raw per-atom shifts
  - `nmr_cpmd_text`: CPMD-calculated shielding data with TMS referencing
  - Frame-specific atomic positions
- `averaged_frames`: Averaged shifts over all frames
- `h_nmr_std`, `c_nmr_std`: Standard deviations across frames
- Additional metadata like group indices for chemically equivalent atoms
This dataset enables training and validation of machine learning models that aim to predict NMR shifts based on 3D structure and thermal motion.
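As a rough illustration of how the frame-averaged fields relate to the per-frame shifts, the sketch below averages per-atom shifts over MD frames, takes the standard deviation across frames, and then averages within groups of chemically equivalent atoms. The function name `summarize_shifts`, the plain-array inputs, the example values, and the choice of the population standard deviation are assumptions made for illustration; the actual nesting inside `frames` and `averaged_frames` should be checked against the file itself.

```python
import numpy as np


def summarize_shifts(per_frame_shifts, groups):
    """Summarize chemical shifts sampled over several MD frames.

    per_frame_shifts: array-like of shape (n_frames, n_atoms), in ppm.
    groups: list of lists of atom indices treated as chemically equivalent
            (hypothetical layout, not confirmed by the dataset description).
    """
    shifts = np.asarray(per_frame_shifts, dtype=float)
    mean_per_atom = shifts.mean(axis=0)  # average over frames
    std_per_atom = shifts.std(axis=0)    # spread across frames (ddof=0, assumed)
    # One peak position per group of equivalent atoms.
    group_means = [float(mean_per_atom[list(g)].mean()) for g in groups]
    return mean_per_atom, std_per_atom, group_means


# Made-up example: three frames, a methyl group (atoms 0-2) and one other proton.
example_shifts = [
    [1.0, 1.2, 0.8, 3.5],
    [1.1, 1.1, 0.9, 3.7],
    [0.9, 1.3, 1.0, 3.6],
]
mean_per_atom, std_per_atom, group_means = summarize_shifts(
    example_shifts, groups=[[0, 1, 2], [3]]
)
```

Averaging within a group after averaging over frames mirrors how equivalent nuclei collapse into a single peak; whether the dataset applies the two averages in this order is not stated here.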
---
## IR Spectra Dataset
**Files**:
`IR_data_chunk001_of_009.parquet` to `IR_data_chunk009_of_009.parquet`
Because of its size, the IR dataset is split across 9 Parquet files. The first 8 files contain 20,000 molecules each; the final chunk contains the remaining 17,461.
Each molecule includes:
- `id`: Unique molecule identifier
- `smiles`: SMILES string representation
- `type`: Atom types (as integer codes: B=0, Br=1, C=2, Cl=3, F=4, H=5, I=6, N=7, O=8, P=9, S=10, Si=11)
- `Frequency(cm^-1)`: IR frequency axis (up to 4003.17 cm⁻¹, 0.3336 cm⁻¹ resolution)
- `ir_spectra`: Quantum-corrected IR intensity values
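The integer codes in `type` can be mapped back to element symbols with a small lookup table. The sketch below uses the code table listed above; the names `ELEMENT_BY_CODE` and `decode_atom_types` are hypothetical, and it assumes each molecule stores `type` as a sequence of integers.

```python
# Code table taken from the `type` field description (elements in alphabetical order).
ELEMENT_BY_CODE = {
    0: "B", 1: "Br", 2: "C", 3: "Cl", 4: "F", 5: "H",
    6: "I", 7: "N", 8: "O", 9: "P", 10: "S", 11: "Si",
}


def decode_atom_types(codes):
    """Convert a sequence of integer atom-type codes into element symbols."""
    return [ELEMENT_BY_CODE[int(code)] for code in codes]


# Example: codes for one carbon, three hydrogens, and one oxygen.
symbols = decode_atom_types([2, 5, 5, 5, 8])
```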
### IR Calculation Details
- Dipole moment trajectories were sampled every 2.5 fs over 100 ps (40,001 time steps).
- The IR spectra were computed via Fourier transform of the autocorrelation of the dipole signal, using a Blackman window to reduce spectral leakage.
- Quantum correction factors were applied to approximate spectra at 300 K, following Gaigeot and Sprik.
These spectra are suitable for spectral analysis, machine learning applications, and comparison to experimental data.
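The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `ir_spectrum_sketch` is hypothetical, the autocorrelation is estimated via FFT with zero padding, the decaying half of a Blackman window is applied to the one-sided autocorrelation, and the quantum correction is taken to be the harmonic factor, which reduces to a ω² prefactor up to constants. The source only states that the correction follows Gaigeot and Sprik, so that choice is an assumption, and intensities are in arbitrary units.

```python
import numpy as np

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def ir_spectrum_sketch(dipole, dt_fs=2.5, max_wavenumber=4003.17):
    """Return (wavenumbers in cm^-1, intensity in arbitrary units).

    dipole: array of shape (n_steps, 3) sampled every `dt_fs` femtoseconds.
    """
    dipole = np.asarray(dipole, dtype=float)
    n_steps = dipole.shape[0]
    fluct = dipole - dipole.mean(axis=0)  # remove the static dipole

    # Autocorrelation via the Wiener-Khinchin theorem; zero padding to
    # 2 * n_steps avoids circular wrap-around.
    n_fft = 2 * n_steps
    spec = np.fft.rfft(fluct, n=n_fft, axis=0)
    acf = np.fft.irfft(np.abs(spec) ** 2, n=n_fft, axis=0)[:n_steps].sum(axis=1)
    acf /= np.arange(n_steps, 0, -1)  # unbiased normalisation per lag

    # Decaying half of a Blackman window (1 at lag 0, ~0 at the longest lag).
    window = np.blackman(2 * n_steps - 1)[n_steps - 1:]
    line_shape = np.fft.rfft(acf * window).real

    wavenumbers = np.fft.rfftfreq(n_steps, d=dt_fs * 1e-15) / C_CM_PER_S
    # Assumed harmonic quantum correction: intensity ~ omega^2 * classical line shape.
    intensity = wavenumbers ** 2 * line_shape

    keep = wavenumbers <= max_wavenumber
    return wavenumbers[keep], intensity[keep]
```

With 40,001 samples at 2.5 fs spacing, the grid spacing of this transform is 1 / (40,001 × 2.5 fs × c) ≈ 0.3336 cm⁻¹, which is consistent with the resolution quoted for the `Frequency(cm^-1)` axis.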
---
## Format & Usage
All files are provided in Apache Parquet format for efficient access with Python (`pandas`, `pyarrow`) or other data processing tools.
Example loading in Python:

```python
import pandas as pd

df_nmr = pd.read_parquet("NMR_data.parquet")
df_ir = pd.read_parquet("IR_data_chunk001_of_009.parquet")
```
To load all IR data chunk files into a single dataframe:

```python
import pandas as pd

# Build the list of the nine chunk files (chunk001 ... chunk009)
files = [f"./IR_data_chunk{i:03d}_of_009.parquet" for i in range(1, 10)]

# Read each chunk and stack them into one dataframe
df_ir = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
```
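Since the IR chunks total several gigabytes, it can help to read only the columns needed for a given task. The sketch below relies on the `columns` argument of `pandas.read_parquet`; the helper names `ir_chunk_paths` and `load_ir_columns`, and the assumption that all chunks sit in one directory, are illustrative only.

```python
from pathlib import Path

import pandas as pd


def ir_chunk_paths(directory=".", n_chunks=9):
    """Return the paths of the IR chunk files in `directory`."""
    return [
        Path(directory) / f"IR_data_chunk{i:03d}_of_{n_chunks:03d}.parquet"
        for i in range(1, n_chunks + 1)
    ]


def load_ir_columns(directory=".", columns=("id", "smiles")):
    """Load only the selected columns from every IR chunk into one dataframe."""
    frames = [
        pd.read_parquet(path, columns=list(columns))
        for path in ir_chunk_paths(directory)
    ]
    return pd.concat(frames, ignore_index=True)
```

For example, `load_ir_columns(".", columns=("id", "smiles"))` would skip the spectra entirely, which keeps memory use low when only the molecule list is needed.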
## Citation and Further Information
For a detailed description of the dataset, including computational methods, dataset generation protocols, and benchmark analyses, please refer to our accompanying preprint:
- Title: IR–NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules
- ChemRxiv link: https://chemrxiv.org/engage/chemrxiv/article-details/684f1f86c1cb1ecda0230ceb
- ChemRxiv DOI: 10.26434/chemrxiv-2025-l0zqz
The publication includes complete details about the contents of both the NMR and IR datasets, their intended use cases, and guidance for applying them in computational spectroscopy and machine learning research.
Please cite this preprint if you use the dataset in your work.
Files (8.1 GB)
| MD5 checksum | Size |
|---|---|
| 5b228a7268ce05e876e38e7e15e1fc02 | 59.4 MB |
| 1b9982e8901f37b4fd1b6b6b4ffba0b9 | 901.0 MB |
| e484e367f087edf5251fe6afa9f9187f | 899.5 MB |
| ba7aad7aa33c79e629bae90822a1b78f | 899.5 MB |
| 962961978c9c41bb4395c85081e8f308 | 899.5 MB |
| 7b72d5de9ab6ba75266fb75f10b9f5f5 | 899.5 MB |
| 10e69c8b0986cd2c95d01b137045654d | 899.5 MB |
| a71d9deb634f2249df4611f1c67d7ecf | 899.5 MB |
| 748785e049db8a03d4b4df5f260cc1c6 | 899.6 MB |
| 193faad55b945885b0e7740a9b2ab257 | 785.5 MB |
| 5ba7fa1cfa1ea03a8c46dab8ed3c3501 | 23.4 MB |
| a8c3d090bdecfb18c0b28782b179cb30 | 3.1 kB |
| 0e7eef2c9f439f5afee8704c00df7ebd | 8.6 MB |
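The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and which checksum belongs to which file has to be read from the record itself, since the listing here does not pair them with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:5b22...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat even for the files close to 900 MB.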
Additional details
Identifiers
- Other
- https://chemrxiv.org/engage/chemrxiv/article-details/684f1f86c1cb1ecda0230ceb
Related works
- Is described by
- Publication: 10.26434/chemrxiv-2025-l0zqz (DOI)
Funding
- Swiss National Science Foundation
- NCCR Catalysis (phase I) 180544
- Swiss National Science Foundation
- NCCR Catalysis (phase II) 225147
Dates
- Updated
- 2025-07-01: Adding ERROR_PRED_TRUE_DIPOLE.zip to compare predicted dipoles