Published November 21, 2023 | Version v1
Dataset Open

Images dataset for Chemical Images Classifier model

  • OntoChem GmbH

Description

Original paper

This manually curated image dataset is part of the Supplementary Materials of the paper: A. Krasnov, S. Barnabas, T. Böhme, S. Boyer, L. Weber, Comparing software tools for optical chemical structure recognition, Digital Discovery (2024). https://doi.org/10.1039/D3DD00228D

Images dataset description

The dataset was used to train the image classifier model. It consists of 16,000 images collected from the following sources:

1)    Chemical data images extracted from EP, US, and WO patents by OntoChem GmbH.

2)    Images from the MolScribe datasets: https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480

3)    Images from the DECIMER hand-drawn molecule images dataset: H.O. Brinkhaus, A. Zielesny, C. Steinbeck, K. Rajan, “DECIMER - hand-drawn molecule images dataset”, 2022, Journal of Cheminformatics, 14, 36. https://doi.org/10.1186/s13321-022-00620-9

4)    Images from the RxnScribe training set: Y. Qian, J. Guo, Z. Tu, C.W. Coley, R. Barzilay, “RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing”, 2023, arXiv:2305.11845v1, https://doi.org/10.48550/arXiv.2305.11845

5)    Formula images from the im2latex-100k dataset: A prebuilt dataset for OpenAI's task for image-2-latex system, https://zenodo.org/record/56198#.YJjuCGZKgox (accessed 16 January 2024)

Structure of dataset

The dataset consists of two directories:

The "classified" directory contains manually labeled images, divided into four categories of 4,000 images each:

●      one_molecule

●      several_molecules

●      reactions

●      other
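The category layout above can be checked after unpacking the archive. The following is a minimal sketch, assuming that each category is a subdirectory of "classified" named exactly as in the list and that the images use common raster extensions; neither detail is stated on this page, and the helper name count_images_per_category is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Category names as listed in the dataset description.
CATEGORIES = ("one_molecule", "several_molecules", "reactions", "other")

# Assumption: images use one of these extensions; adjust after inspecting the archive.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff"}


def count_images_per_category(classified_dir):
    """Count image files in each category subdirectory of `classified_dir`."""
    root = Path(classified_dir)
    counts = Counter()
    for category in CATEGORIES:
        category_dir = root / category
        if not category_dir.is_dir():
            # A missing directory is reported as zero rather than raising.
            counts[category] = 0
            continue
        counts[category] = sum(
            1
            for path in category_dir.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return dict(counts)
```

Calling count_images_per_category("classified") from the directory where the archive was unpacked should, if the assumptions hold, report 4,000 images for each of the four categories.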

In the "for_model" directory, the images are split into training, test, and validation sets for building the Chemical Image Classifier model:

●      training: 12,804 images

●      test: 1,604 images

●      validation: 1,604 images

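The split sizes listed above correspond to roughly an 80/10/10 partition. The short sketch below only does the arithmetic on the stated numbers; the names SPLIT_SIZES and split_fractions are illustrative and do not come from the ChemIC repository.

```python
# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"training": 12_804, "test": 1_604, "validation": 1_604}


def split_fractions(sizes):
    """Return each split's share of the total number of images."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_images = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} images ({share:.1%})")
print(f"total: {total_images} images")
```

The three splits sum to 16,012 images, slightly more than the 16,000 images in the "classified" directory; this page does not explain the difference.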

Files

dataset_for_image_classifier.zip

Size: 1.1 GB
md5:e7e47f2aa4af4b79874a19382cec06da
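The MD5 checksum listed above can be used to confirm that the archive downloaded completely. This is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and archive_is_intact are hypothetical, and the archive is assumed to be saved under its original file name.

```python
import hashlib

# Checksum and file name as listed on this record page.
EXPECTED_MD5 = "e7e47f2aa4af4b79874a19382cec06da"
ARCHIVE_NAME = "dataset_for_image_classifier.zip"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for the 1.1 GB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, archive_is_intact(ARCHIVE_NAME) returns True only if the local file matches the published checksum; a False result suggests an incomplete or corrupted download.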

Additional details

Related works

Is supplement to
Publication: 10.1039/D3DD00228D (DOI)
Preprint: 10.26434/chemrxiv-2023-d6kmg-v2 (DOI)

Software

Repository URL
https://github.com/ontochem/ChemIC
Programming language
Python
Development Status
Active