Images dataset for Chemical Images Classifier model
Description
Original paper
The manually curated images dataset is a part of the Supplementary Materials of the paper: A. Krasnov, S. Barnabas, T. Böhme, S. Boyer, L. Weber, Comparing software tools for optical chemical structure recognition, Digital Discovery (2024). https://doi.org/10.1039/D3DD00228D
Images dataset description
The dataset was used to build the image classifier model. It consists of 16,000 images collected from the following sources:
1) Chemical data images extracted from EP, US, and WO patents by OntoChem GmbH.
2) Images from the MolScribe datasets https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480
3) Images from the DECIMER hand-drawn molecule images dataset: H.O. Brinkhaus, A. Zielesny, C. Steinbeck, K. Rajan, “DECIMER - hand-drawn molecule images dataset”, 2022, Journal of Cheminformatics, 14, 36. https://doi.org/10.1186/s13321-022-00620-9
4) Images from the RxnScribe training set: Y. Qian, J. Guo, Z. Tu, C.W. Coley, R. Barzilay, “RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing”, 2023, arXiv:2305.11845v1, https://doi.org/10.48550/arXiv.2305.11845
5) Formula images from the im2latex-100k dataset, a prebuilt dataset for OpenAI's image-to-LaTeX task: https://zenodo.org/record/56198#.YJjuCGZKgox (accessed 16 January 2024)
Structure of dataset
The dataset consists of two directories:
The "classified" directory contains manually labeled images, divided into four categories of 4,000 images each:
● one_molecule
● several_molecules
● reactions
● other
In the "for_model" directory, the images are split into training, validation, and test sets for building the Chemical Image Classifier model:
● training: 12,804 images
● test: 1,604 images
● validation: 1,604 images
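The directory layout described above can be checked after extracting the archive. The following is a minimal sketch, not part of the original dataset tooling: it counts image files in each subdirectory of "classified" and "for_model". The extraction path, the exact subdirectory names inside "for_model", and the set of image file extensions are assumptions and may need adjusting.

```python
from collections import Counter
from pathlib import Path

# Assumed image file extensions; the archive may use others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff"}


def count_images(directory: Path) -> Counter:
    """Count image files under each immediate subdirectory of `directory`."""
    counts = Counter()
    for sub in sorted(p for p in directory.iterdir() if p.is_dir()):
        counts[sub.name] = sum(
            1
            for f in sub.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def summarize(root: Path) -> dict:
    """Return per-subdirectory image counts for the two top-level directories."""
    return {
        name: count_images(root / name)
        for name in ("classified", "for_model")
        if (root / name).is_dir()
    }


if __name__ == "__main__":
    # Hypothetical extraction path; change it to where the archive was unpacked.
    root = Path("dataset_for_image_classifier")
    for top_level, counts in summarize(root).items():
        print(top_level)
        for sub_name, n in counts.items():
            print(f"  {sub_name}: {n} images")
```

Comparing the printed counts with the figures listed above is a quick way to confirm that the extraction completed.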
Files
Name | Size | Checksum |
---|---|---|
dataset_for_image_classifier.zip | 1.1 GB | md5:e7e47f2aa4af4b79874a19382cec06da |
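The MD5 checksum listed for dataset_for_image_classifier.zip can be used to confirm that the download is intact. The following is a minimal sketch using Python's standard hashlib module. The local file path is an assumption, and the helper name md5_of_file is introduced here for illustration only.

```python
import hashlib
from pathlib import Path

# Checksum as given in the file listing for dataset_for_image_classifier.zip.
EXPECTED_MD5 = "e7e47f2aa4af4b79874a19382cec06da"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a 1.1 GB archive is never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed download location; adjust to the actual path.
    archive = Path("dataset_for_image_classifier.zip")
    if archive.is_file():
        actual = md5_of_file(archive)
        if actual == EXPECTED_MD5:
            print("Checksum matches")
        else:
            print(f"Checksum mismatch: {actual}")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.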
Additional details
Related works
- Is supplement to
- Publication: 10.1039/D3DD00228D (DOI)
- Preprint: 10.26434/chemrxiv-2023-d6kmg-v2 (DOI)
Software
- Repository URL
- https://github.com/ontochem/ChemIC
- Programming language
- Python
- Development Status
- Active