DECIMER Image classifier dataset
Description
The image dataset is divided into train (10905114 images), validation (2115528 images) and test (544946 images) folders, each containing a balanced number of images for two classes: chemical structures and non-chemical structures.
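As a quick sanity check on the split sizes, the counts stated above can be turned into proportions (roughly 80 / 16 / 4 percent). This is a minimal sketch: the variable names are illustrative, and the per-class figure assumes that "balanced" means an even split between the two classes.

```python
# Image counts per split, as stated in the description.
split_counts = {"train": 10905114, "validation": 2115528, "test": 544946}

total = sum(split_counts.values())
fractions = {name: count / total for name, count in split_counts.items()}

for name, count in split_counts.items():
    # Assumption: "balanced" is read as about count / 2 images per class.
    per_class = count // 2
    print(f"{name:<10} {count:>10,d} images  {fractions[name]:6.2%}  ~{per_class:,d} per class")

print(f"{'total':<10} {total:>10,d} images")
```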
The chemical structures were generated by applying RanDepict to randomly picked compounds from the ChEMBL30 database and the COCONUT database.
The non-chemical structures were either generated using Python or retrieved from several public datasets:
COCO dataset, MIT Places-205 dataset, Visual Genome dataset, Google Open labeled Images, MMU-OCR-21 (kaggle), HandWritten_Character (kaggle), CoronaHack -Chest X-Ray-dataset (kaggle), PANDAS Augmented Images (kaggle), Bacterial_Colony (kaggle), Ceylon Epigraphy Periods (kaggle), Chinese Calligraphy Styles by Calligraphers (kaggle), Graphs Dataset (kaggle), Function_Graphs Polynomial (kaggle), sketches (kaggle), Person Face Sketches (kaggle), Art Pictograms (kaggle), Russian handwritten letters (kaggle), Handwritten Russian Letters (kaggle), Covid-19 Misinformation Tweets Labeled Dataset (kaggle) and grapheme-imgs-224x224 (kaggle).
This data was used to build a CNN classification model by fine-tuning an EfficientNetB0 base model. The model is available on GitHub.
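The following is a hedged sketch of what fine-tuning an EfficientNetB0 base model for this two-class problem could look like. It is not the authors' training code: the use of TensorFlow/Keras, the 224x224 input size, the single sigmoid output, the learning rate and the function name `build_classifier` are all assumptions made for illustration.

```python
import tensorflow as tf


def build_classifier(input_shape=(224, 224, 3), weights=None):
    """Binary classifier sketch: EfficientNetB0 backbone plus a sigmoid head.

    weights=None uses random initialisation; pass "imagenet" to start
    from pretrained weights (this triggers a download).
    """
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # freeze the backbone for the first training stage

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)  # keep batch-norm layers in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    # Assumption: one output unit, chemical structure vs. non-chemical structure.
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model, base


model, base = build_classifier()
```

A common two-stage recipe is to train the new head first with the backbone frozen, then set `base.trainable = True`, recompile with a lower learning rate, and continue training. Whether the published model followed this recipe is not stated here.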
Files
(126.6 GB)
| Name | Size |
|---|---|
| md5:9e847917a39c65e40d66a504adcdb169 | 126.6 GB |
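Because the archive is large, it is worth checking the download against the MD5 checksum listed with the file before unpacking it. The sketch below streams the file in chunks so it never has to fit in memory; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path is whatever name the archive was saved under.

```python
import hashlib

# Checksum listed for the file in this record.
EXPECTED_MD5 = "9e847917a39c65e40d66a504adcdb169"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download(path_to_archive)` returns `True` only when the local copy matches the published checksum; a `False` result usually indicates an incomplete or corrupted download.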
Additional details
References
- Brinkhaus, H.O., Rajan, K., Zielesny, A. et al. RanDepict: Random chemical structure depiction generator. J Cheminform 14, 31 (2022). https://doi.org/10.1186/s13321-022-00609-4
- Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017) 'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954
- Lin, Tsung-Yi et al. (2014). Microsoft COCO: Common Objects in Context. https://arxiv.org/abs/1405.0312
- B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems 27 (NIPS), 2014.
- Krishna, Ranjay et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. http://visualgenome.org/static/paper/Visual_Genome.pdf
- https://storage.googleapis.com/openimages/web/index.html
- T. Nasir, M. K. Malik and K. Shahzad, "MMU-OCR-21: Towards End-to-End Urdu Text Recognition Using Deep Learning," in IEEE Access, doi: 10.1109/ACCESS.2021.3110787
- https://www.kaggle.com/datasets/vaibhao/handwritten-characters
- https://www.kaggle.com/datasets/praveengovi/coronahack-chest-xraydataset
- https://www.kaggle.com/datasets/amyjang/pandatilesagg?select=all_images
- https://www.kaggle.com/datasets/nilay1987/bacterial-colony
- https://www.kaggle.com/datasets/pabasar/ceylon-epigraphy-periods
- https://www.kaggle.com/datasets/yuanhaowang486/chinese-calligraphy-styles-by-calligraphers
- https://www.kaggle.com/datasets/sunedition/graphs-dataset
- https://www.kaggle.com/datasets/kopfgeldjaeger/function-graphs-polynomial
- https://www.kaggle.com/datasets/vishnunkumar/sketches
- https://www.kaggle.com/datasets/almightyj/person-face-sketches
- https://www.kaggle.com/datasets/olgabelitskaya/art-pictogram
- https://www.kaggle.com/datasets/tatianasnwrt/russian-handwritten-letters
- https://www.kaggle.com/datasets/olgabelitskaya/handwritten-russian-letters
- https://www.kaggle.com/datasets/arashnic/misinfo-graph
- https://www.kaggle.com/datasets/roycezjq/graphemeimgs224x224