Published July 5, 2021 | Version 0.1
Dataset Open

Supplementary data for the manuscript: Image2SMILES: Transformer-based Molecular Optical Recognition Engine

  • 1. Syntelly LLC
  • 2. Skolkovo Institute of Science and Technology

Description

This is the supplementary data for the manuscript: Image2SMILES: Transformer-based Molecular Optical Recognition Engine

It contains image-string pairs generated from 1M SMILES strings that were randomly chosen from the PubChem database.
It was prepared using the code published at https://github.com/syntelly/img2smiles_generator/

To unpack, run:
tar xvf subset_1M.tar.xz && tar xvf subset_1M_dump.tar.gz && rm subset_1M_dump.tar.gz
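
Where tar is not available, the same two-step unpacking can be sketched in Python with the standard tarfile module. This is a minimal sketch, not part of the published tooling: it assumes, as the command above implies, that subset_1M.tar.xz contains subset_1M_dump.tar.gz at its top level. The function name unpack and its dest parameter are hypothetical.

```python
import tarfile
from pathlib import Path


def unpack(outer_archive, dest="."):
    """Extract the outer archive, then the nested image archive, then delete the latter."""
    dest = Path(dest)

    # Step 1: tar xvf subset_1M.tar.xz ("r:*" lets tarfile detect the compression).
    with tarfile.open(outer_archive, "r:*") as outer:
        outer.extractall(dest)

    # Step 2: tar xvf subset_1M_dump.tar.gz
    # Assumption: the nested archive sits at the top level of the outer one.
    inner_archive = dest / "subset_1M_dump.tar.gz"
    with tarfile.open(inner_archive, "r:*") as inner:
        inner.extractall(dest)

    # Step 3: rm subset_1M_dump.tar.gz
    inner_archive.unlink()
    return dest
```

Calling unpack("subset_1M.tar.xz") in the download directory should leave the same files as the shell command. Only extract archives from a source you trust, since extractall writes whatever paths the archive contains.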

You'll get the following data:

  • subset_1M.smi - list of 1M source SMILES
  • subset_1M_dump - directory with images
  • subset_1M_result.csv - list of FGSMILES - pathcode pairs; the first 3 characters of each pathcode name the corresponding subdirs in subset_1M_dump
  • subset_1M_fails.csv - list of failed molecules from subset_1M.smi
  • subset_1M_grpcounter.lst - list of counted groups, used in this generation
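
The pathcode-to-directory rule above can be sketched as a small lookup helper. This is only a sketch under stated assumptions: the description does not say whether the first 3 characters form one subdirectory (abc/) or three nested ones (a/b/c/), nor which file extension the images use, so the code tries both layouts and matches any file whose name starts with the pathcode. The function names candidate_dirs and find_image are hypothetical.

```python
import glob
from pathlib import Path


def candidate_dirs(dump_root, pathcode):
    """Return the directories that could hold the image for a pathcode."""
    prefix = pathcode[:3]
    root = Path(dump_root)
    flat = root / prefix            # assumed layout 1: subset_1M_dump/abc/
    nested = root.joinpath(*prefix)  # assumed layout 2: subset_1M_dump/a/b/c/
    return [flat, nested]


def find_image(dump_root, pathcode):
    """Return the first file whose name starts with pathcode, or None."""
    pattern = glob.escape(pathcode) + "*"  # the extension is unknown, so match any
    for directory in candidate_dirs(dump_root, pathcode):
        if directory.is_dir():
            matches = sorted(directory.glob(pattern))
            if matches:
                return matches[0]
    return None
```

After checking a few rows of subset_1M_result.csv against the unpacked directory, the layout that does not apply can be dropped from candidate_dirs.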

You can generate your own data using https://github.com/syntelly/img2smiles_generator/ 

Files

Files (5.0 GB)

md5:6c3801f6346f0ee7f3e7dfb266c6231a