Published July 5, 2021 | Version 0.1
Dataset
Open
Supplementary data for the manuscript: Image2SMILES: Transformer-based Molecular Optical Recognition Engine
- 1. Syntelly LLC
- 2. Skolkovo Institute of Science and Technology
Description
This is the supplementary data for the manuscript "Image2SMILES: Transformer-based Molecular Optical Recognition Engine".
It contains image-string pairs generated from 1M SMILES strings, which were randomly chosen from the PubChem database.
It was prepared using the code published at https://github.com/syntelly/img2smiles_generator/
To unpack, run:
tar xvf subset_1M.tar.xz && tar xvf subset_1M_dump.tar.gz && rm subset_1M_dump.tar.gz
You'll get the following data:
- subset_1M.smi - list of 1M source SMILES
- subset_1M_dump - directory with images
- subset_1M_result.csv - list of FGSMILES-pathcode pairs; the first 3 characters of each pathcode give the corresponding subdirectories in subset_1M_dump
- subset_1M_fails.csv - list of failed molecules from subset_1M.smi
- subset_1M_grpcounter.lst - list of counted groups used in this generation
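
The pathcode-to-subdirectory rule above can be sketched in Python. This is only an illustration under stated assumptions: the description does not say whether the first 3 characters form one subdirectory (`abc/`) or three nested ones (`a/b/c/`), nor what the image file extension is, so the hypothetical helpers `candidate_dirs` and `find_image` try both layouts and match any file whose name starts with the pathcode.

```python
from pathlib import Path
from typing import List, Optional


def candidate_dirs(dump_root: Path, pathcode: str) -> List[Path]:
    """Return directories that might hold the image for `pathcode`.

    Assumption: the first 3 characters name either a single subdirectory
    ("abc/") or three nested ones ("a/b/c/"); the dataset description
    does not specify which layout is used.
    """
    prefix = pathcode[:3]
    return [dump_root / prefix, dump_root.joinpath(*prefix)]


def find_image(dump_root: Path, pathcode: str) -> Optional[Path]:
    """Return the first file whose name starts with `pathcode`, or None."""
    for directory in candidate_dirs(dump_root, pathcode):
        if directory.is_dir():
            # The file extension is unknown, so match on the name prefix only.
            matches = sorted(
                p for p in directory.iterdir() if p.name.startswith(pathcode)
            )
            if matches:
                return matches[0]
    return None
```

With the archive unpacked, `find_image(Path("subset_1M_dump"), pathcode)` could then be called for each pathcode read from subset_1M_result.csv. The column order and delimiter of that file are not documented here, so inspect it before parsing.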
You can generate your own data using https://github.com/syntelly/img2smiles_generator/
Files
(5.0 GB)
| MD5 checksum | Size |
|---|---|
| 6c3801f6346f0ee7f3e7dfb266c6231a | 5.0 GB |
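
Before unpacking a 5.0 GB download, it may be worth comparing its MD5 checksum against the value listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_expected` are illustrative, and the archive path is left as a parameter because the file name is not shown in the table.

```python
import hashlib
from pathlib import Path

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "6c3801f6346f0ee7f3e7dfb266c6231a"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use constant for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals `expected`."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_expected(Path(...))` on the downloaded archive should return `True` if the transfer completed without corruption.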