im2latex 230k
Description
The dataset comprises over 230,000 LaTeX math formulas and their corresponding .png images. The images vary in size and have a resolution of 72 dpi. The dataset size has been increased from 180,000 to 230,000 in version 3. The dataset was generated using a tool built with JavaScript and Python, which is available on GitHub: https://github.com/gmarus777/Printed-Latex-Data-Generation
Formulas were parsed from LaTeX sources provided here: http://www.cs.cornell.edu/projects/kddcup/datasets.html (originally from arXiv).
Contents:
- folder `generated_png_images` contains PNG images
- `corresponding_png_images.txt`: each line contains the filename of a PNG image in the folder `generated_png_images`
- `final_png_formulas.txt`: each line contains a corresponding LaTeX formula
- `230k.json` contains a vocabulary consisting of 579 tokens.
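The layout above suggests that the two text files can be read side by side to pair each image with its formula. The following is a minimal sketch, not part of the dataset tooling. It assumes that line i of `corresponding_png_images.txt` and line i of `final_png_formulas.txt` describe the same sample, that both files are UTF-8 encoded, and that the archive has been extracted into a single root directory. The function name `load_pairs` is hypothetical.

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a text file without their trailing newline."""
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        return [line.rstrip("\n") for line in f]


def load_pairs(root):
    """Pair image paths with formulas.

    Assumes line i of both text files refers to the same sample;
    the dataset description does not state this explicitly.
    """
    root = Path(root)
    names = read_lines(root / "corresponding_png_images.txt")
    formulas = read_lines(root / "final_png_formulas.txt")
    if len(names) != len(formulas):
        raise ValueError(
            f"line count mismatch: {len(names)} filenames, {len(formulas)} formulas"
        )
    image_dir = root / "generated_png_images"
    return [
        (image_dir / name.strip(), formula.strip())
        for name, formula in zip(names, formulas)
    ]
```

Calling `load_pairs` on the extracted directory returns a list of `(image_path, formula)` tuples. The length check is there so that a misaligned or truncated file fails loudly instead of silently pairing images with the wrong formulas.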
Version 3 updates:
- Dataset size increased to 230k (from 180k)
Files
| Name | Size | Checksum |
|---|---|---|
| PRINTED_TEX_230k.zip | 1.0 GB | md5:aad8439670252c4f5d7773f44f86cd01 |
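The MD5 checksum listed for the archive can be used to confirm that a download completed intact. A minimal sketch using Python's standard `hashlib` module follows. The helper name `md5_of_file` and the local path in the usage note are assumptions. The expected value is the one given in the file listing.

```python
import hashlib

# Checksum taken from the file listing for PRINTED_TEX_230k.zip.
EXPECTED_MD5 = "aad8439670252c4f5d7773f44f86cd01"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("PRINTED_TEX_230k.zip") == EXPECTED_MD5` should evaluate to `True` for an uncorrupted download, assuming the archive was saved under that name in the current directory. Reading in chunks keeps memory use small for a file of about 1 GB.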