Published March 15, 2023 | Version 3
Dataset Open

im2latex 230k

  • 1. University of California, Berkeley
  • 2. Penn State

Description

The dataset comprises over 230,000 LaTeX math formulas and their corresponding .png images. The images vary in size and have a resolution of 72 dpi. The formulas were parsed from LaTeX sources that originally came from arXiv. Version 3 increases the dataset size from 180,000 to 230,000 formulas. The dataset was generated using a tool built with JavaScript and Python, which is available on GitHub: https://github.com/gmarus777/Printed-Latex-Data-Generation

Formulas were parsed from LaTeX sources provided here: http://www.cs.cornell.edu/projects/kddcup/datasets.html (originally from arXiv).

Contents:
- folder `generated_png_images` contains PNG images
- `corresponding_png_images.txt` lists, one per line, the filenames of the PNG images in the folder `generated_png_images`
- `final_png_formulas.txt` contains the corresponding LaTeX formulas, one per line
- `230k.json` contains a vocabulary consisting of 579 tokens.
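
The file layout above suggests a simple way to pair images with formulas. The following is a minimal sketch, not the authors' loader: it assumes that line *i* of `corresponding_png_images.txt` matches line *i* of `final_png_formulas.txt`, and that both text files sit next to the `generated_png_images` folder in the extracted archive. The function name `load_pairs` and the `root` argument are hypothetical.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file and return its lines without the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_pairs(root):
    """Return (image_path, formula) tuples for the extracted dataset.

    Assumption: the two text files are aligned line by line, so the
    image named on line i shows the formula on line i.
    """
    root = Path(root)
    names = read_lines(root / "corresponding_png_images.txt")
    formulas = read_lines(root / "final_png_formulas.txt")
    if len(names) != len(formulas):
        raise ValueError(
            f"line count mismatch: {len(names)} filenames, {len(formulas)} formulas"
        )
    image_dir = root / "generated_png_images"
    return [(image_dir / name.strip(), formula) for name, formula in zip(names, formulas)]
```

Calling `load_pairs` on the directory where the archive was extracted returns a list that can be fed to an image library or a training pipeline. The length check is there so that a misaligned or truncated file fails loudly instead of silently pairing images with the wrong formulas.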

 

Version 3 updates:

- Dataset size increased to 230k (from 180k)

Files

PRINTED_TEX_230k.zip

Files (1.0 GB)

md5:aad8439670252c4f5d7773f44f86cd01
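
After downloading, the archive can be checked against the MD5 checksum listed above. This is a minimal sketch using Python's standard `hashlib`; it assumes the checksum refers to `PRINTED_TEX_230k.zip`, and the helper names `md5_of_file` and `matches_published_checksum` are hypothetical.

```python
import hashlib

# Checksum as listed on this page (assumed to belong to PRINTED_TEX_230k.zip).
EXPECTED_MD5 = "aad8439670252c4f5d7773f44f86cd01"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat for a 1.0 GB file. A result of `False` from `matches_published_checksum("PRINTED_TEX_230k.zip")` would suggest an incomplete or corrupted download.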