im2latex-100k , arXiv:1609.04938

Kanervisto, Anssi

doi:10.5281/zenodo.56198

Published June 21, 2016 | Version v1

Dataset Open

Kanervisto, Anssi¹

1. University of Eastern Finland

A prebuilt dataset for OpenAI's image-to-LaTeX (im2latex) task. It contains about 100k formulas and images in total, split into train, validation and test sets. The formulas were parsed from the LaTeX sources provided here: http://www.cs.cornell.edu/projects/kddcup/datasets.html (originally from arXiv).

Each image is a fixed-size PNG. The formula is rendered in black and the rest of the image is transparent.

For related tools (e.g. a tokenizer), see this repository: https://github.com/Miffyli/im2latex-dataset
For ready-made evaluation scripts and a built im2latex system, see this repository: https://github.com/harvardnlp/im2markup
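
The split files (im2latex_train.lst, im2latex_validate.lst, im2latex_test.lst) tie formulas to images. Their layout is not described on this page, so the sketch below assumes that each line holds three whitespace-separated fields: a formula index (the line number of the formula in im2latex_formulas.lst), an image name without the .png extension, and a render type. The names SplitEntry, parse_split_line and read_split are hypothetical helpers introduced only for illustration.

```python
from typing import List, NamedTuple


class SplitEntry(NamedTuple):
    """One line of a split file (field layout is an assumption)."""
    formula_idx: int   # assumed: line number in im2latex_formulas.lst
    image_name: str    # assumed: image file name without ".png"
    render_type: str   # assumed: name of the render setup


def parse_split_line(line: str) -> SplitEntry:
    """Parse one 'formula_idx image_name render_type' line."""
    formula_idx, image_name, render_type = line.split()
    return SplitEntry(int(formula_idx), image_name, render_type)


def read_split(path: str) -> List[SplitEntry]:
    """Read a whole split file, skipping blank lines."""
    with open(path, newline="\n") as f:
        return [parse_split_line(line) for line in f if line.strip()]
```

If the assumed layout is wrong for a given file, parse_split_line raises ValueError on the first line that does not have exactly three fields, which makes the mismatch easy to spot.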

The formula file im2latex_formulas.lst uses UNIX-style newlines (\n). Reading it with any other newline convention gives a slightly wrong number of lines (104563 instead of 103558) and thus breaks the structure used by this dataset. Python 3.x opens text files in universal-newlines mode by default, which also treats \r and \r\n as line breaks; to avoid this, the file must be opened with newline="\n" (e.g. open("im2latex_formulas.lst", newline="\n")).
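
A minimal sketch of reading the formula file so that only \n ends a line. The helper name read_formulas is hypothetical, and encoding="latin-1" is an assumption chosen because it can decode any byte sequence, not because the file's encoding is stated here.

```python
from typing import List


def read_formulas(path: str) -> List[str]:
    """Return one formula per UNIX-style line of the file at `path`."""
    formulas = []
    # newline="\n" disables universal-newlines handling, so a stray "\r"
    # inside a formula stays part of that formula instead of splitting it.
    # encoding="latin-1" is an assumption (it never fails to decode).
    with open(path, newline="\n", encoding="latin-1") as f:
        for line in f:
            # Strip only the terminating "\n"; keep any other characters.
            formulas.append(line.rstrip("\n"))
    return formulas
```

A quick sanity check after loading is to compare len(read_formulas("im2latex_formulas.lst")) with the line count of 103558 given above. Avoid str.splitlines() on the file contents, since it also splits on \r and other line-boundary characters.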

Files (306.8 MB)

Name	MD5	Size
formula_images.tar.gz	cf25f2408f1ea09bbd096890a6361533	292.2 MB
im2latex_formulas.lst	974c0a14f0daa6d91ecd0e625f1ddf52	12.3 MB
im2latex_test.lst	1bc17b865796dca5df15250b4da7804f	237.4 kB
im2latex_train.lst	d5607c37aa00576098a9e4bad84a7040	1.9 MB
im2latex_validate.lst	cf6eeee02bc443b1b9557685fbfe7ea5	213.7 kB
readme.txt	3d4cb64d8c403148ff06370d71072cdc	924 Bytes
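
The MD5 values in the listing above can be used to check downloads for corruption. The following is a sketch only: the function names md5_of and find_mismatches are hypothetical, and it assumes the files were saved under their listed names in a single directory.

```python
import hashlib
import os
from typing import Dict, List

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "formula_images.tar.gz": "cf25f2408f1ea09bbd096890a6361533",
    "im2latex_formulas.lst": "974c0a14f0daa6d91ecd0e625f1ddf52",
    "im2latex_test.lst": "1bc17b865796dca5df15250b4da7804f",
    "im2latex_train.lst": "d5607c37aa00576098a9e4bad84a7040",
    "im2latex_validate.lst": "cf6eeee02bc443b1b9557685fbfe7ea5",
    "readme.txt": "3d4cb64d8c403148ff06370d71072cdc",
}


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory: str,
                    expected: Dict[str, str] = EXPECTED_MD5) -> List[str]:
    """Return names of files whose digest differs from `expected`.

    Raises FileNotFoundError if a listed file is missing.
    """
    return [name for name, md5 in expected.items()
            if md5_of(os.path.join(directory, name)) != md5]
```

An empty list from find_mismatches(".") would indicate that every listed file in the current directory matches its published checksum.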

Additional details

Is part of: arXiv:1609.04938 (arXiv)

	All versions	This version
Views	35,309	35,163
Downloads	22,118	22,071
Data volume	5.0 TB	5.0 TB
