Dataset Open Access

InftyMCCDB-2 dataset

Mahshad Mahdavi

The InftyMCCDB-2 dataset is a modified version of InftyCDB-2, which contains mathematical expressions from scanned article pages.

The original dataset has 21,056 math expressions. We removed formulas containing matrices and grids, leaving 19,381 formulas. The dataset includes 213 symbol classes and is split into two sets, training (12,551 images) and testing (6,830 images), with approximately the same distribution of symbol classes and relation classes. Expressions range in size from a single symbol to more than 75 symbols, with an average of 7.33 symbols per expression.

The original InftyCDB-2 provides ground truth at the symbol level. We extracted connected component bounding boxes, and generated new ground truth for each image using a labeled adjacency matrix ('label graph') representation.
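To make the idea of a labeled adjacency matrix concrete, here is a minimal Python sketch. It is an illustration only, not the dataset's .lg file format: the example expression, the relation names `Sup` and `Right`, the `_` placeholder, and the function `build_label_graph` are all assumptions made for this sketch. For simplicity each node stands for one symbol, whereas the dataset's ground truth is defined over connected components.

```python
# Hypothetical expression "x^2 + 1", one node per symbol (a simplification).
symbols = ["x", "2", "+", "1"]

# Hypothetical directed relations between nodes, keyed by (parent, child) index.
relations = {(0, 1): "Sup", (0, 2): "Right", (2, 3): "Right"}


def build_label_graph(symbols, relations, no_relation="_"):
    """Build a labeled adjacency matrix as a list of lists.

    Diagonal cell [i][i] holds the class label of node i.
    Off-diagonal cell [i][j] holds the relation label from node i to node j,
    or `no_relation` when the two nodes are not directly related.
    """
    n = len(symbols)
    matrix = [[no_relation] * n for _ in range(n)]
    for i, label in enumerate(symbols):
        matrix[i][i] = label
    for (parent, child), relation in relations.items():
        matrix[parent][child] = relation
    return matrix


if __name__ == "__main__":
    for row in build_label_graph(symbols, relations):
        print(" ".join(f"{cell:>5}" for cell in row))
```

Keeping symbol labels on the diagonal and relation labels off the diagonal lets a single matrix carry both the classification and the structure of an expression.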

The set of .lg (label graph) ground truth files is provided, along with a .png image for each expression.

Files (71.2 MB)

Name         Size     MD5
IMG.zip      27.2 MB  0ed9bcd2759e391202d64bd56edfc955
LG.zip       35.5 MB  6d82935ac1f8d2b511c08aaf75593d25
LG_test.zip  8.5 MB   429d8a488ace9b4203b8c131fed5ffd3
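
The MD5 checksums listed for the archives can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library; it assumes the three archives were saved under their listed names in a single directory, and the helper names `md5_of_file` and `verify_downloads` are made up for this example.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "IMG.zip": "0ed9bcd2759e391202d64bd56edfc955",
    "LG.zip": "6d82935ac1f8d2b511c08aaf75593d25",
    "LG_test.zip": "429d8a488ace9b4203b8c131fed5ffd3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {filename: True, False, or None}; None means the file is missing."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, outcome in verify_downloads().items():
        print(f"{name}: {labels[outcome]}")
```

A mismatch usually indicates a truncated or corrupted download, in which case fetching the archive again is the simplest remedy.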
@inproceedings{Mahdavi2019LPGAL,
  title={LPGA: Line-Of-Sight Parsing with Graph-based Attention for Math Formula Recognition},
  author={Mahshad Mahdavi and Michael R. Condon and Kenny Davila},
  year={2019}
}
