The IBEM Dataset: a large printed scientific image dataset for indexing and searching mathematical expressions
Description
The IBEM dataset consists of 600 documents totalling 8272 pages, which contain 29603 isolated and 137089 embedded Mathematical Expressions (MEs). The objective of the IBEM dataset is to facilitate the indexing and searching of MEs in massive collections of STEM documents. The dataset was built by parsing the LaTeX source files of documents from the KDD Cup Collection. The IBEM ground truth (GT) supports several kinds of experiments, such as ME detection and extraction, and ME recognition.
The dataset consists of the following files:
- “IBEM.json”: file containing the IBEM GT information. The data is organized first by page, then by type of expression (“embedded” or “displayed”), and lastly by the GT of each individual ME. For each ME we provide:
- x/y page-level coordinates, reported as percentages of the width and height of the page image.
- “split” attribute indicating the number of fragments into which the ME has been split. MEs can be split over several lines, columns or pages. The LaTeX transcript of a split ME has been replicated in full (entire LaTeX definition) for each fragment.
- “latex” original transcript as extracted from the LaTeX source files of the documents. This definition can contain user-defined macros. In order to be able to compile these expressions, each page includes the preamble of the source files containing the defined macros and the packages used by the authors of the documents.
- “latex_expand” transcript reconstructed from the output stream of the LuaLaTeX engine, in which user-defined macros have been expanded. The transcript has the same visual representation as the original transcript, with the difference that the LaTeX definitions are tokenized, the order of subscript and superscript elements has been fixed, and matrices have been transformed to arrays.
- “latex_norm” transcript resulting from applying an extra normalization process to the “latex_expand” expression. This normalization process includes removing font information such as slant, style, and weight.
- “partitions/*.lst”: files listing the pages that form each partition set.
- “pages/*.jpg”: individual pages extracted from the documents.
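As a rough illustration of the layout described above, the following sketch walks a page, expression type, ME hierarchy and converts relative coordinates to pixels. The description does not state the exact key names used for pages or coordinates, so the `x0`, `y0`, `x1`, `y1` keys, the page identifier and the small in-memory sample are hypothetical stand-ins; only the “embedded”/“displayed” grouping and the “split”, “latex”, “latex_expand” and “latex_norm” attributes come from the text. It also assumes that coordinates are percentages in the range 0 to 100.

```python
def to_pixels(rel_box, page_width, page_height):
    """Convert a box given in percent of the page size to integer pixels."""
    x0, y0, x1, y1 = rel_box
    return (
        round(x0 / 100.0 * page_width),
        round(y0 / 100.0 * page_height),
        round(x1 / 100.0 * page_width),
        round(y1 / 100.0 * page_height),
    )


def iter_expressions(gt):
    """Yield (page_id, me_type, me) from a pages -> type -> MEs mapping."""
    for page_id, page in gt.items():
        for me_type in ("embedded", "displayed"):
            for me in page.get(me_type, []):
                yield page_id, me_type, me


# Hypothetical sample mimicking the described hierarchy.
# The real key names in IBEM.json may differ.
sample_gt = {
    "doc0001_page01": {
        "embedded": [
            {
                "x0": 10.0, "y0": 20.0, "x1": 30.0, "y1": 25.0,
                "split": 1,
                "latex": "$x^2$",
                "latex_expand": "x ^ { 2 }",
                "latex_norm": "x ^ { 2 }",
            },
        ],
        "displayed": [],
    }
}

boxes = []
for page_id, me_type, me in iter_expressions(sample_gt):
    rel = (me["x0"], me["y0"], me["x1"], me["y1"])
    # Page size in pixels is made up here; read it from the page image in practice.
    boxes.append((page_id, me_type, to_pixels(rel, page_width=1000, page_height=2000)))
```

With the real file, `sample_gt` would be replaced by the result of `json.load` on “IBEM.json”, after adjusting the key names to those actually present.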
The dataset is partitioned into various sets as provided for the ICDAR 2021 Competition on Mathematical Formula Detection. The ground-truth related to this competition, which is included in this dataset version, can also be found here. More information about the competition can be found in the following paper:
D. Anitei, J.A. Sánchez, J.M. Fuentes, R. Paredes, and J.M. Benedí. ICDAR 2021 Competition on Mathematical Formula Detection. In ICDAR, pages 783–795, 2021.
For ME recognition tasks, we recommend rendering the “latex_expand” version of the formulae in order to create standalone expressions that have the same visual representation as the MEs found in the original documents (see the attached Python script “extract_GT.py”). Extracting MEs from the documents based on coordinates is more complex, as special care is needed to concatenate the fragments of split expressions. Baseline results for ME recognition tasks will soon be made available.
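The rendering approach recommended above could be sketched as follows. This is not the content of “extract_GT.py”; it is a minimal illustration that wraps one “latex_expand” string in a LaTeX document using the `standalone` class. It assumes the transcript is stored without math delimiters and that `amsmath` and `amssymb` are enough to typeset it; the function name `build_standalone` and the `extra_preamble` parameter are hypothetical.

```python
def build_standalone(expr, extra_preamble=""):
    """Return LaTeX source for a tightly cropped page holding one expression."""
    lines = [
        r"\documentclass[border=2pt]{standalone}",
        r"\usepackage{amsmath,amssymb}",   # assumed sufficient for the expression
        extra_preamble,                     # optional extra packages or macros
        r"\begin{document}",
        r"$\displaystyle " + expr + r"$",  # assumes expr has no math delimiters
        r"\end{document}",
    ]
    # Skip the preamble slot when it is empty so no blank line is emitted.
    return "\n".join(line for line in lines if line) + "\n"


source = build_standalone(r"\frac { a } { b }")
```

The returned string can be written to a `.tex` file and compiled with a locally installed engine such as LuaLaTeX to obtain an image of the isolated expression.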
Notes
Files
IBEM.zip (2.8 GB)
Name | Size | Checksum |
---|---|---|
IBEM.zip | 2.8 GB | md5:711aa3b6ff70b8682df5ca7e29e35f85 |
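The MD5 checksum listed for IBEM.zip can be used to verify a download. The sketch below uses only the Python standard library and reads the file in chunks so that the 2.8 GB archive does not have to fit in memory; the local path `IBEM.zip` in the usage note is an assumption about where the file was saved.

```python
import hashlib

# Checksum as listed for IBEM.zip on this page.
EXPECTED_MD5 = "711aa3b6ff70b8682df5ca7e29e35f85"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, `md5_of_file("IBEM.zip") == EXPECTED_MD5` should evaluate to `True` if the archive is intact.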
Additional details
Related works
- Cites: 10.1007/978-3-030-86337-1_52 (DOI)