Published July 19, 2021 | Version v1
Dataset Open

Multi-layout Invoice Document Dataset (MIDD)

  • 1. Symbiosis Institute of Technology, Symbiosis International (Deemed University)
  • 2. Symbiosis Institute of Technology

Description

Research Purpose/Goal of Multi-Layout Invoice Document Dataset (MIDD)

· To provide researchers working in this domain with annotated invoice documents in varied layouts, in IOB format, for identifying and extracting named entities (named entity recognition). Obtaining a sufficiently large, high-quality annotated corpus for automated information extraction from unstructured documents is one of the biggest challenges researchers face.

· To overcome the limitations of the rule-based and template-based named entity extraction traditionally used in information extraction from unstructured documents. Template-free processing is key to processing and managing the large volume of unstructured documents in the current digital era.

· To provide varied invoice layouts so that researchers can develop a generalized AI-based model trained on a variety of unstructured invoice layouts. The resulting structured output can later be integrated into an organization's information management applications and used in decision-making.
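
For readers unfamiliar with the IOB format mentioned in the goals above, the following is a minimal sketch of how IOB-tagged tokens can be grouped back into named entities. The tokens, the label names (INVOICE_NO, DATE), and the helper iob_to_entities are hypothetical and chosen only for illustration; the dataset's actual label set and file layout may differ.

```python
from typing import List, Tuple


def iob_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (label, text) pairs.

    B-X opens an entity of type X, I-X continues it, and O marks a token
    that belongs to no entity.
    """
    entities: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the one in progress, if any.
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open entity;
            # the token itself is not attached to an entity.
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label = None
            current_tokens = []

    # Flush an entity that runs to the end of the sequence.
    if current_label is not None:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


# Hypothetical invoice fragment; labels are illustrative assumptions.
tokens = ["Invoice", "No", ":", "INV-001", "Date", ":", "19", "July", "2021"]
tags = ["O", "O", "O", "B-INVOICE_NO", "O", "O", "B-DATE", "I-DATE", "I-DATE"]

entities = iob_to_entities(tokens, tags)
print(entities)
```

Because each token carries its own tag, this representation does not depend on where a field sits on the page, which is what makes it suitable for template-free extraction across differing invoice layouts.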

Notes

Pune

Files

Files (1.1 MB)

md5:e10bbe1e50a10af5b465d3accf80bce0
1.1 MB