Published April 29, 2024 | Version v1
Dataset Open

Datasets for "Reading Order Independent Metrics for Information Extraction in Handwritten Documents"

  • 1. TEKLIA

Description

This repository contains the five datasets used in our paper "Reading Order Independent Metrics for Information Extraction in Handwritten Documents", which compares several metrics for evaluating end-to-end information extraction from scanned documents.

Datasets

Five datasets are released in the BIO (Beginning, Inside, Outside) tagging format:

  • IAM
  • Simara
  • POPP
  • Esposalles
  • French Military Records
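As an illustration of the BIO convention, the sketch below groups tagged tokens into entities. It assumes one "token tag" pair per line, separated by whitespace, which may not match the exact layout of the released files. The function name bio_to_entities and the labels PER and LOC are hypothetical.

```python
from typing import List, Tuple


def bio_to_entities(lines: List[str]) -> List[Tuple[str, str]]:
    """Group 'token tag' lines into (label, text) entities.

    Assumption: each line holds a token followed by its BIO tag,
    e.g. "John B-PER". The released files may be laid out differently.
    """
    entities: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built, if any.
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Blank or malformed line: end the current entity.
            flush()
            continue
        token, tag = parts[0], parts[-1]
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the entity with the same label.
            current_tokens.append(token)
        else:
            # "O" tags and orphan I- tags are treated as outside any entity.
            flush()
    flush()
    return entities


example = ["John B-PER", "Smith I-PER", "lives O", "in O", "Paris B-LOC"]
print(bio_to_entities(example))
```

Orphan I- tags are dropped here for simplicity; a stricter reader could instead raise an error or start a new entity.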

For each dataset, we provide the following data for the test set:

  • Ground truth annotations (gt/)
  • Automatic predictions (dan/)
  • Automatic predictions with entities appearing in random order (dan_shuffled/)

The data is organized as follows:

├── Dataset name/
│   ├── gt/
│   ├── dan/
│   └── dan_shuffled/
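Given this layout, a quick sanity check after unpacking the archive might look like the sketch below. It assumes that a prediction file carries the same file name as its ground truth file, which is not stated in this record. The helper name missing_predictions is hypothetical.

```python
from pathlib import Path
from typing import Dict, Set

PREDICTION_DIRS = ("dan", "dan_shuffled")


def list_names(folder: Path) -> Set[str]:
    """Return the names of the regular files in a folder (empty set if absent)."""
    if not folder.is_dir():
        return set()
    return {p.name for p in folder.iterdir() if p.is_file()}


def missing_predictions(dataset_dir: Path) -> Dict[str, Set[str]]:
    """For each prediction folder, list gt/ file names that have no counterpart.

    Assumption: gt/, dan/ and dan_shuffled/ use identical file names.
    """
    gt_names = list_names(dataset_dir / "gt")
    return {
        sub: gt_names - list_names(dataset_dir / sub)
        for sub in PREDICTION_DIRS
    }
```

Calling missing_predictions(Path("IAM_paragraph")) would return an empty set for each prediction folder when every ground truth file has a matching prediction.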

Metrics

To install the ie-eval package, run pip install ie-eval.

To compute all metrics on a specific dataset, run:

ie-eval all --label-dir IAM_paragraph/gt/ --prediction-dir IAM_paragraph/dan/
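To compare scores on the ordered and shuffled predictions, the same command can be built for both prediction folders. The sketch below only assembles the command lines, using the subcommand and options shown above; the helper name build_commands is hypothetical, and the folder names follow the layout described in this record.

```python
import shlex
from pathlib import Path
from typing import List


def build_commands(dataset_dir: str) -> List[List[str]]:
    """Build one `ie-eval all` command per prediction folder of a dataset."""
    root = Path(dataset_dir)
    return [
        [
            "ie-eval", "all",
            "--label-dir", str(root / "gt"),
            "--prediction-dir", str(root / pred),
        ]
        for pred in ("dan", "dan_shuffled")
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    for cmd in build_commands("IAM_paragraph"):
        print(shlex.join(cmd))
```

Each command can then be executed with subprocess.run(cmd, check=True) once ie-eval is installed.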

To learn more about the various options, use the --help argument or read the documentation.

 

Files

datasets.zip

Files (2.9 MB)

Size Checksum
2.9 MB md5:cae039515544710f6b239763c56172a3
1.3 kB md5:a71ce0bf6fd95faae5354c31909a2c5e