Published December 13, 2023 | Version v2
Dataset Open

FATURA Dataset

  • 1. Digital Research Center Of Sfax

Description

The dataset consists of 10,000 JPG images with white backgrounds, 10,000 JPG images with colored backgrounds (the same colors used in the paper), and 3 × 10,000 JSON annotation files. The images are generated from 50 different templates, with 200 images generated per template. We provide annotations in three formats: our own original format, the COCO format, and a format compatible with HuggingFace Transformers. Background color varies across templates but not across instances of the same template.

The dataset contains 24 different object classes. The classes vary considerably in how often they occur, so the dataset is somewhat imbalanced.
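The imbalance mentioned above can be measured before training. The following is a minimal sketch, assuming the class labels have already been read from the annotation files into a flat list. The label names and counts are hypothetical toy values, not statistics of the dataset.

```python
from collections import Counter


def class_imbalance(labels):
    """Count occurrences per class and return (counts, max/min ratio)."""
    counts = Counter(labels)
    most_common = max(counts.values())
    least_common = min(counts.values())
    return counts, most_common / least_common


# Hypothetical toy labels; the real dataset has 24 classes.
toy_labels = ["TOTAL"] * 3 + ["DATE"] * 1 + ["LOGO"] * 6
toy_counts, toy_ratio = class_imbalance(toy_labels)
print(toy_counts.most_common())
print(f"imbalance ratio: {toy_ratio:.1f}")
```

A ratio close to 1 indicates balanced classes; larger values indicate that the rarest class has far fewer examples than the most frequent one.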

The annotations contain bounding box coordinates, bounding box text and object classes.
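As an illustration of how the COCO-format annotations might be consumed, here is a minimal sketch that groups boxes by image. It relies only on the standard COCO keys (`images`, `annotations`, `categories`, and `bbox` as `[x, y, width, height]`). The `text` key, the file name, and the class names are assumptions made for the example and may not match the files in the archive.

```python
from collections import defaultdict


def group_coco_annotations(coco):
    """Group COCO-style annotations by image file name."""
    id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
    id_to_class = {cat["id"]: cat["name"] for cat in coco["categories"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        # COCO convention: top-left corner plus width and height.
        x, y, w, h = ann["bbox"]
        grouped[id_to_name[ann["image_id"]]].append({
            "label": id_to_class[ann["category_id"]],
            "box_xyxy": (x, y, x + w, y + h),
            # Assumption: the box text is stored under a "text" key.
            "text": ann.get("text"),
        })
    return dict(grouped)


# Hypothetical toy record; with a real file, use json.load instead.
toy_coco = {
    "images": [{"id": 1, "file_name": "invoice_0.jpg"}],
    "categories": [{"id": 1, "name": "TOTAL"}, {"id": 2, "name": "DATE"}],
    "annotations": [
        {"image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40], "text": "42.00"},
        {"image_id": 1, "category_id": 2, "bbox": [5, 5, 50, 10]},
    ],
}
toy_grouped = group_coco_annotations(toy_coco)
print(toy_grouped["invoice_0.jpg"])
```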

We propose two strategies for training and evaluating models. In both, the models were trained until convergence, i.e., until performance on the validation split peaked and the model started to overfit. The model version used for evaluation is the one with the best validation performance.
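The checkpoint selection described above can be sketched as follows. This is an illustrative reading, not the authors' training code: it assumes one validation score per epoch where higher is better, and the `patience` threshold used to detect overfitting is a hypothetical parameter that the description does not specify.

```python
def select_best_epoch(val_scores, patience=3):
    """Return (epoch, score) of the best validation result seen before stopping.

    Training is considered converged once `patience` consecutive epochs
    pass without improving on the best validation score.
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # validation stopped improving: treat as overfitting
    return best_epoch, best_score


# Hypothetical validation curve; the last value is never reached.
toy_scores = [0.70, 0.80, 0.85, 0.84, 0.83, 0.82, 0.90]
toy_best = select_best_epoch(toy_scores, patience=3)
print(toy_best)
```

In practice the model weights would be saved whenever the best score improves, so the selected epoch can be restored for evaluation.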

First evaluation strategy:
For each template, the generated images are randomly split into 3 subsets: training, validation and testing.
In this scenario, the model trains on all templates and is thus tested on new images rather than new layouts.
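A minimal sketch of this per-template split is shown below. The archive ships the actual splits, which should be preferred for reproducing results. The split fractions, the seed, and the file naming are assumptions chosen for illustration.

```python
import random


def split_within_templates(images_by_template, train_frac=0.8, val_frac=0.1, seed=0):
    """Randomly split each template's images into train/val/test subsets."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for template in sorted(images_by_template):
        shuffled = list(images_by_template[template])
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_frac)
        n_val = round(len(shuffled) * val_frac)
        splits["train"] += shuffled[:n_train]
        splits["val"] += shuffled[n_train:n_train + n_val]
        splits["test"] += shuffled[n_train + n_val:]
    return splits


# Hypothetical file names: 50 templates with 200 images each.
toy_images = {
    t: [f"template{t}_{i}.jpg" for i in range(200)] for t in range(50)
}
toy_splits = split_within_templates(toy_images)
print({name: len(files) for name, files in toy_splits.items()})
```

Because every template contributes to all three subsets, the test images share their layouts with the training images.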

Second evaluation strategy:
The real templates are randomly split into a set of training templates and a separate, common set of templates for validation and testing. All the variants created from the training templates form the training dataset. The variants created from the remaining templates are divided between the validation and testing datasets, so these two sets are made up of the same templates but of different images.
This approach tests the models' performance on different unseen templates/layouts, rather than the same templates with different content.
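The template-level split can be sketched in the same way. Again, the provided split files are the reference. The number of training templates, the validation fraction, the seed, and the file naming are assumptions made for the example.

```python
import random


def split_across_templates(images_by_template, n_train_templates, val_frac=0.5, seed=0):
    """Hold out whole templates, then share them between val and test."""
    rng = random.Random(seed)
    templates = sorted(images_by_template)
    rng.shuffle(templates)
    train_templates = templates[:n_train_templates]
    held_out_templates = templates[n_train_templates:]

    splits = {"train": [], "val": [], "test": []}
    for template in train_templates:
        splits["train"] += images_by_template[template]
    for template in held_out_templates:
        # Same templates in val and test, but disjoint images.
        shuffled = list(images_by_template[template])
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_frac)
        splits["val"] += shuffled[:n_val]
        splits["test"] += shuffled[n_val:]
    return splits, set(train_templates), set(held_out_templates)


# Hypothetical file names: 50 templates with 200 images each.
toy_by_template = {
    t: [f"template{t}_{i}.jpg" for i in range(200)] for t in range(50)
}
toy_layout_splits, toy_train_t, toy_held_out_t = split_across_templates(
    toy_by_template, n_train_templates=40
)
print({name: len(files) for name, files in toy_layout_splits.items()})
```

No held-out template appears in training, so test performance reflects generalization to unseen layouts.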

We provide the data splits we used for every evaluation scenario. We also provide the background colors we used as augmentation for each template.

Notes

This dataset was developed in the Digital Research Center of Sfax.

Files

FATURA2.zip (690.7 MB)
md5:4c9404462f22c5241eb1a290a02eb2a2

Additional details

References

  • @misc{limam2023fatura, title={FATURA: A Multi-Layout Invoice Image Dataset for Document Analysis and Understanding}, author={Mahmoud Limam and Marwa Dhiaf and Yousri Kessentini}, year={2023}, eprint={2311.11856}, archivePrefix={arXiv}, primaryClass={cs.CV} }