Published September 2, 2021 | Version v1
Conference paper Open

Information Extraction from Invoices

  • 1. Université de La Rochelle
  • 2. Yooz

Description

This paper focuses on extracting information from key fields of invoices using two different methods based on sequence labeling. Invoices are semi-structured documents in which data can be located based on context. Common information extraction systems are model-driven, relying on heuristics and lists of trigger words curated by domain experts. Their performance is generally high on documents they have been trained for, but processing new templates often requires new manual annotations, which are tedious and time-consuming to produce. Recent work on deep learning applied to business documents has claimed gains in both time and performance. While these systems do not need manual curation, they nevertheless require a large amount of data to achieve good results. In this paper, we present a series of experiments using neural network approaches to study the trade-off between data requirements and performance in the extraction of information from key fields of invoices (such as dates, document numbers, types, and amounts). The main contribution of this paper is a system that achieves competitive results using a small amount of data, compared to state-of-the-art systems that need to be trained on large datasets, which are costly and impractical to produce in real-world applications.

Files

ICDAR_2021_Data_Extraction_from_Invoices.pdf

Size: 464.5 kB
md5:0d3e6aecc289779f273e7a6a9f85ab63

Additional details

Funding

European Commission
NewsEye - NewsEye: A Digital Investigator for Historical Newspapers 770299
Agence Nationale de la Recherche
IDEAS - International Document Engineering, Analysis and Security lab ANR-18-LCV3-0008