Published July 27, 2018 | Version 1.0
Dataset Open

Oficio de Hipotecas de Girona. A dataset of Spanish notarial deeds (18th Century) for Handwritten Text Recognition and Layout Analysis of historical documents.

Description

This dataset is a subset of 596 documents from the Registre d'Hipoteques de Girona of 1769 collection, held by the Arxiu Històric de Girona. The collection is composed of hundreds of thousands of notarial deeds from the 18th and 19th centuries (1768-1862). Sales, redemption of censuses, inheritance and matrimonial chapters are among the most common documentary typologies in the collection.

This dataset is composed of more than 23,700 text lines written by a single hand, covering more than 50 different topics (documentary typologies) and a vocabulary of more than 2,400 different words. The documents are transcribed following the so-called diplomatic criteria. Additionally, the transcripts were tagged with complementary information (e.g. expansion of abbreviations, hyphen marks, etc.). Along with the transcripts, the layout of each document was detected and recorded. Pages have been labeled using six different layout regions.

The images, along with their respective ground truth, were compiled in PAGE-compliant XML format by the Centre de Recerca d'Història Rural and the HTR group of the Pattern Recognition and Human Language Technologies Research Center.
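As a rough illustration of how PAGE XML ground truth of this kind can be read, the following sketch lists the text lines of one page using only the Python standard library. The element and attribute names (TextRegion, TextLine, Coords with a points attribute, TextEquiv/Unicode) come from the general PAGE schema. The description does not say which schema version or region labels this dataset uses, so the namespace is taken from the file itself and the function name read_page_lines is purely illustrative.

```python
import xml.etree.ElementTree as ET


def read_page_lines(xml_source):
    """Return one dict per TextLine found in a PAGE XML file.

    xml_source may be a file path or a binary file-like object.
    """
    root = ET.parse(xml_source).getroot()

    # The root tag looks like "{namespace-uri}PcGts". Reusing that URI avoids
    # hard-coding a PAGE schema version, which the description does not state.
    ns_uri = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""

    def q(tag):
        # Qualify a tag name with the namespace found on the root element.
        return f"{{{ns_uri}}}{tag}" if ns_uri else tag

    lines = []
    for region in root.iter(q("TextRegion")):
        # findall() only returns direct children, so lines inside nested
        # regions are reported once, under their innermost region.
        for line in region.findall(q("TextLine")):
            coords = line.find(q("Coords"))
            unicode_el = line.find(f"{q('TextEquiv')}/{q('Unicode')}")
            lines.append({
                "region_id": region.get("id"),
                "line_id": line.get("id"),
                # Polygon as a string "x1,y1 x2,y2 ..." (None if absent).
                "points": coords.get("points") if coords is not None else None,
                "text": (unicode_el.text or "") if unicode_el is not None else "",
            })
    return lines
```

Calling read_page_lines on the path of one ground-truth XML file should return the line identifiers, polygons and transcripts of that page. Any tags embedded in the transcripts (abbreviation expansions, hyphen marks) are returned verbatim, since their encoding is not specified here.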

Notes

This work is partially funded by the READ project (Ref. 674943), the Spanish Ministry of Science and Innovation project HAR2014-54891-P/HIST, ICREA Acadèmia 2013 and the Fundación BBVA project EXPLORHIST.

Files

Files (16.3 GB)

Size: 16.3 GB
md5:d736e4f6271424d7e40c192c19c1c947
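Given the size of the download, it may be worth checking the file against the MD5 checksum listed above before unpacking it. The sketch below is one way to do so with Python's hashlib, reading the file in chunks so that it never has to fit in memory. The file name is not given in this listing, so the path passed to the function is up to the reader, and the helper names are illustrative.

```python
import hashlib

# Checksum copied from the file listing of this record.
EXPECTED_MD5 = "d736e4f6271424d7e40c192c19c1c947"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

A result of False from matches_published_checksum usually points to an incomplete or corrupted download.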

Additional details

Funding

READ – Recognition and Enrichment of Archival Documents 674943
European Commission