Published November 23, 2023 | Version v1
Dataset Open

Data of Paper "Turning a Multilingual Historical Archive into an Information System through Post-OCR Correction and Content-Based Indexation"

Description

We evaluated our approach on a collection of 946 historical documents belonging to the Biblioteca Nacional de Catalunya (BNC), spanning the years 1914 to 1951. Each document is an issue of a magazine and comprises several articles by different authors. As a result, each document is somewhat heterogeneous, even though magazines and individual issues tend to be thematic. Magazines were selected for their relevance to art in general and, more specifically, to early 20th-century avant-garde movements (e.g., Dadaism and Cubism). For each document, we provide the scan of the original artifact and the raw plain text extracted with the ABBYY FineReader OCR tool. To the best of our knowledge, this is the first Catalan-dominated OCR corpus ever released.

Files

dataset.zip

Files (7.6 GB)

Name: dataset.zip
Size: 7.6 GB
md5:b8447a1b740ba1b60172e4b3252014ec
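
Because the archive is large, it may be worth checking the download against the md5 value listed for dataset.zip before unpacking it. The following is a minimal sketch of such a check, not part of the dataset itself. The helper names `md5_of_file` and `matches_expected` are hypothetical, and the script assumes the archive was saved as `dataset.zip` in the current working directory.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record for dataset.zip.
EXPECTED_MD5 = "b8447a1b740ba1b60172e4b3252014ec"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected one (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    archive = Path("dataset.zip")
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the archive again is the simplest remedy.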