Data of Paper "Turning a Multilingual Historical Archive into an Information System through Post-OCR Correction and Content-Based Indexation"

Biblioteca Nacional de Catalunya; Eurecat - Technology Centre of Catalonia

doi:10.5281/zenodo.10201752

Published November 23, 2023 | Version v1

Dataset Open

We evaluated our approach on a collection of 946 historical documents held by the Biblioteca Nacional de Catalunya (BNC), spanning 1914 to 1951. Each document is an issue of a magazine and comprises several articles by different authors. As a result, despite the thematic focus of each magazine and issue, every document is somewhat heterogeneous. Magazines were selected for their relevance to art in general and, more specifically, to early 20th-century avant-garde movements (e.g., Dadaism and Cubism). For each document, we have a scan of the original artifact and the raw plain text extracted with the ABBYY FineReader OCR tool. To the best of our knowledge, this is the first Catalan-dominated OCR corpus ever released.

Files

dataset.zip

Files (7.6 GB)

Name	Size
dataset.zip md5:b8447a1b740ba1b60172e4b3252014ec	7.6 GB
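
Since the archive is large (7.6 GB), it may be worth checking the download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch, not part of the original record: the local path `dataset.zip` and the helper names `md5_of_file` and `verify_download` are assumptions made for illustration, and only the checksum value comes from this page.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the record page for dataset.zip.
EXPECTED_MD5 = "b8447a1b740ba1b60172e4b3252014ec"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the 7.6 GB archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("dataset.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.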

	All versions	This version
Views	86	86
Downloads	23	23
Data volume	174.7 GB	174.7 GB

Resource type

Dataset

Publisher

Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: November 23, 2023
Modified: November 23, 2023
