Published May 25, 2020 | Version v1
Conference paper Open

Language Resources for Historical Newspapers: the Impresso Collection

Description

Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet, the application of text processing tools on historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

Files

2020-LREC-HistoricalNewspaper-ImpressoProject.pdf (813.4 kB)
md5:38b89f72507cd62d02c71b3dacca0132

Additional details

Funding

Swiss National Science Foundation
Media Monitoring of the Past (grant CRSII5_173719)