Language Resources for Historical Newspapers: the Impresso Collection

Maud Ehrmann; Matteo Romanello; Simon Clematide; Philipp Ströbel; Raphaël Barman

doi:10.5281/zenodo.4641902

Published May 25, 2020 | Version v1

Conference paper | Open access

1. EPFL
2. UZH

Following decades of massive digitization, an unprecedented number of historical document facsimiles can now be retrieved and accessed via online cultural heritage portals. While this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and the real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

Files

Name	Size
2020-LREC-HistoricalNewspaper-ImpressoProject.pdf (md5:38b89f72507cd62d02c71b3dacca0132)	813.4 kB

Additional details

Swiss National Science Foundation
Media Monitoring of the Past CRSII5_173719

