Unlocking the Corpus: Enriching Metadata with State-of-the-Art NLP Methodology and Linked Data

Ecker, Jennifer; Fischer, Stefan; Schwarz, Pia; Trippel, Thorsten; Werthmann, Antonina; Wilm, Rebecca

doi:10.5281/zenodo.13985646

Published October 15, 2024 | Version v1

Poster Open

1. Leibniz Institute for the German Language
2. Saarland University

In research data management, descriptive metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles (Wilkinson et al., 2016). Extracting semantic metadata from textual research data is currently not part of most metadata workflows, particularly when a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. Our approach is to add semantic metadata at the text level to facilitate searching the data. We show how to enrich metadata with three NLP methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or described by certain keywords, or to identify people, places, and organisations mentioned in texts without having to read them, while also facilitating the creation of task-tailored subcorpora. To improve the usability of the data in this way, we explore options based on the German Reference Corpus DeReKo, the largest linguistically motivated collection of German language material (Kupietz & Keibel, 2009; Kupietz et al., 2010, 2018), which contains newspapers, books, transcriptions, and other material, and enrich its metadata at the level of subportions, i.e. newspaper articles.
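To make the idea of text-level enrichment concrete, the following is a minimal sketch of one of the three methods, keyword extraction, using a plain TF-IDF ranking from the Python standard library. It is not the method used on the poster, which the abstract does not specify. The article identifiers, the toy texts, the `keywords` metadata field, and the function names `tfidf_keywords` and `enrich_metadata` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy articles; real DeReKo texts and metadata fields are not shown in the source.
ARTICLES = {
    "art-001": "Der Bundestag debattiert über den Haushalt. Der Haushalt bleibt umstritten.",
    "art-002": "Der Verein gewinnt ein Spiel. Das Spiel endet spät.",
}


def tokenize(text):
    """Lowercase the text and return its word tokens (letters, umlauts, and ß only)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def tfidf_keywords(articles, top_n=1):
    """Return {article_id: [keywords]}, ranked by TF-IDF score within each article."""
    tokens = {aid: tokenize(text) for aid, text in articles.items()}
    n_docs = len(tokens)

    # Document frequency: in how many articles does each term occur?
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))

    keywords = {}
    for aid, toks in tokens.items():
        term_freq = Counter(toks)
        scores = {
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[aid] = ranked[:top_n]
    return keywords


def enrich_metadata(metadata, keywords):
    """Return a copy of the metadata records with an assumed 'keywords' field added."""
    return {
        aid: {**record, "keywords": keywords.get(aid, [])}
        for aid, record in metadata.items()
    }


if __name__ == "__main__":
    kw = tfidf_keywords(ARTICLES)
    metadata = {aid: {"source": "toy example"} for aid in ARTICLES}
    for aid, record in enrich_metadata(metadata, kw).items():
        print(aid, record)
```

With keywords stored per article, a subcorpus on a given topic could be selected by filtering the metadata records instead of reading the texts. A production workflow would more plausibly rely on established NLP libraries and on German stop-word handling, neither of which this sketch attempts.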

Files

Files (450.7 kB)

Name	Size
CLARIN2024_Poster_UnlockingCorpus_Textplus.pdf (md5:737270af7499fdfc9633f49f607c6f0a)	450.7 kB

