Published September 10, 2019 | Version v1
Presentation Open

Curation Technologies for a Cultural Heritage Archive: "Project Tongilbu"

  • 1. DFKI GmbH

Description

We are developing a platform for generic curation technologies based on various NLP procedures. It is targeted at, but not limited to, document collections that are too large for humans to read through manually. The aim is to provide prototypical NLP tools such as NER, entity linking, clustering and summarization to support rapid exploration of a data set.

In this particular submission, the data set in question is the result of "Project Tongilbu", a report funded by the Korean Ministry of Re-unification on the unification of East and West Germany in the 1990s. The majority of the content in this data set is in German, with small parts in Korean. Since the collection is a set of PDF files, we first apply OCR to extract machine-readable text.
Focusing on German, we then apply an NER model trained on Wikipedia data, look up the URIs of the recognized entities in the GND (Gemeinsame Normdatei, a German database of entities with additional information), perform temporal analysis and cluster documents according to the retrieved entities they contain. The results are then visualized in a curation dashboard.
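The clustering step can be illustrated with a minimal sketch. It assumes each document has already been reduced to the set of entity URIs found in it, and it groups documents whose entity sets overlap strongly (Jaccard similarity, single-link grouping). The similarity measure, the threshold, the function names and the placeholder identifiers are all assumptions made for illustration; the abstract does not state which clustering method the platform actually uses.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of entities two documents have in common (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_entities(doc_entities: dict, threshold: float = 0.5) -> list:
    """Group documents whose entity sets have Jaccard similarity >= threshold.

    Single-link grouping via union-find: two documents end up in the same
    cluster if a chain of sufficiently similar documents connects them.
    """
    parent = {doc: doc for doc in doc_entities}

    def find(doc):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for a, b in combinations(doc_entities, 2):
        if jaccard(doc_entities[a], doc_entities[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for doc in doc_entities:
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())


# Hypothetical input: file names and entity identifiers are placeholders,
# not real GND URIs.
doc_entities = {
    "doc1.pdf": {"gnd:EXAMPLE-A", "gnd:EXAMPLE-B"},
    "doc2.pdf": {"gnd:EXAMPLE-A", "gnd:EXAMPLE-B", "gnd:EXAMPLE-C"},
    "doc3.pdf": {"gnd:EXAMPLE-D"},
}
clusters = cluster_by_entities(doc_entities, threshold=0.5)
print(clusters)
```

In this toy input, doc1.pdf and doc2.pdf share two of three distinct entities (Jaccard similarity 2/3), so they fall into one cluster, while doc3.pdf shares none and stays on its own.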

Since support for Korean is limited, in terms of both tooling and training data, we experiment with machine translation for the Korean texts: we translate the text extracted from the PDFs, apply the German pipeline to the translation, and project the resulting annotations back onto the original Korean text.
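One simple way to think about the translate, annotate and project-back idea is at sentence level: each Korean sentence is translated, the translation is annotated, and the annotations are attached to the index of the original sentence. The sketch below assumes this sentence-level granularity, which is a simplification chosen for illustration; the abstract does not describe the actual projection method. The `translate` and `annotate` callables are hypothetical stand-ins (a lookup table and a tiny gazetteer), not a real MT system or NER model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SentenceAnnotation:
    source_index: int      # position of the sentence in the original text
    source_text: str       # original (Korean) sentence
    entities: List[str]    # entities found in the translated sentence


def project_annotations(
    source_sentences: List[str],
    translate: Callable[[str], str],
    annotate: Callable[[str], List[str]],
) -> List[SentenceAnnotation]:
    """Translate each sentence, annotate the translation, and attach the
    result to the original sentence it came from."""
    projected = []
    for index, sentence in enumerate(source_sentences):
        translated = translate(sentence)
        projected.append(SentenceAnnotation(index, sentence, annotate(translated)))
    return projected


# Toy stand-ins, for illustration only.
TOY_TRANSLATIONS = {
    "베를린은 독일의 수도이다.": "Berlin ist die Hauptstadt von Deutschland.",
}
TOY_GAZETTEER = {"Berlin", "Deutschland"}


def toy_translate(sentence: str) -> str:
    # Stand-in for a machine translation system: plain dictionary lookup.
    return TOY_TRANSLATIONS[sentence]


def toy_annotate(text: str) -> List[str]:
    # Stand-in for the German NER step: match tokens against a small word list.
    tokens = [token.strip(".,") for token in text.split()]
    return [token for token in tokens if token in TOY_GAZETTEER]


result = project_annotations(list(TOY_TRANSLATIONS), toy_translate, toy_annotate)
print(result)
```

Projecting at sentence level avoids the need for word alignment, at the cost of not marking where in the Korean sentence an entity occurs; a finer-grained projection would need alignment information from the translation step.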

Files

utrecht_DHW_2019.pdf (1.6 MB)
md5:784a78672ff6ea92d27f01a17480afba