Data processing of ILS data to facilitate a new discovery layer for the German Literature Archive (DLA)
Kallías, the OPAC of the German Literature Archive in Marbach, is used by scholars worldwide as an information system and for access to literary sources. It provides five entry points to the collections: manuscripts, library objects, images and objects, holdings, and names, reflecting the high-quality cataloging done in the different divisions of the institution. Since 2017, a new discovery layer has been under development to integrate all sources into a cross-media, tailor-made online catalog. Although it uses a classic Solr-based (non-linked-data) approach, the new catalog makes productive use of authority data and of relationships between works and special collections.
The new catalog is still in closed beta and is scheduled for release at the end of 2019. The presentation will focus on the custom data processing pipeline, which is based on the open source tools Pandas (a Python library) and OpenRefine. Every day, 4 million records are extracted from the local ILS, transformed into a tabular format, manipulated with custom rulesets, enriched with external data sources, and loaded into a Solr index. The pipeline is orchestrated with simple Bash shell scripts, which makes it easy to extend the workflow with other command-line tools. Making legacy ILS data available in OpenRefine enables library staff to use their data in other contexts (e.g. for digitization projects) and to publish it in different formats (e.g. EAD-XML for the Kalliope union catalog).
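
To make the transform, enrich, and load-preparation steps more concrete, the following is a minimal Pandas sketch, not the DLA's actual pipeline. The field names (`id`, `title`, `author_id`, `type`), the sample records, the authority table, and the function names are all hypothetical and chosen for illustration only. Sending the resulting documents to Solr and the OpenRefine steps are deliberately left out.

```python
import pandas as pd

# Hypothetical records, standing in for an export from the local ILS.
raw_records = [
    {"id": "1", "title": " Die Verwandlung ", "author_id": "a1", "type": "library"},
    {"id": "2", "title": "Brief an Max Brod", "author_id": "a1", "type": "manuscript"},
    {"id": "3", "title": "Portrait", "author_id": None, "type": "image"},
]

# Hypothetical authority data, standing in for an external data source.
authority = pd.DataFrame([{"author_id": "a1", "author_name": "Kafka, Franz"}])


def transform(records):
    """Turn the records into a table and apply a simple cleanup rule."""
    df = pd.DataFrame.from_records(records)
    df["title"] = df["title"].str.strip()  # example rule: trim whitespace
    return df


def enrich(df, authority_df):
    """Add authority names; a left join keeps records without a match."""
    return df.merge(authority_df, on="author_id", how="left")


def to_solr_docs(df):
    """Convert rows to dictionaries, dropping fields with missing values."""
    return [
        {key: value for key, value in row.items() if pd.notna(value)}
        for row in df.to_dict(orient="records")
    ]


docs = to_solr_docs(enrich(transform(raw_records), authority))
for doc in docs:
    print(doc)
```

Each stage is a plain function that takes and returns tabular data, so a script like this could be called from a Bash wrapper alongside other command-line tools, in the spirit of the orchestration described above.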