Consolidating the heterogeneous landscape of literary corpora
Authors/Creators
Description
The Computational Literary Studies Infrastructure (CLS Infra) project aims to consolidate the heterogeneous landscape of data and tools in the CLS domain, with a specific focus on literary corpora. This contribution describes a data integration task against the background of developing a bespoke data model.
As part of the project (D6.1, Ďurčo et al., 2022), a metadata inventory of literary corpora was compiled to provide an overview of existing datasets, including their formats and modes of access. Based on the information derived from this initial collection of metadata, and taking the Metamodel for Corpus Metadata (MKM; Odebrecht, 2018) as a conceptual starting point, the CLSCor ontology has been developed within the project as a unifying conceptual model to describe various aspects of literary texts. It introduces three basic entities: Corpus, Corpus Document and Feature. Feature is a generic mechanism for capturing any specific characteristic of manifestations of literary works or collections thereof. Such characteristics can be structural or semantic phenomena, such as paragraphs, distinct speakers in a drama or verses in a poem. To guarantee its interoperability, the CLSCor ontology is based on CIDOC CRM and its extensions CRMdig and LRMoo. The CLSCor ontology will be accompanied by a set of controlled vocabularies (modeled in SKOS) that are currently in development.
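The relationship between the three basic entities can be pictured with a small, purely illustrative sketch. The class layout, the field names and the `count` attribute are assumptions made for this example; they are not taken from the CLSCor ontology, which is expressed in terms of CIDOC CRM, CRMdig and LRMoo rather than Python classes.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A generic characteristic of a document (hypothetical shape)."""
    kind: str   # e.g. "paragraph", "speaker", "verse"
    count: int  # assumed here: how often the phenomenon occurs


@dataclass
class CorpusDocument:
    title: str
    features: list[Feature] = field(default_factory=list)


@dataclass
class Corpus:
    name: str
    documents: list[CorpusDocument] = field(default_factory=list)

    def feature_total(self, kind: str) -> int:
        """Sum one kind of feature across all documents of the corpus."""
        return sum(
            f.count
            for doc in self.documents
            for f in doc.features
            if f.kind == kind
        )


# Invented sample data, only to show how the entities nest.
play = CorpusDocument("Example play", [Feature("speaker", 12), Feature("paragraph", 40)])
poem = CorpusDocument("Example poem", [Feature("verse", 14)])
corpus = Corpus("Example corpus", [play, poem])
print(corpus.feature_total("speaker"))
```

Because Feature is generic, the same structure can hold genre-specific phenomena (speakers, verses) without a separate class per genre.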
To validate the applicability of the proposed ontology, data from three distinct datasets covering the main literary genres were processed: novels (ELTeC), drama (DraCor), and poetry (POSTDATA). These sample datasets represent not only the different genres but also three different underlying data structures and ways of providing the data, which makes them well suited as a proof of concept. The general workflow comprised mapping the respective source models to the CLSCor data model, transforming the data into RDF using custom scripts, and merging the converted datasets into one consolidated knowledge graph. Given the distinct ways in which these three datasets were offered, a custom solution was implemented for each:
ELTeC data, exposed as TEI files in a GitHub repository, was obtained via the GitHub API. Relevant information was extracted using XPath expressions and transformed into CLSCor-compliant RDF via a Python script.
Data from DraCor (TEI-encoded corpora of plays) was integrated by transforming the JSON data returned by the DraCor API into RDF with a custom Python script.
In the case of POSTDATA, the legacy tool "Horace", originally used in the project's data generation workflow, has been adapted to generate RDF data conforming to the CLSCor ontology.
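The ELTeC step described above (XPath extraction from TEI followed by RDF generation) could look roughly like the following minimal sketch. It is not the project's script: it uses only the Python standard library, works on an invented inline TEI sample instead of files fetched from GitHub, and the `https://example.org/...` namespaces and property names are placeholders rather than actual CLSCor IRIs.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
# Hypothetical namespaces for illustration; not the published CLSCor IRIs.
EX = "https://example.org/clscor/"
DATA = "https://example.org/data/"

# Invented sample document standing in for a TEI file from a repository.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="DOC001">
  <teiHeader><fileDesc><titleStmt>
    <title>An Example Novel</title>
    <author>Doe, Jane</author>
  </titleStmt></fileDesc></teiHeader>
  <text><body><div type="chapter"><p>One.</p><p>Two.</p></div></body></text>
</TEI>"""


def literal(value: str) -> str:
    """Serialize a plain string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def tei_to_ntriples(tei_xml: str) -> list[str]:
    """Extract a few metadata fields via XPath and emit N-Triples lines."""
    root = ET.fromstring(tei_xml)
    doc = f"<{DATA}{root.get(XML_ID)}>"
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    author = root.findtext(".//tei:titleStmt/tei:author", namespaces=TEI_NS)
    paragraphs = root.findall(".//tei:body//tei:p", TEI_NS)
    return [
        f"{doc} <{RDF_TYPE}> <{EX}CorpusDocument> .",
        f"{doc} <{EX}title> {literal(title)} .",
        f"{doc} <{EX}author> {literal(author)} .",
        f'{doc} <{EX}paragraphCount> "{len(paragraphs)}"^^<{XSD_INTEGER}> .',
    ]


triples = tei_to_ntriples(SAMPLE_TEI)
print("\n".join(triples))
```

A production workflow would more likely rely on a full XPath engine and an RDF library for serialization; the standard-library version is used here only to keep the sketch self-contained.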
After the milestone of merging three distinct sources into one graph, the graph is being evaluated for coherence, consistency and usability for exploration via a discovery application. The outcomes of this evaluation will inform further iterations and the integration of further sources.
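As a rough illustration of merging and of one possible consistency check, the sketch below treats each converted dataset as a set of triples, merges them by set union, and lists documents that lack a title. The prefixed names (`ex:CorpusDocument`, `ex:title`), the sample triples and the check itself are assumptions made for this example; the source does not describe how the evaluation is actually carried out, and the union shortcut ignores blank nodes.

```python
Triple = tuple[str, str, str]

RDF_TYPE = "rdf:type"
DOCUMENT = "ex:CorpusDocument"  # hypothetical class name
TITLE = "ex:title"              # hypothetical property name


def merge(*graphs: set[Triple]) -> set[Triple]:
    """Merge graphs that share IRIs by taking the union of their triples."""
    merged: set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def documents_missing(graph: set[Triple], predicate: str) -> list[str]:
    """Return typed documents that have no value for the given predicate."""
    documents = {s for s, p, o in graph if p == RDF_TYPE and o == DOCUMENT}
    described = {s for s, p, _ in graph if p == predicate}
    return sorted(documents - described)


# Invented sample triples standing in for the three converted datasets.
novels = {("data:novel1", RDF_TYPE, DOCUMENT), ("data:novel1", TITLE, "A Novel")}
plays = {("data:play1", RDF_TYPE, DOCUMENT), ("data:play1", TITLE, "A Play")}
poems = {("data:poem1", RDF_TYPE, DOCUMENT)}  # title deliberately absent

graph = merge(novels, plays, poems)
print(documents_missing(graph, TITLE))
```

Checks of this kind, run over the merged graph rather than the individual sources, can surface gaps that only matter once the datasets are queried together.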
In addition to a rich catalogue of literary corpora, another envisioned outcome is a set of robust conversion workflows for a variety of data sources and formats, accompanied by tutorials and promoted in training events to foster their reuse.
Files
Consolidating the heterogeneous landscape of literary corpora.pdf