Preserving Humanities Research Data: Data Depositing in the TextGrid Repository aka The Fluffy Import
Description
If preservation is use, what are the implications for a humanities research data infrastructure?
Preserving research data long-term and making it accessible and reusable for the scientific community is a fundamental concern of research infrastructures. This boils down to an alignment of expectations between data depositor and data recipient: the research data infrastructure, for example, has requirements relating to formats, metadata, responsibilities, and licenses.
Once research data has been successfully deposited and published in a repository, the next pitfall looms: the often unappealing presentation or constrained findability of the data. Applying the famous words of John Cotton Dana, if "preservation is use", the research data infrastructure has to emphasize the potential and reusability of its data.
The paper introduces the data depositing workflow of the TextGrid Repository (TGRep).
The TextGrid Repository
TGRep is a pioneer of the Digital Humanities in the German-speaking area. Today, TGRep is part of the portfolio of Text+, the NFDI consortium for language- and text-based research data in Germany. Each Text+ data center offers a workflow for incorporating research data that fits its scope, making it available for reuse.
ELTeC
ELTeC, the European Literary Text Collection, serves as an example of how a resource is identified, consulted on, ingested, transformed, enriched, published, and integrated into the Text+ portfolio, illustrating reusability and interoperability. ELTeC is a state-of-the-art, open-access multilingual collection of corpora of novels from several European traditions, developed among other purposes to support tools and methods in Computational Literary Studies. Currently, ELTeC contains more than 2,000 full-text novels in XML-TEI in 21 languages. They are distributed via multiple platforms (such as GitHub and Zenodo); 1,365 full texts in 15 languages are also published in TGRep.
The Data Depositing Workflow in TGRep
As TGRep is of great relevance to Text+ and its community, so is the task of minimizing the effort needed to publish data there. The solution implemented in Text+ is a workflow that automates the creation of the technical files required for import into the repository while allowing for as much manual intervention as needed.
To use the system, the user interacts with a web-based user interface running inside a Jupyter notebook. After the user specifies the location of the TEI files to be imported, the data is analyzed in an automated step that finds and extracts metadata common to all files and makes it available for verification and, if necessary, manual improvement. In a subsequent manual step, the user can check and edit the extracted metadata, and can also change how the metadata is identified (in which case the previous step can be re-executed). In the last step, the technical TGRep metadata files are generated.
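
A minimal sketch of what such an automated analysis step might look like in Python, assuming lxml for TEI parsing and a local folder of XML files; the selected header fields, the folder name, and the collect_common_metadata helper are illustrative assumptions, not the actual Text+ implementation:

    from pathlib import Path
    from lxml import etree

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    # A few teiHeader fields to extract (illustrative selection).
    FIELDS = {
        "title": "//tei:titleStmt/tei:title//text()",
        "author": "//tei:titleStmt/tei:author//text()",
        "language": "//tei:langUsage/tei:language/@ident",
    }

    def extract_metadata(path: Path) -> dict:
        """Extract the configured teiHeader fields from a single TEI file."""
        tree = etree.parse(str(path))
        return {
            name: " ".join(s.strip() for s in tree.xpath(xp, namespaces=TEI_NS) if s.strip())
            for name, xp in FIELDS.items()
        }

    def collect_common_metadata(folder: Path) -> dict:
        """Return field values shared by all TEI files in the folder:
        candidates for collection-level metadata the user then verifies."""
        records = [extract_metadata(p) for p in sorted(folder.glob("*.xml"))]
        common = {}
        for name in FIELDS:
            values = {record[name] for record in records}
            if len(values) == 1:  # identical in every file
                common[name] = values.pop()
        return common

    print(collect_common_metadata(Path("tei_files")))  # hypothetical input folder

Fields that vary between files are left for per-file metadata, while the shared values become defaults the user can confirm or correct in the manual step.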
The new workflow not only improves the data import process but also serves as a blueprint for further easy-to-build applications that combine libraries and notebooks and rely on the versatile JupyterLab environment, which can be deployed both locally and in the cloud.
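
To illustrate the notebook-based pattern, here is a hedged sketch of how the manual verification step could be rendered with ipywidgets inside JupyterLab; the widget layout and the review_metadata helper are assumptions for illustration, not the actual Text+ interface:

    import ipywidgets as widgets
    from IPython.display import display

    def review_metadata(metadata: dict) -> dict:
        """Show one editable text field per extracted metadata entry,
        so the user can verify and correct values before the technical
        TGRep import files are generated."""
        fields = {
            name: widgets.Text(value=value, description=name)
            for name, value in metadata.items()
        }
        display(widgets.VBox(list(fields.values())))
        return fields

    # After editing in the notebook, read back the corrected values:
    # corrected = {name: field.value for name, field in fields.items()}

The corrected values would then feed the final step, in which the technical TGRep metadata files are written out.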
Additional details
Created: 2024-04-25