CLS INFRA D6.3 Standards beyond TEI / Extended Transformation Matrix / Alternative Formats
- 1. Austrian Academy of Sciences, Austrian Centre for Digital Humanities and Cultural Heritage
- 2. University of Potsdam
Description
This deliverable builds on and extends the findings of D6.1 "Inventory of existing data sources and formats", which surveyed the landscape of literary corpora, and of D8.1 "Tools for NLP", which catalogued the tools used in the context of CLS. Focusing on the wealth of formats used for encoding and processing text, it offers a comprehensive overview of common formats for encoding textual data beyond TEI, the "lingua franca", in both computational literary studies and computational linguistics, and highlights potential discrepancies in approach between these two areas of research. The overview reveals a highly heterogeneous landscape with a plethora of formats devised for differing tasks, from the philological encoding of historical text material to the computational annotation and processing of text.
Considering interoperability an indispensable key to reusability, the deliverable explores the challenges of, and approaches to, converting between formats.
This compilation is intended as input for further developing the Transformation Matrix, introduced in D6.1, which shall serve as a conceptual framework for consolidating existing format-conversion solutions in the Transformation Toolbox, to be delivered by the end of the project (D6.2). The Transformation Matrix shall make it possible to capture information about specific data structures (features) present in datasets, as well as data structures required or produced by tools. This requires a sufficiently expressive formalised description, which is proposed in the CLSCor data model.
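As an illustration only, the idea of recording which features a dataset contains and which features a tool requires or produces might be sketched as follows. Every name here (`Dataset`, `Tool`, `missing_features`, `apply_tool`, and feature labels such as `tokens`) is hypothetical. None of it is taken from the deliverable, the Transformation Matrix, or the CLSCor data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record: a dataset and the features it contains."""
    name: str
    features: frozenset


@dataclass(frozen=True)
class Tool:
    """Hypothetical record: a tool with required and produced features."""
    name: str
    requires: frozenset
    produces: frozenset = field(default_factory=frozenset)


def missing_features(dataset: Dataset, tool: Tool) -> frozenset:
    """Return the features the tool requires that the dataset lacks."""
    return tool.requires - dataset.features


def apply_tool(dataset: Dataset, tool: Tool) -> Dataset:
    """Describe the dataset that results from running the tool.

    Raises ValueError if the dataset lacks a required feature.
    """
    gaps = missing_features(dataset, tool)
    if gaps:
        raise ValueError(f"{tool.name} cannot run, missing: {sorted(gaps)}")
    return Dataset(
        name=f"{dataset.name}+{tool.name}",
        features=dataset.features | tool.produces,
    )


# Invented example entries, for illustration only.
corpus = Dataset("example-corpus", frozenset({"paragraphs", "sentences"}))
tokenizer = Tool("example-tokenizer", frozenset({"sentences"}), frozenset({"tokens"}))
tagger = Tool("example-tagger", frozenset({"tokens"}), frozenset({"pos"}))

tokenized = apply_tool(corpus, tokenizer)
```

In this sketch the tagger cannot run on `corpus` directly because the `tokens` feature is absent, but it can run on `tokenized`. A formalised description would need far richer feature definitions than plain string labels.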
Files
D6.3_Standards_beyond_TEI.pdf (897.2 kB)
md5:5cdd5f629f138d819e6f7aba06b54c89