Poster Open Access
Druskat, Stephan; Krause, Thomas; Odebrecht, Carolin; Zipser, Florian
The creation and analysis of corpus linguistic resources can be a costly and error-prone process. Apart from the complexity of the annotation process itself, there are technical obstacles to overcome: individual tools have to be combined in a common workflow, and different formats have to be taken into account. This poster presents a family of well-aligned open-source tools which support the conversion, annotation, and analysis of linguistic corpora, and help secure their long-term accessibility, in a complete workflow.
The interoperability of these tools is guaranteed by the use of a common data model – Salt (Zipser & Romary, 2010) – which, among other things, is used as an intermediate model for the conversion framework Pepper (Zipser et al., 2011). With Pepper, many linguistic formats can be converted into each other, thereby allowing existing data to be included in the workflow. The support for a multitude of linguistic formats allows for the replacement of single components as well as the integration of further tools into the workflow presented here.
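The role of an intermediate model in conversion can be illustrated with a minimal sketch. It is not Pepper's or Salt's actual API: the `CommonDocument` class, the two toy formats ("tsv" and "slash"), and all function names are hypothetical and chosen only for illustration. The point is that each format needs just one importer and one exporter against the shared model, rather than a dedicated converter for every pair of formats.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CommonDocument:
    """Hypothetical stand-in for a shared data model (not Salt's real API)."""
    tokens: List[str] = field(default_factory=list)
    # Maps token index -> annotation name -> annotation value.
    annotations: Dict[int, Dict[str, str]] = field(default_factory=dict)


def import_tsv(text: str) -> CommonDocument:
    """Read a toy format with one 'token<TAB>pos' pair per line."""
    doc = CommonDocument()
    for index, line in enumerate(text.strip().splitlines()):
        token, pos = line.split("\t")
        doc.tokens.append(token)
        doc.annotations[index] = {"pos": pos}
    return doc


def import_slash(text: str) -> CommonDocument:
    """Read a toy format with whitespace-separated 'token/pos' items."""
    doc = CommonDocument()
    for index, item in enumerate(text.split()):
        token, pos = item.rsplit("/", 1)
        doc.tokens.append(token)
        doc.annotations[index] = {"pos": pos}
    return doc


def export_tsv(doc: CommonDocument) -> str:
    """Write the common model back out as 'token<TAB>pos' lines."""
    return "\n".join(
        f"{token}\t{doc.annotations[index]['pos']}"
        for index, token in enumerate(doc.tokens)
    )


def export_slash(doc: CommonDocument) -> str:
    """Write the common model back out as 'token/pos' items."""
    return " ".join(
        f"{token}/{doc.annotations[index]['pos']}"
        for index, token in enumerate(doc.tokens)
    )


# One importer and one exporter per format; every pair is then convertible.
IMPORTERS: Dict[str, Callable[[str], CommonDocument]] = {
    "tsv": import_tsv,
    "slash": import_slash,
}
EXPORTERS: Dict[str, Callable[[CommonDocument], str]] = {
    "tsv": export_tsv,
    "slash": export_slash,
}


def convert(text: str, source: str, target: str) -> str:
    """Convert between formats by going through the common model."""
    return EXPORTERS[target](IMPORTERS[source](text))
```

Under these assumptions, adding a further format means writing one new importer and exporter, after which it can be converted to and from every format already registered.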
The annotation of corpora is carried out in Atomic (Druskat et al., 2014), an extensible annotation platform. Atomic also utilizes Salt – in this case as its concrete data model – and thus allows for theory-neutral annotation which is independent of tagsets and annotation types. By embedding Pepper, it supports a wide variety of source formats for further annotation, as well as target formats for export. Additionally, its plugin-based architecture makes it possible to easily extend the software, e.g., with additional editors, data views, or processing components. For a new annotation type, for instance, a dedicated editor can thus be created and integrated.
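The idea of integrating a dedicated editor for a new annotation type can be sketched as a simple registry. This is only an illustration of the general plugin pattern and does not reflect Atomic's actual extension mechanism; the names `EDITORS`, `register_editor`, and `open_editor`, as well as the annotation type "pos", are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: annotation type -> editor callable.
EDITORS: Dict[str, Callable[[str], str]] = {}


def register_editor(annotation_type: str):
    """Return a decorator that registers an editor for one annotation type."""
    def decorator(editor: Callable[[str], str]) -> Callable[[str], str]:
        EDITORS[annotation_type] = editor
        return editor
    return decorator


@register_editor("pos")
def pos_editor(token: str) -> str:
    """Toy editor that only reports which token it would edit."""
    return f"[POS editor] {token}"


def open_editor(annotation_type: str, token: str) -> str:
    """Look up the editor registered for the annotation type and invoke it."""
    editor = EDITORS.get(annotation_type)
    if editor is None:
        raise KeyError(f"no editor registered for {annotation_type!r}")
    return editor(token)
```

In this sketch, supporting a new annotation type requires no change to the core: a plugin registers another editor under a new key, and `open_editor` picks it up.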
At any point in the annotation process, the annotated data can be transferred to the search and visualization tool ANNIS (Krause & Zeldes, 2014) for visualization and analysis. Findings from the analysis can then feed back into the annotation process. When a corpus is ready for publication, it can be released in different formats to a public repository, for example the LAUDATIO-Repository (Odebrecht et al., 2015) in the case of historical text corpora. Third parties can then download, reference, and re-use the data.