Merging data, the essence of creation of multi-layer corpora

doi:10.5281/zenodo.15640

Published March 6, 2014 | Version v1

Poster Open

1. Humboldt-Universität zu Berlin
2. Universität Potsdam

Recent years have seen a growing number of multi-layer corpora. Such corpora allow the analysis of phenomena that span multiple annotation layers. For example, TüBa-D/Z (see: http://www.sfs.uni-tuebingen.de/ascl/ressourcen/corpora/tueba-dz.html), PCC (see: http://www.ling.uni-potsdam.de/acl-lab/Forsch/pcc/pcc.html), FALKO (see: https://u.hu-berlin.de/falko) and many other corpora contain annotations on syntactic, rhetorical, information-structural and other layers. Often, these annotations were created manually or semi-automatically with different tools such as EXMARaLDA (see: http://www.exmaralda.org/), TreeTagger (see: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) and MMAX2 (see: http://mmax2.sourceforge.net/).
These tools are powerful and usable but unfortunately offer only minimal interoperability, which impedes the creation of multi-layer corpora. As a result, the layers of such corpora often had to be merged by hand or by ad hoc scripts written for a single use case, which could not easily be reused for other corpora.
With this poster we present a tool which merges several layers of annotations into a single multi-layer corpus. When a multi-layer corpus is created, several analyses are based on the same primary data and often also on the same tokenization.
We started by merging the data at the tokenization level and then proceeded bottom-up to merge higher levels of annotation as well. This concept is implemented in a module for the converter framework Pepper (see: https://u.hu-berlin.de/saltnpepper), using the common meta-model Salt. Because it is built on Pepper, the merging module can handle all formats that can be imported by a Pepper module. Multi-layer corpora can then be mapped to multi-layer formats like PAULA (Chiarcos et al. 2008) or GrAF (Ide & Suderman 2007), or imported into ANNIS.
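
The token-level step described above can be illustrated with a minimal Python sketch. It is not the Pepper module and does not use the Salt API; the names `Token`, `Layer` and `merge_layers` are hypothetical and exist only for this illustration. It assumes that every layer refers to the same primary text and that tokens are identified by their character offsets, so tokens with identical offsets are treated as the same token. Higher annotation levels (for example spans or trees over tokens) are not covered here.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    # Character offsets into the shared primary text (end is exclusive).
    start: int
    end: int
    annotations: dict = field(default_factory=dict)


@dataclass
class Layer:
    # Hypothetical container: one annotation layer over one primary text.
    name: str
    text: str
    tokens: list


def merge_layers(layers):
    """Merge token annotations of several layers over the same primary text.

    Tokens with identical offsets are unified; their annotations are
    combined and prefixed with the name of the layer they came from.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    text = layers[0].text
    if any(layer.text != text for layer in layers):
        raise ValueError("layers are not based on the same primary data")

    merged = {}  # (start, end) -> combined annotations
    for layer in layers:
        for tok in layer.tokens:
            annos = merged.setdefault((tok.start, tok.end), {})
            for key, value in tok.annotations.items():
                annos[f"{layer.name}:{key}"] = value

    tokens = [Token(s, e, a) for (s, e), a in sorted(merged.items())]
    return Layer("merged", text, tokens)


if __name__ == "__main__":
    text = "Merging helps"
    tagger = Layer("tagger", text,
                   [Token(0, 7, {"pos": "NOUN"}), Token(8, 13, {"pos": "VERB"})])
    lemmatizer = Layer("lemmatizer", text,
                       [Token(0, 7, {"lemma": "merging"}), Token(8, 13, {"lemma": "help"})])
    for tok in merge_layers([tagger, lemmatizer]).tokens:
        print(text[tok.start:tok.end], tok.annotations)
```

Prefixing each annotation key with its layer name is one simple way to keep annotations from different tools apart after merging; tokens whose offsets differ between layers are kept as separate tokens in this sketch rather than being reconciled.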

Files

DGfS2014_poster_zipser_et_al.pdf (561.0 kB)
md5:718e40c8487fbdd390cf3400e5cbd8ad

Additional details

Dipper, S. (2005). XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. In: Eckstein, R., Tolksdorf, R. (eds.) Berliner XML Tage.
Ide, N. & Suderman, K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. In: Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic.
Schmid, H. (1995). Improvements in Part-of-Speech Tagging with an Application to German. In: Proceedings of the ACL SIGDAT-Workshop. Dublin, Ireland.
Stede, M. (2004). The Potsdam Commentary Corpus. In: Proceedings of the 2004 ACL Workshop on Discourse Annotation (DiscAnnotation '04), Bonnie Webber and Donna Byron (eds.). Association for Computational Linguistics, Stroudsburg, PA, USA, 96-102.
Telljohann, H., Hinrichs, E. W., Kübler, S., Zinsmeister, H., Beck, K. (2009). Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Universität Tübingen, Seminar für Sprachwissenschaft.
Reznicek, M., Lüdeling, A., Krummes, C., Schwantuschke, F., Walter, M., Schmidt, K., Hirschmann, H., Andreas, T. (2012). Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 2.01.
Zeldes, A., Ritz, J., Lüdeling, A. & Chiarcos, C. (2009). ANNIS: A Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009, July 20-23, Liverpool, UK.
Zipser, F. & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/
Zipser, F., Zeldes, A., Ritz, J., Romary, L. & Leser, U. (2011). Pepper: Handling a multiverse of formats. 33. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft. Göttingen, 23.-25. Februar 2011.

