Poster Open Access

Merging data, the essence of creation of multi-layer corpora

Zipser, Florian; Frank, Mario; Schmolling, Jakob

Recent years have seen a growing number of multi-layer corpora. Such corpora allow the
analysis of phenomena that span multiple annotation layers. For example, corpora such as
TüBa-D/Z, PCC, FALKO and many others contain annotations on syntactic, rhetorical,
information-structural and other layers. These annotations were often created manually or
semi-automatically with different tools such as EXMARaLDA, TreeTagger and MMAX2.
These tools are powerful and usable, but unfortunately they offer only minimal
interoperability, which impedes the creation of multi-layer corpora. As a result, the layers of
such corpora often had to be merged by hand or by ad hoc scripts written for
a single use case, which could not easily be reused for other corpora.
With this poster we present a tool that merges several layers of annotations into a single
multi-layer corpus. When a multi-layer corpus is created, several analyses are based on the
same primary data and often also on the same tokenization.
We therefore start merging at the tokenization level and proceed bottom-up to merge the
higher levels of annotation. This concept is implemented as a module for the converter
framework Pepper, using the common meta-model Salt. Because it builds on Pepper, the
merging module can handle every format for which a Pepper import module exists. The
resulting multi-layer corpora can then be mapped to multi-layer formats such as
PAULA (Chiarcos et al. 2008) or GrAF (Ide & Suderman 2007), or imported
into ANNIS.
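
The bottom-up idea described above can be illustrated with a minimal sketch. It is not the
Pepper or Salt API: the `Layer` class and the `merge_layers` function are hypothetical names
introduced here, and the sketch assumes that both layers share identical primary text and
that tokens can be identified by their character offsets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Layer:
    """One annotation layer over a primary text (hypothetical structure)."""
    text: str
    # token (start, end) character offsets -> annotations on that token
    tokens: dict[tuple[int, int], dict[str, str]] = field(default_factory=dict)
    # higher-level nodes: (label, offsets of the tokens the node covers)
    spans: list[tuple[str, list[tuple[int, int]]]] = field(default_factory=list)


def merge_layers(base: Layer, other: Layer) -> Layer:
    """Merge `other` into a copy of `base`: tokens first, then higher nodes."""
    if base.text != other.text:
        raise ValueError("layers must share the same primary data")

    merged = Layer(
        text=base.text,
        tokens={span: dict(annos) for span, annos in base.tokens.items()},
        spans=list(base.spans),
    )

    # Step 1: tokenization level. Tokens with identical offsets are treated
    # as the same token, and their annotations are combined.
    for span, annos in other.tokens.items():
        merged.tokens.setdefault(span, {}).update(annos)

    # Step 2: bottom up. Higher-level nodes are carried over and stay
    # anchored to the shared tokens through the token offsets.
    for label, covered in other.spans:
        if any(tok not in merged.tokens for tok in covered):
            raise ValueError(f"span {label!r} refers to unknown tokens")
        merged.spans.append((label, list(covered)))

    return merged


# Example data (invented for illustration): a part-of-speech layer and a
# syntax layer over the same primary text and tokenization.
text = "Pepper merges layers"
pos = Layer(text, tokens={(0, 6): {"pos": "NE"},
                          (7, 13): {"pos": "VVFIN"},
                          (14, 20): {"pos": "NN"}})
syntax = Layer(text,
               tokens={(0, 6): {}, (7, 13): {}, (14, 20): {}},
               spans=[("NP", [(0, 6)]), ("VP", [(7, 13), (14, 20)])])
corpus = merge_layers(pos, syntax)
```

Keying tokens by character offsets keeps the sketch short. A real merger would also have to
deal with layers whose tokenizations differ, which this example deliberately leaves out.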

  • Dipper S. (2005). XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. In: Eckstein R., Tolksdorf R. (eds.) Berliner XML Tage.
  • Ide N. & Suderman K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. In: Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic.
  • Reznicek, M.; Lüdeling, A.; Krummes, C.; Schwantuschke, F.; Walter, M.; Schmidt, K.; Hirschmann, H.; Andreas, T. (2012). Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 2.01.
  • Schmid, H. (1995). Improvements in Part-of-Speech Tagging with an Application to German. Proceedings of the ACL SIGDAT-Workshop. Dublin, Ireland.
  • Stede M. (2004). The Potsdam commentary corpus. In: Proceedings of the 2004 ACL Workshop on Discourse Annotation (DiscAnnotation '04), Bonnie Webber and Donna Byron (Eds.). Association for Computational Linguistics, Stroudsburg, PA, USA, 96-102.
  • Telljohann, H.; Hinrichs, E. W.; Kübler, S.; Zinsmeister, H.; Beck, K. (2009). Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Universität Tübingen, Seminar für Sprachwissenschaft.
  • Zeldes, Amir, Ritz, Julia, Lüdeling, Anke & Chiarcos, Christian (2009). "ANNIS: A Search Tool for Multi-Layer Annotated Corpora". In: Proceedings of Corpus Linguistics 2009, July 20-23, Liverpool, UK.
  • Zipser F., Romary L. (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta.
  • Zipser F., Zeldes A., Ritz J., Romary L. & Leser U. (2011). Pepper: Handling a multiverse of formats. 33. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft. Göttingen, 23.-25. Februar 2011.