Poster Open Access

Merging data, the essence of creation of multi-layer corpora

Zipser, Florian; Frank, Mario; Schmolling, Jakob

The last couple of years have shown an increasing number of multi-layer corpora. Such
corpora allow the analysis of phenomena that span multiple annotation layers. For
example, corpora like TueBaDZ, PCC, FALKO and many others contain annotations on
syntactic, rhetorical, information-structural and other layers. Often, annotations were
created manually or semi-automatically with different tools like EXMARaLDA, TreeTagger
and MMAX2.
These tools are powerful and usable, but unfortunately provide only a minimum of
interoperability, which impedes the creation of multi-layer corpora. Thus, the layers of
such corpora often had to be merged by hand or by special-purpose scripts written for
just one use case, which could not easily be reused for other corpora.
With this poster we present a tool which merges several layers of annotations into a single
multi-layer corpus. When creating a multi-layer corpus, several analyses are based on the
same primary data and often also on the same tokenization.
We started by merging the data on the tokenization level and then proceeded bottom-up to
merge the higher levels of annotation as well. This concept is implemented in a module for
the converter framework Pepper, using the common meta-model Salt. By using Pepper, the
merging module is able to handle all formats that can be imported by a Pepper module. The
resulting multi-layer corpora can then be mapped into multi-layer formats like PAULA
(Chiarcos et al. 2008) or GrAF (Ide & Suderman 2007), or can be imported into ANNIS.
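
The bottom-up idea described above can be illustrated with a minimal sketch in Python. It
does not use the Pepper or Salt API; the Layer class, the merge_layers function and the
sample annotations are hypothetical and serve only to show the principle: tokens are first
aligned by their character offsets into the shared primary text, and higher-level nodes
are carried over afterwards if all tokens they cover could be aligned.

```python
from dataclasses import dataclass, field

Span = tuple[int, int]  # character offsets into the primary text


@dataclass
class Layer:
    """One annotation layer over a primary text (hypothetical structure)."""
    name: str
    text: str
    # token span -> token-level annotations
    tokens: dict[Span, dict[str, str]] = field(default_factory=dict)
    # higher-level nodes: (label, covered token spans)
    nodes: list[tuple[str, list[Span]]] = field(default_factory=list)


def merge_layers(base: Layer, other: Layer) -> Layer:
    """Merge `other` into a copy of `base`, starting at the token level."""
    if base.text != other.text:
        raise ValueError("layers are not based on the same primary data")

    # Step 1: align tokens by character offsets and unite their annotations.
    tokens = {span: dict(annos) for span, annos in base.tokens.items()}
    for span, annos in other.tokens.items():
        if span not in tokens:
            raise ValueError(f"token {span} has no counterpart in {base.name}")
        for key, value in annos.items():
            # Prefix with the layer name so equally named keys do not collide.
            tokens[span][f"{other.name}:{key}"] = value

    # Step 2: move one level up; keep nodes whose tokens were all aligned.
    nodes = list(base.nodes)
    for label, covered in other.nodes:
        if all(span in tokens for span in covered):
            nodes.append((f"{other.name}:{label}", list(covered)))

    return Layer(f"{base.name}+{other.name}", base.text, tokens, nodes)


# Assumed sample data: a syntax layer and a part-of-speech layer
# that share the same primary text and tokenization.
text = "Pepper merges layers"
syntax = Layer(
    "syntax", text,
    {(0, 6): {}, (7, 13): {}, (14, 20): {}},
    [("NP", [(0, 6)]), ("VP", [(7, 13), (14, 20)])],
)
pos = Layer(
    "pos", text,
    {(0, 6): {"pos": "NE"}, (7, 13): {"pos": "VVFIN"}, (14, 20): {"pos": "NN"}},
)

merged = merge_layers(syntax, pos)
print(merged.tokens[(0, 6)])
print(merged.nodes)
```

This sketch assumes identical tokenization in both layers and rejects anything else; a
real merging module would also have to deal with differing tokenizations and with deeper
hierarchies of nodes.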

