Published March 6, 2014 | Version v1
Poster Open

Merging data, the essence of creation of multi-layer corpora

  • 1. Humboldt-Universität zu Berlin
  • 2. Universität Potsdam

Description

Recent years have seen a growing number of multi-layer corpora. Such corpora allow the
analysis of phenomena that span multiple annotation layers. For example, TüBa-D/Z
(see: http://www.sfs.uni-tuebingen.de/ascl/ressourcen/corpora/tueba-dz.html), PCC
(see: http://www.ling.uni-potsdam.de/acl-lab/Forsch/pcc/pcc.html), FALKO
(see: https://u.hu-berlin.de/falko) and many other corpora contain annotations on
syntactic, rhetorical, information-structural and other layers. These annotations were
often created manually or semi-automatically with different tools such as EXMARaLDA
(see: http://www.exmaralda.org/), TreeTagger
(see: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) and MMAX2
(see: http://mmax2.sourceforge.net/).
These tools are powerful and usable, but unfortunately they offer only minimal
interoperability, which impedes the creation of multi-layer corpora. As a result, the layers
of such corpora have often had to be merged by hand or by ad hoc scripts written for a
single use case, which could not easily be reused for other corpora.
With this poster we present a tool that merges several layers of annotation into a single
multi-layer corpus. In a multi-layer corpus, the individual analyses are based on the same
primary data and often also on the same tokenization.
The merging starts at the tokenization level and proceeds bottom-up to merge the higher
levels of annotation. This concept is implemented in a module for the converter
framework Pepper (see: https://u.hu-berlin.de/saltnpepper), using the common meta-model
Salt. Because it builds on Pepper, the merging module can handle every format that can be
imported by a Pepper module. The merged multi-layer corpora can then be mapped to
multi-layer formats such as PAULA (Chiarcos et al. 2008) or GrAF (Ide & Suderman 2007), or
imported into ANNIS.
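
The bottom-up idea can be illustrated with a small, self-contained Python sketch. It is not
the Pepper module and does not use the Salt API. The names Token, Layer, whitespace_tokens
and merge_layers are hypothetical and exist only for this illustration, and the sketch
assumes that both layers share the same primary text and an identical tokenization. Tokens
are aligned by their character offsets first, and higher-level annotations (here: spans over
tokens) are then re-attached to the aligned tokens.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    start: int  # character offset into the primary text (inclusive)
    end: int    # character offset into the primary text (exclusive)


@dataclass
class Layer:
    """One annotation layer over a primary text (hypothetical structure)."""
    text: str
    tokens: list[Token]
    # token-level annotations: token index -> {annotation name: value}
    token_annos: dict[int, dict[str, str]] = field(default_factory=dict)
    # span annotations: (first token index, last token index, name, value)
    spans: list[tuple[int, int, str, str]] = field(default_factory=list)


def whitespace_tokens(text: str) -> list[Token]:
    """Tokenize on whitespace and record character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        tokens.append(Token(start, pos))
    return tokens


def merge_layers(base: Layer, other: Layer) -> Layer:
    """Merge `other` into a copy of `base`, starting at the token level."""
    if base.text != other.text:
        raise ValueError("layers are not based on the same primary data")

    # Step 1: align tokens of `other` with tokens of `base` by character offsets.
    base_index = {(t.start, t.end): i for i, t in enumerate(base.tokens)}
    mapping = {}
    for j, tok in enumerate(other.tokens):
        key = (tok.start, tok.end)
        if key not in base_index:
            raise ValueError(f"token {j} has no counterpart in the base layer")
        mapping[j] = base_index[key]

    merged = Layer(
        text=base.text,
        tokens=list(base.tokens),
        token_annos={i: dict(a) for i, a in base.token_annos.items()},
        spans=list(base.spans),
    )

    # Step 2: carry over token-level annotations (values from `other` win on conflict).
    for j, annos in other.token_annos.items():
        merged.token_annos.setdefault(mapping[j], {}).update(annos)

    # Step 3: move up one level and re-attach spans to the aligned tokens.
    for first, last, name, value in other.spans:
        merged.spans.append((mapping[first], mapping[last], name, value))
    return merged


# Example: a part-of-speech layer and a syntactic layer over the same text.
text = "Peter sleeps"
pos_layer = Layer(text, whitespace_tokens(text),
                  token_annos={0: {"pos": "NE"}, 1: {"pos": "VVFIN"}})
syn_layer = Layer(text, whitespace_tokens(text), spans=[(0, 1, "cat", "S")])
merged = merge_layers(pos_layer, syn_layer)
```

In this sketch a tokenization mismatch is treated as an error. A real merging tool would
need a strategy for layers whose tokenizations differ, which is outside the scope of the
illustration.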

Files

DGfS2014_poster_zipser_et_al.pdf

Files (561.0 kB)

md5:718e40c8487fbdd390cf3400e5cbd8ad

Additional details

References

  • Dipper S. (2005). XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. In: Eckstein R., Tolksdorf R. (eds.) Berliner XML Tage.
  • Ide N. & Suderman K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. In: Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic.
  • Schmid H. (1995). Improvements in Part-of-Speech Tagging with an Application to German. In: Proceedings of the ACL SIGDAT-Workshop. Dublin, Ireland.
  • Stede M. (2004). The Potsdam Commentary Corpus. In: Webber B., Byron D. (eds.) Proceedings of the 2004 ACL Workshop on Discourse Annotation (DiscAnnotation '04). Association for Computational Linguistics, Stroudsburg, PA, USA, 96-102.
  • Telljohann H., Hinrichs E. W., Kübler S., Zinsmeister H., Beck K. (2009). Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Universität Tübingen, Seminar für Sprachwissenschaft.
  • Reznicek M., Lüdeling A., Krummes C., Schwantuschke F., Walter M., Schmidt K., Hirschmann H., Andreas T. (2012). Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 2.01.
  • Zeldes A., Ritz J., Lüdeling A., Chiarcos C. (2009). ANNIS: A Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009, July 20-23, Liverpool, UK.
  • Zipser F., Romary L. (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/
  • Zipser F., Zeldes A., Ritz J., Romary L., Leser U. (2011). Pepper: Handling a multiverse of formats. 33. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft. Göttingen, 23.-25. Februar 2011.