Poster Open Access

Pepper: Handling a multiverse of formats

Zipser, Florian; Zeldes, Amir; Ritz, Julia; Romary, Laurent; Leser, Ulf

With the rising importance of empirical data in many fields of linguistic research, we see an
increase not only in the amount of electronically available corpora, but also in the number of
tools used to make this data accessible, processable and searchable. Most of these tools have
been developed in the course of specific linguistic projects and can therefore handle only
certain kinds of linguistic information, such as syntactic structures (e.g. TIGERSearch, Lezius
2002) or dialogue structures (e.g. EXMARaLDA, Schmidt 2004). At the same time, each
tool uses its own, proprietary format for representing the text and its annotations. Such
formats are optimized for a specific kind of analysis and the performance of a specific
processing tool. Consequently, they cannot easily be mapped onto each other. This impedes
those linguistic research questions which presuppose a global view of the data, i.e., which
require the option to correlate, query and analyze several kinds of linguistic annotations at
We present Pepper, a modularized converter framework addressing the problem that a
linguistic researcher may be limited to a small set of questions due to the tool(s) he or she
uses. Pepper is based on the meta-model Salt (Zipser & Romary 2010) and offers the
possibility of converting data from n formats into m formats with a minimal number of
necessary mappings. The pluggable architecture of Pepper allows the injection of new formats
into the framework. Pepper has no restrictions on the underlying techniques used in
representing these formats (e.g. XML, tabular formats, bracketing formats or mixtures
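
The idea of routing every conversion through one common model can be illustrated with a minimal sketch. It assumes that "a minimal number of necessary mappings" means one mapping per format into or out of the common model, i.e. n + m mappings instead of n * m direct converters. All names here (Document, Converter, import_tabular, export_bracketed, mappings_needed) and both toy formats are hypothetical illustrations; they are not Pepper's API and Document is not the Salt meta-model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for a common model (NOT the real Salt model)."""
    tokens: List[str] = field(default_factory=list)
    # Annotation layer name -> one value per token.
    layers: Dict[str, List[str]] = field(default_factory=dict)


class Converter:
    """Pluggable registry: each format contributes one importer or exporter."""

    def __init__(self) -> None:
        self.importers: Dict[str, Callable[[str], Document]] = {}
        self.exporters: Dict[str, Callable[[Document], str]] = {}

    def convert(self, data: str, source: str, target: str) -> str:
        # Every conversion goes source -> common model -> target.
        return self.exporters[target](self.importers[source](data))


def import_tabular(data: str) -> Document:
    """Toy tabular format (assumed): one 'token<TAB>tag' pair per line."""
    doc = Document()
    doc.layers["pos"] = []
    for line in data.strip().splitlines():
        token, tag = line.split("\t")
        doc.tokens.append(token)
        doc.layers["pos"].append(tag)
    return doc


def export_bracketed(doc: Document) -> str:
    """Toy bracketing format (assumed): '(tag token)' per token."""
    pairs = zip(doc.tokens, doc.layers["pos"])
    return " ".join(f"({tag} {token})" for token, tag in pairs)


def mappings_needed(n: int, m: int) -> Tuple[int, int]:
    """Mappings via a common model vs. direct pairwise converters,
    assuming n source formats and m distinct target formats."""
    return n + m, n * m


converter = Converter()
converter.importers["tabular"] = import_tabular
converter.exporters["bracketed"] = export_bracketed

print(converter.convert("the\tDET\ndog\tNOUN", "tabular", "bracketed"))
# prints: (DET the) (NOUN dog)
print(mappings_needed(5, 4))
# prints: (9, 20)
```

In this sketch, supporting a new format means registering a single function, and every already registered format on the other side becomes reachable without further work.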

Files: Pepper_HandlingAMultiverseOfFormats_poster.pdf (384.5 kB, md5:f59810ce228150683d3700a4888f6bb0)
  • Dipper S. (2005). XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. In: Eckstein R., Tolksdorf R. (eds.) Berliner XML Tage.
  • Ide N. & Suderman K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. In: Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic.
  • Lezius W. (2002). Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. Ph.D. thesis, Stuttgart University.
  • Müller C. & Strube M. (2006). Multi-Level Annotation of Linguistic Data with MMAX2. In: Braun S. & Kohn K. & Mukherjee J. (eds.), Corpus Technology and Language Pedagogy. Frankfurt: Peter Lang, 197–214.
  • Pajas P. & Štěpánek J. (2008). Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: Proceedings of the 22nd International Conference on Computational Linguistics. Manchester, 673-680.
  • Schmidt T. (2004). Transcribing and Annotating Spoken Language with Exmaralda. In: Proc. of the LREC-Workshop on XML Based Richly Annotated Corpora, Lisbon 2004. Paris: ELRA.
  • Zipser F. & Romary L. (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta.

