Published February 24, 2011 | Version v1
Poster Open

Pepper: Handling a multiverse of formats

  • 1. Humboldt-Universität zu Berlin, INRIA
  • 2. Humboldt-Universität zu Berlin
  • 3. Universität Potsdam
  • 4. INRIA

Description

With the rising importance of empirical data in many fields of linguistic research, we see an
increase not only in the number of electronically available corpora, but also in the number of
tools used to make this data accessible, processable and searchable. Most of these tools have
been developed in the course of specific linguistic projects and can therefore handle only
certain kinds of linguistic information, such as syntactic structures (e.g. TIGERSearch, Lezius
2002) or dialogue structures (e.g. EXMARaLDA, Schmidt 2004). At the same time, each
tool uses its own proprietary format for representing the text and its annotations. Such
formats are optimized for a specific kind of analysis and for the performance of a specific
processing tool. Consequently, they cannot easily be mapped onto each other. This impedes
linguistic research questions that presuppose a global view of the data, i.e., that
require the option to correlate, query and analyze several kinds of linguistic annotations at
once.
We present Pepper, a modularized converter framework addressing the problem that a
linguistic researcher may be limited to a small set of questions by the tool(s) he or she
uses. Pepper is based on the meta-model Salt (Zipser & Romary 2010) and offers the
possibility of converting data from n formats into m formats with a minimal number of
necessary mappings. The pluggable architecture of Pepper allows new formats to be added
to the framework. Pepper places no restrictions on the underlying techniques used to
represent these formats (e.g. XML, tabular formats, bracketing formats or mixtures
thereof).
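
The idea of converting through a shared meta-model can be illustrated with a minimal sketch. It is not Pepper's actual API and does not reproduce the Salt model: `CommonDoc`, `ConverterRegistry`, and the two toy formats ("tabular" and "bracketed") are hypothetical names invented for this illustration. On one reading of "a minimal number of necessary mappings", each format needs only one mapping to or from the common model, so n source formats and m target formats require n + m mappings rather than n × m direct converters.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


# Hypothetical stand-in for a common meta-model (NOT the real Salt model):
# a primary text plus (start, end, layer, value) span annotations.
@dataclass
class CommonDoc:
    text: str
    annotations: List[Tuple[int, int, str, str]] = field(default_factory=list)


Importer = Callable[[str], CommonDoc]
Exporter = Callable[[CommonDoc], str]


class ConverterRegistry:
    """Hypothetical plug-in registry: formats are added by registering functions."""

    def __init__(self) -> None:
        self.importers: Dict[str, Importer] = {}
        self.exporters: Dict[str, Exporter] = {}

    def register_importer(self, fmt: str, fn: Importer) -> None:
        self.importers[fmt] = fn

    def register_exporter(self, fmt: str, fn: Exporter) -> None:
        self.exporters[fmt] = fn

    def convert(self, data: str, source: str, target: str) -> str:
        # Every conversion passes through the common model.
        return self.exporters[target](self.importers[source](data))

    def mapping_count(self) -> int:
        # One mapping per registered importer or exporter: n + m in total.
        return len(self.importers) + len(self.exporters)


def import_tabular(data: str) -> CommonDoc:
    """Toy tabular format: one 'token<TAB>tag' pair per line."""
    tokens, annotations, offset = [], [], 0
    for line in data.strip().splitlines():
        token, tag = line.split("\t")
        annotations.append((offset, offset + len(token), "pos", tag))
        tokens.append(token)
        offset += len(token) + 1  # +1 for the space that joins tokens
    return CommonDoc(" ".join(tokens), annotations)


def export_bracketed(doc: CommonDoc) -> str:
    """Toy bracketing format: '(TAG token)' for each part-of-speech span."""
    return " ".join(
        f"({value} {doc.text[start:end]})"
        for start, end, layer, value in doc.annotations
        if layer == "pos"
    )


registry = ConverterRegistry()
registry.register_importer("tabular", import_tabular)
registry.register_exporter("bracketed", export_bracketed)

result = registry.convert("the\tDET\ndog\tNOUN", "tabular", "bracketed")
print(result)  # (DET the) (NOUN dog)
```

In this sketch, supporting a further format means registering one more importer or exporter; the existing mappings stay untouched.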

Files

Pepper_HandlingAMultiverseOfFormats_poster.pdf

Files (384.5 kB)

md5:f59810ce228150683d3700a4888f6bb0
384.5 kB

Additional details

References

  • Dipper S. (2005). XML-based Stand-off Representation and Exploitation of Multi-Level
    Linguistic Annotation. In: Eckstein R. & Tolksdorf R. (eds.), Berliner XML Tage.
  • Ide N. & Suderman K. (2007). GrAF: A Graph-based Format for Linguistic Annotations.
    In: Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic.
  • Lezius W. (2002). Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. Ph.D. thesis,
    Stuttgart University.
  • Müller C. & Strube M. (2006). Multi-Level Annotation of Linguistic Data with MMAX2. In:
    Braun S., Kohn K. & Mukherjee J. (eds.), Corpus Technology and Language
    Pedagogy. Frankfurt: Peter Lang, 197–214.
  • Pajas P. & Štěpánek J. (2008). Recent Advances in a Feature-Rich Framework for Treebank
    Annotation. In: Proceedings of the 22nd International Conference on Computational
    Linguistics. Manchester, 673–680.
  • Schmidt T. (2004). Transcribing and Annotating Spoken Language with Exmaralda. In: Proc.
    of the LREC-Workshop on XML Based Richly Annotated Corpora, Lisbon 2004. Paris:
    ELRA.
  • Zipser F. & Romary L. (2010). A model oriented approach to the mapping of annotation
    formats using standards. In: Proceedings of the Workshop on Language Resource and
    Language Technology Standards, LREC 2010. Malta.