Pepper: Handling a multiverse of formats

doi:10.5281/zenodo.15638

Published February 24, 2011 | Version v1

Poster Open


1. Humboldt-Universität zu Berlin, INRIA
2. Humboldt-Universität zu Berlin
3. Universität Potsdam
4. INRIA

With the rising importance of empirical data in many fields of linguistic research, we see an
increase not only in the number of electronically available corpora, but also in the number of
tools used to make this data accessible, processable and searchable. Most of these tools were
developed in the course of specific linguistic projects and can therefore handle only
certain kinds of linguistic information, such as syntactic structures (e.g. TIGERSearch, Lezius
2002) or dialogue structures (e.g. EXMARaLDA, Schmidt 2004). At the same time, each
tool uses its own proprietary format for representing the text and its annotations. Such
formats are optimized for a specific kind of analysis and for the performance of a specific
processing tool; consequently, they cannot easily be mapped onto each other. This impedes
linguistic research questions that presuppose a global view of the data, i.e., that
require the option to correlate, query and analyze several kinds of linguistic annotation at
once.
We present Pepper, a modular converter framework that addresses the problem that a
linguistic researcher may be limited to a small set of questions by the tool(s) he or she
uses. Pepper is based on the meta-model Salt (Zipser & Romary 2010) and can convert
data from n formats into m formats with a minimal number of
necessary mappings. Its pluggable architecture allows new formats to be added
to the framework. Pepper places no restrictions on the underlying techniques used to
represent these formats (e.g. XML, tabular formats, bracketing formats or mixtures
thereof).
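
The idea of converting through a shared meta-model can be illustrated with a small sketch. It is not Pepper's or Salt's actual API: the names `Document`, `Registry`, `import_tabular`, `export_bracketed` and `mappings_needed` are hypothetical, and the two toy formats are invented for illustration. The sketch also assumes one plausible reading of "a minimal number of necessary mappings": with a common model, n importers plus m exporters are enough, whereas direct conversion between every pair would need n times m converters.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Toy stand-in for a common meta-model: tokens plus named annotation layers."""
    tokens: List[str] = field(default_factory=list)
    layers: Dict[str, List[str]] = field(default_factory=dict)


class Registry:
    """Hypothetical plug-in registry: one importer or exporter per format name."""

    def __init__(self) -> None:
        self.importers: Dict[str, Callable[[str], Document]] = {}
        self.exporters: Dict[str, Callable[[Document], str]] = {}

    def convert(self, text: str, source: str, target: str) -> str:
        # Every conversion passes through the common model,
        # never directly from one format to another.
        document = self.importers[source](text)
        return self.exporters[target](document)


def import_tabular(text: str) -> Document:
    """Read an invented tabular format: one 'token<TAB>tag' pair per line."""
    doc = Document(layers={"pos": []})
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token, tag = line.split("\t")
        doc.tokens.append(token)
        doc.layers["pos"].append(tag)
    return doc


def export_bracketed(doc: Document) -> str:
    """Write an invented bracketing format: '(tag token)' groups on one line."""
    pairs = zip(doc.tokens, doc.layers["pos"])
    return " ".join(f"({tag} {token})" for token, tag in pairs)


def mappings_needed(n: int, m: int) -> Tuple[int, int]:
    """Return (pairwise converters, converters via a common model)."""
    return n * m, n + m


registry = Registry()
registry.importers["tabular"] = import_tabular      # adding a format = one entry
registry.exporters["bracketed"] = export_bracketed

print(registry.convert("The\tDET\ndog\tNOUN\n", "tabular", "bracketed"))
print(mappings_needed(5, 4))
```

In this sketch, supporting a further format means registering one more importer or exporter; none of the existing mappings has to change.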

Files

Name	Size	MD5
Pepper_HandlingAMultiverseOfFormats_poster.pdf	384.5 kB	f59810ce228150683d3700a4888f6bb0

Additional details

Dipper S. (2005). XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. In: Eckstein R. & Tolksdorf R. (eds.), Berliner XML Tage.
Ide N. & Suderman K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. In: Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic.
Lezius W. (2002). Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. Ph.D. thesis, Stuttgart University.
Müller C. & Strube M. (2006). Multi-Level Annotation of Linguistic Data with MMAX2. In: Braun S., Kohn K. & Mukherjee J. (eds.), Corpus Technology and Language Pedagogy. Frankfurt: Peter Lang, 197–214.
Pajas P. & Štěpánek J. (2008). Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: Proceedings of the 22nd International Conference on Computational Linguistics. Manchester, 673–680.
Schmidt T. (2004). Transcribing and Annotating Spoken Language with Exmaralda. In: Proceedings of the LREC-Workshop on XML Based Richly Annotated Corpora, Lisbon 2004. Paris: ELRA.
Zipser F. & Romary L. (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta.

