Published April 7, 2015 | Version v1
Poster Open

From TEI to linguistic corpora using Pepper

  • 1. Humboldt-Universität zu Berlin

Description

The linguistic analysis of historical texts has an impact on several fields, such as linguistics, historical
sciences, literature and philology. All these fields can benefit from the reuse of data created in the
other fields. Unfortunately, there exist some technical incompatibilities between them. Many
historical texts from philological contexts are first digitised in the TEI format 1 (Text Encoding
Initiative). In addition to storing plain text, TEI allows for the annotation of many kinds of metadata, e.g.
authors, editors, codicological and paleographic information and simple linguistic annotations like
word forms or morphology. However, TEI was not designed to handle a wide range of linguistic
annotations 2 . More prevalent formats in linguistics are, for instance, the EXMARaLDA format 3 ,
TCF 4 or the ANNIS format 5 , which are processed by manual and (semi-) automatic annotation or
search tools.
On this poster we present Pepper 6 , a tool capable of overcoming these technical incompatibilities.
Pepper is a pluggable framework for the conversion of linguistic data from one format into another.
It utilises the linguistic meta model Salt 7 as an intermediate model. This means that, to convert data from
a format A to a format B, the data are first mapped from A to Salt and then from Salt to B.
For n formats, this approach reduces the number of mappings from n² - n (for direct mappings) to 2n. Pepper's
pluggable architecture allows the implementation of plugins for further formats, which can
consequently be converted into each of the existing formats. In addition to many plugins for other
formats, Pepper now also contains a plugin to support the TEI format. Due to the use of the
intermediate model Salt, Pepper can convert TEI-encoded data into all formats Pepper currently
supports. On this poster we illustrate an example workflow for converting historical texts in TEI
coming from the Ridges corpus 8 to a set of linguistic formats like the EXMARaLDA format, TCF
or the ANNIS format. Pepper enables us to linguistically annotate (in a manual or automatic
manner) and search corpora that originally come from non-linguistic fields.
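
The intermediate-model idea described above can be illustrated with a minimal sketch. It is not Pepper's or Salt's actual API: the class names `PivotDocument` and `PivotConverter`, the functions `direct_mappings` and `pivot_mappings`, and the two toy formats "spaces" and "lines" are all hypothetical and chosen only for illustration. Each format contributes one importer (format to pivot) and one exporter (pivot to format), so n formats need 2n mappings instead of n² - n direct ones.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PivotDocument:
    """Hypothetical stand-in for an intermediate model (not Salt itself)."""
    tokens: List[str] = field(default_factory=list)


Importer = Callable[[str], PivotDocument]   # format -> pivot
Exporter = Callable[[PivotDocument], str]   # pivot -> format


class PivotConverter:
    """Registry of per-format plugins; every conversion goes through the pivot."""

    def __init__(self) -> None:
        self.importers: Dict[str, Importer] = {}
        self.exporters: Dict[str, Exporter] = {}

    def register(self, fmt: str, importer: Importer, exporter: Exporter) -> None:
        # One importer and one exporter per format: 2 mappings per format.
        self.importers[fmt] = importer
        self.exporters[fmt] = exporter

    def convert(self, data: str, source: str, target: str) -> str:
        # Step 1: map the source format to the pivot.
        # Step 2: map the pivot to the target format.
        return self.exporters[target](self.importers[source](data))


def direct_mappings(n: int) -> int:
    """Ordered pairs of distinct formats: n^2 - n direct converters."""
    return n * n - n


def pivot_mappings(n: int) -> int:
    """One importer plus one exporter per format: 2n mappings."""
    return 2 * n


converter = PivotConverter()
# Toy formats (assumptions): tokens separated by spaces or by newlines.
converter.register(
    "spaces",
    lambda s: PivotDocument(s.split(" ")),
    lambda d: " ".join(d.tokens),
)
converter.register(
    "lines",
    lambda s: PivotDocument(s.split("\n")),
    lambda d: "\n".join(d.tokens),
)

if __name__ == "__main__":
    print(converter.convert("in principio erat", "spaces", "lines"))
    for n in (3, 5, 10):
        print(n, direct_mappings(n), pivot_mappings(n))
```

Registering a further format in this sketch makes it convertible to and from every format already registered, which mirrors the plugin argument made above. Note that the two counts are equal for n = 3, so the saving only appears with four or more formats.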

Files

dgfs2015_ZipserKlotzRoehrig.pdf (715.0 kB)
md5:bd2c6a783eb76c0723c9c54713e774a6