Published December 4, 2014 | Version v1
Publication | Open Access

The definition of tokens in relation to words and annotation tasks

  • Universität Hamburg

Description

Tokens are the basic units of annotation. When working with corpora of non-standardized texts, tokenization is often problematic, as the usage of whitespace can vary. We show examples of how decisions in the tokenization process can influence an annotation and argue that the principles underlying the tokenization should be grounded in theoretical concepts selected on the basis of the annotation task. We present a corpus of Early New High German texts in which the annotation layers reference two different concepts of words: syntactic words and graphematic words. Consequently, we use two kinds of tokens: graphic tokens and syntactic tokens.
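The distinction between the two token layers can be sketched in code. The following is a minimal illustration, not the authors' implementation: the example string and the merge-span rule format are invented, and real Early New High German data would need far richer alignment. A graphic token is taken to be a maximal non-whitespace character sequence, while a syntactic token may span several graphic tokens (e.g. a verb particle spelled apart from its verb).

```python
# Illustrative sketch of two token layers over one transcription.
# The sample text and merge rules below are hypothetical, not from the corpus.

def graphic_tokens(transcription: str) -> list[str]:
    """Graphic tokens: split on whitespace, one token per graphic word."""
    return transcription.split()

def syntactic_tokens(tokens: list[str],
                     merge_spans: list[tuple[int, int]]) -> list[str]:
    """Join the graphic tokens whose inclusive index ranges appear in
    merge_spans (a hypothetical rule format); keep the rest unchanged."""
    spans = {start: end for start, end in merge_spans}
    out, i = [], 0
    while i < len(tokens):
        if i in spans:
            end = spans[i]
            out.append("".join(tokens[i:end + 1]))  # one syntactic word
            i = end + 1
        else:
            out.append(tokens[i])
            i += 1
    return out

# Invented example: one syntactic word spelled apart as three graphic words.
text = "er ist hin auff gegangen"
g = graphic_tokens(text)              # 5 graphic tokens
s = syntactic_tokens(g, [(2, 4)])     # 3 syntactic tokens
print(g)  # ['er', 'ist', 'hin', 'auff', 'gegangen']
print(s)  # ['er', 'ist', 'hinauffgegangen']
```

The point of keeping both layers, as the abstract argues, is that neither segmentation is derivable from the other: annotation layers that target syntactic words reference syntactic tokens, while layers describing the written form reference graphic tokens.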

Files

258_tlt13-proceedings.pdf

Size: 115.6 kB
md5: 73a022a7cb557a11a50580c131422248