Published December 4, 2014 | Version v1 | Publication | Open
The definition of tokens in relation to words and annotation tasks
Description
Tokens are the basic units of annotation. When working with corpora of non-standardized texts, tokenization is often problematic because the use of whitespace can vary. We show examples of how decisions made during tokenization can influence an annotation, and we argue that the principles underlying tokenization should be grounded in theoretical concepts chosen on the basis of the annotation task. We present a corpus of Early New High German texts in which the annotation layers reference two different concepts of words, syntactic words and graphematic words, and we consequently use two kinds of tokens: graphic tokens and syntactic tokens.
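The distinction between the two token layers can be sketched in code. The following is an illustrative sketch only, not the paper's scheme: it treats graphic tokens as whitespace-delimited units and shows how a syntactic layer might split one graphic token into several syntactic words (e.g. a fused form like "wiltu" for "wilt du") or merge several graphic tokens into one. The example string, the `splits`/`merges` parameters, and both function names are invented for illustration.

```python
def graphic_tokens(line):
    """Graphic tokens: whitespace-delimited units as written in the text."""
    return line.split()

def syntactic_tokens(graphic, splits=None, merges=None):
    """Map graphic tokens onto syntactic words (hypothetical helper).

    `splits` maps one graphic token to several syntactic words;
    `merges` lists runs of graphic tokens that form one syntactic word.
    """
    splits = splits or {}
    merges = merges or []
    tokens = []
    i = 0
    while i < len(graphic):
        # Check whether a merge sequence starts at position i.
        merged = next(
            (m for m in merges if graphic[i:i + len(m)] == list(m)), None
        )
        if merged:
            tokens.append("".join(merged))
            i += len(merged)
        elif graphic[i] in splits:
            tokens.extend(splits[graphic[i]])
            i += 1
        else:
            tokens.append(graphic[i])
            i += 1
    return tokens

# Invented example line, not taken from the corpus:
g = graphic_tokens("zuo gesagt wiltu")   # ['zuo', 'gesagt', 'wiltu']
s = syntactic_tokens(
    g,
    splits={"wiltu": ["wilt", "du"]},    # one graphic, two syntactic words
    merges=[("zuo", "gesagt")],          # two graphic, one syntactic word
)
# s == ['zuogesagt', 'wilt', 'du']
```

The point of the sketch is that the two layers need not align one-to-one, which is why annotations referencing graphematic words and annotations referencing syntactic words require separate token inventories.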
Files
258_tlt13-proceedings.pdf (115.6 kB)
md5:73a022a7cb557a11a50580c131422248