Conference paper Open Access

Reliable Part-of-Speech Tagging of Low-frequency Phenomena in the Social Media Domain

Horsmann, Tobias; Beißwenger, Michael; Zesch, Torsten

We present a series of experiments on adapting a part-of-speech (PoS) tagger to extremely infrequent PoS tags for which only a limited amount of training data is available. The objective is a tagger that assigns such a tag with high accuracy, so that it can serve as a corpus query tool for finding new instances of the phenomenon in plain-text corpora. We focused on avoiding manual annotation as much as possible and experimented with altering the frequency weight of the PoS tag of interest in our small training data set. This approach was compared to adding machine-tagged training data in which only the phenomenon of interest is manually corrected. We find that adding more training data is unavoidable, but that machine-tagging the data and hand-correcting only the tag of interest is sufficient. Furthermore, the choice of tagger plays an important role, as some taggers handle rare phenomena more adequately than others. The best trade-off between precision and recall for the phenomenon of interest was achieved by separating the tagging into two steps. An evaluation of this phenomenon-fitted tagger on plain-text social media data confirmed that it serves as a useful corpus query tool that retrieves instances of the phenomenon, including many unseen ones.
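
The frequency-weighting idea and the precision/recall measure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tag label `RARE`, and the duplication factor are hypothetical, and repeating whole sentences is only one assumed way of raising a tag's weight in the training data.

```python
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def oversample_rare_tag(sentences: List[Sentence], rare_tag: str, factor: int) -> List[Sentence]:
    """Repeat every sentence that contains `rare_tag` `factor` times.

    Assumption: duplicating sentences is used here as a simple stand-in
    for raising the frequency weight of the tag of interest.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    weighted: List[Sentence] = []
    for sentence in sentences:
        has_rare = any(tag == rare_tag for _, tag in sentence)
        weighted.extend([sentence] * (factor if has_rare else 1))
    return weighted


def precision_recall(gold: List[str], predicted: List[str], tag: str) -> Tuple[float, float]:
    """Precision and recall for a single tag over aligned tag sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == tag and p == tag)
    fp = sum(1 for g, p in pairs if g != tag and p == tag)
    fn = sum(1 for g, p in pairs if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data; the tag names are placeholders, not the paper's tagset.
train = [
    [("a", "X"), ("b", "RARE")],
    [("c", "X")],
]
weighted_train = oversample_rare_tag(train, "RARE", factor=3)
p, r = precision_recall(["RARE", "X", "RARE", "X"], ["RARE", "RARE", "X", "X"], "RARE")
```

In this toy run the sentence containing `RARE` appears three times in `weighted_train`, and the tagger output scores a precision and recall of 0.5 each for `RARE`. Any real tagger would be trained on the weighted data in place of the original set.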
