Published September 30, 2017 | Version v1
Conference paper Open

Reliable Part-of-Speech Tagging of Low-frequency Phenomena in the Social Media Domain

  • 1. University of Duisburg-Essen, Germany

Description

We present a series of experiments on adapting a part-of-speech (PoS) tagger to extremely infrequent PoS tags for which only a limited amount of training data is available. The objective is a tagger that tags such a phenomenon accurately enough to serve as a corpus query tool, so that new instances can easily be found in plain-text corpora. We focused on avoiding manual annotation as much as possible and experimented with altering the frequency weight of the PoS tag of interest in our small training data set. This approach was compared to adding machine-tagged training data in which only the phenomenon of interest is manually corrected. We find that adding more training data is unavoidable, but that machine-tagging the data and hand-correcting only the tag of interest is sufficient. Furthermore, the choice of tagger plays an important role, as some taggers are better equipped to deal with rare phenomena than others. The best trade-off between precision and recall for the phenomenon of interest was achieved by separating the tagging into two steps. An evaluation of this phenomenon-fitted tagger on social media plain text confirmed that it serves as a useful corpus query tool that retrieves instances of the phenomenon, including many unseen ones.
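As a rough illustration of what altering the frequency weight of a tag in a small training set could look like, the sketch below repeats every training sentence that contains the tag of interest. The description does not say how the weighting was actually done, so sentence duplication is an assumption here, and the function names and the tag label `RARE` are hypothetical.

```python
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def upweight_tag(corpus: List[Sentence], target_tag: str, factor: int) -> List[Sentence]:
    """Return a new corpus in which every sentence containing
    target_tag appears `factor` times; other sentences appear once.

    Assumption: duplication stands in for "frequency weighting";
    the original work may have used a different mechanism.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    boosted: List[Sentence] = []
    for sentence in corpus:
        has_target = any(tag == target_tag for _, tag in sentence)
        boosted.extend([sentence] * (factor if has_target else 1))
    return boosted


def tag_count(corpus: List[Sentence], tag: str) -> int:
    """Count how many tokens in the corpus carry the given tag."""
    return sum(1 for sentence in corpus for _, t in sentence if t == tag)


# Toy data; "RARE" is a placeholder label, not a tag from the paper.
corpus = [
    [("lol", "RARE"), ("that", "PRON"), ("works", "VERB")],
    [("this", "PRON"), ("works", "VERB")],
]
boosted = upweight_tag(corpus, "RARE", 3)
```

Note that duplicating whole sentences also raises the counts of the other tags in those sentences, which is one reason such weighting can shift the balance between precision and recall.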

Files

cmccorpora17-4.pdf (96.3 kB)
md5:5b2ffa534454f3b287ab3a7dea1a7fb8

Additional details

Related works

Is part of
Conference proceeding: 10.5281/zenodo.1040713 (DOI)