Reliable Part-of-Speech Tagging of Low-frequency Phenomena in the Social Media Domain

doi:10.5281/zenodo.1041881

Published September 30, 2017 | Version v1

Conference paper Open

1. University of Duisburg-Essen, Germany

We present a series of experiments on adapting a part-of-speech (PoS) tagger to extremely infrequent PoS tags for which only a limited amount of training data is available. The objective is a tagger that tags such a phenomenon with a high degree of correctness, so that it can be used as a corpus query tool for finding new instances of the phenomenon in plain-text corpora. We focused on avoiding manual annotation as much as possible and experimented with altering the frequency weight of the PoS tag of interest in our small training data set. This approach was compared to adding machine-tagged training data in which only the phenomenon of interest is manually corrected. We find that adding more training data is unavoidable, but that machine-tagging the data and hand-correcting only the tag of interest is sufficient. Furthermore, the choice of tagger plays an important role, as some taggers are better equipped to deal with rare phenomena than others. The best trade-off between precision and recall for the phenomenon of interest was achieved by separating the tagging into two steps. An evaluation of this phenomenon-fitted tagger on social media plain text confirmed that it serves as a useful corpus query tool that retrieves instances of the phenomenon, including many unseen ones.
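As a rough illustration of two ideas named in the abstract, the Python sketch below shows one possible way to raise the frequency weight of a tag of interest and to measure precision and recall for that tag alone. The abstract does not say how the weighting was implemented; duplicating sentences that contain the tag is an assumption made here, and the function names `upweight_tag` and `precision_recall` are hypothetical.

```python
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def upweight_tag(corpus: List[Sentence], tag: str, factor: int) -> List[Sentence]:
    """Repeat every sentence containing `tag` so it occurs `factor` times.

    Assumption: sentence duplication stands in for "altering the
    frequency weight"; the paper may have used a different mechanism.
    """
    weighted: List[Sentence] = []
    for sentence in corpus:
        has_tag = any(t == tag for _, t in sentence)
        weighted.extend([sentence] * (factor if has_tag else 1))
    return weighted


def precision_recall(gold: List[str], pred: List[str], tag: str) -> Tuple[float, float]:
    """Precision and recall for a single tag over aligned tag sequences."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == tag and p == tag)
    fp = sum(1 for g, p in pairs if g != tag and p == tag)
    fn = sum(1 for g, p in pairs if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Evaluating only the tag of interest, rather than overall tagging accuracy, reflects the stated goal of using the tagger as a query tool for a single rare phenomenon.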

Files (96.3 kB)

Name	Size
cmccorpora17-4.pdf md5:5b2ffa534454f3b287ab3a7dea1a7fb8	96.3 kB

Additional details

Is part of: Conference proceeding: 10.5281/zenodo.1040713 (DOI)

	All versions	This version
Views	125	125
Downloads	76	76
Data volume	8.3 MB	8.3 MB

Resource type

Conference paper

Publisher

cmc-corpora conference series

Imprint

Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17).

Conference

5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17), Bolzano, Italy, 3-4 October 2017

Languages

English

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: November 4, 2017
Modified: August 3, 2024
