Published January 1, 2023 | Version v1
Book chapter Open

ERIS: issues in using HurtLex and Best-Worst Scaling annotation to develop a lexical resource for Modern Greek offensive language detection

  • 1. Computer Science Department, University of Turin
  • 2. Athena-Research and Innovation Center in Information, Communication and Knowledge Technologies

Description

ERIS, a lexical resource of Modern Greek (hereinafter: EL) for offensive language (OL) detection, is the result of cleansing, enriching and assigning graded offensiveness values to the EL branch of HurtLex (hereinafter: HurtLex(EL)) (Bassignana et al., 2018). We present how ERIS was developed and discuss the experience gained from the selected annotation procedure.
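As an illustration of how Best-Worst Scaling annotations can be turned into graded values, the following is a minimal sketch of the commonly used counting procedure (share of "best" choices minus share of "worst" choices per item). It is not taken from the chapter: the function name `bws_scores`, the tuple layout and the placeholder terms are assumptions made for this example, and the chapter's actual scoring procedure may differ.

```python
from collections import Counter


def bws_scores(annotations):
    """Compute counting-based Best-Worst Scaling scores.

    `annotations` is an iterable of (items, best, worst) tuples, where
    `items` is the tuple of terms shown to an annotator and `best` /
    `worst` are the terms picked as most / least offensive.
    This layout is a hypothetical one chosen for the sketch.
    """
    appearances = Counter()
    best_counts = Counter()
    worst_counts = Counter()
    for items, best, worst in annotations:
        appearances.update(items)  # each term in the tuple was seen once
        best_counts[best] += 1
        worst_counts[worst] += 1
    # Score lies in [-1, 1]: +1 if always chosen as most offensive,
    # -1 if always chosen as least offensive.
    return {
        term: (best_counts[term] - worst_counts[term]) / appearances[term]
        for term in appearances
    }


# Placeholder annotations (not ERIS data).
annotations = [
    (("term_a", "term_b", "term_c", "term_d"), "term_a", "term_d"),
    (("term_a", "term_b", "term_c", "term_d"), "term_a", "term_c"),
    (("term_a", "term_b", "term_c", "term_d"), "term_b", "term_d"),
]
scores = bws_scores(annotations)
```

In this toy input `term_a` is picked as most offensive in two of its three appearances and never as least offensive, so it receives the highest score of the four terms.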

HurtLex is a domain-independent lexicon of offensive, aggressive and hateful words in 53 languages that aims, among other things, to support the development of resources for underrepresented languages (Bassignana et al., 2018:5). Its kernel consists of ∼1000 manually selected words, corresponding to 17 non-mutually exclusive thematic categories, that were enriched semi-automatically by drawing on MultiWordNet synsets and BabelNet. In HurtLex each lemma-sense pair is classified as “non-offensive”, “neutral” or “offensive”.
We use the term OL without distinguishing between offensive and hate language, both because a line between them is difficult to draw and because terms in the two domains are used interchangeably (Davidson et al., 2017; Waseem et al., 2017). Although the phenomena of offensive and hateful speech are related but not completely overlapping (Poletto et al., 2021), lexicons of offensive terms have been successfully employed to boost the performance of hate speech classifiers (Koufakou et al., 2020).

The literature on EL OL detection does not provide annotated corpora representing several registers, sizeable OL lexica, or annotation methods and guidelines. The Offensive Greek Tweet Dataset (OGTD) (Pitenis et al., 2020), which contains tweets marked as “offensive”, “not offensive” or “spam”, was extracted with an unpublished list of profane or obscene keywords. Work on racism draws on a published annotated dataset containing 4004 tweets (Perifanos and Goutsos, 2021), and work on terrorist argument (Lekea and Karampelas, 2018) draws on an unpublished list of 1265 words. Efthymiou et al. (2014) and Christopoulou (2012), among others, discuss lexicographic issues concerning OL but have not published sizeable annotated lexical resources. We have worked on lexical resources because many studies on OL detection use vocabularies (Chen et al., 2012; Njagi et al., 2015). There are strong indications that keyword- and lexicon-based approaches perform better when annotated corpora are scarce (Sazzed, 2021). Furthermore, lexica can be used to leverage corpora (Plaza-del Arco et al., 2022).
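To make the idea of a lexicon-based approach concrete, the following is a minimal sketch of scoring a text against a lexicon with graded offensiveness values. It is an assumption-laden illustration, not the method of any work cited above: the function `score_text`, the lexicon entries and their values are hypothetical, and the sketch matches surface tokens only, whereas a real system for Modern Greek would need lemmatization and accent normalization.

```python
import re


def score_text(text, lexicon):
    """Return (max_score, hits) for a text against a graded lexicon.

    `lexicon` maps lowercase word forms to offensiveness values.
    Taking the maximum over matched terms is one simple aggregation
    choice among many; it is an assumption of this sketch.
    """
    # \w is Unicode-aware in Python 3, so Greek letters are matched too.
    tokens = re.findall(r"\w+", text.lower())
    hits = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    max_score = max((value for _, value in hits), default=0.0)
    return max_score, hits


# Placeholder lexicon with made-up entries and values (not ERIS data).
lexicon = {"badword": 0.9, "mildword": 0.3}

score, hits = score_text("This has a BadWord and a mildword.", lexicon)
clean_score, clean_hits = score_text("Nothing to see here.", lexicon)
```

A threshold on the returned score could then be used to flag a text, with the threshold tuned on whatever annotated data is available.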

Files

ERIS_chapter.pdf (135.6 kB)
md5:ad489c4ff3703b1586f6fd77892e2746