Cross-lingual embeddings for hate speech detection in comments
Description
With the recent explosion of social media content, the volume of online hate speech is growing, making manual filtering infeasible. Automatic hate speech detection requires large amounts of annotated data, which are mostly available only for high-resource languages. To detect hate speech in low-resource languages despite this scarcity, we use cross-lingual embeddings, which let us reach acceptable detection performance in a target language using data from another language. We use hate speech comments in English, German, and Croatian. With fastText word embeddings aligned using the RCSLS method, we achieve reasonable performance in 2 out of 6 language pairs. With Multilingual BERT, we improve on this method and achieve acceptable performance in 3 out of 6 language pairs.
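The general idea of transferring a classifier through aligned embedding spaces can be illustrated with a small, self-contained sketch. It is not the method used in this work: the vectors are synthetic rather than fastText embeddings, the alignment is plain orthogonal Procrustes rather than RCSLS, and the nearest-centroid classifier, the toy vocabulary, and all names (`procrustes`, `make_comments`, `seed_idx`) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_words = 8, 200

# Synthetic "source-language" word vectors (assumption: not real fastText data).
# The first half of the vocabulary leans abusive (label 1), the second half neutral (label 0).
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels_w = np.array([1] * (n_words // 2) + [0] * (n_words // 2))
src_vecs = rng.normal(size=(n_words, d)) + 2.0 * np.outer(2 * labels_w - 1, direction)

# Synthetic "target-language" space: the same words under a hidden orthogonal map plus noise.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
tgt_vecs = src_vecs @ q + 0.01 * rng.normal(size=(n_words, d))


def procrustes(tgt, src):
    """Orthogonal W minimising ||tgt @ W - src||_F (Procrustes, used here instead of RCSLS)."""
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt


# Learn the mapping from a small seed dictionary of translation pairs.
seed_idx = rng.choice(n_words, size=50, replace=False)
w = procrustes(tgt_vecs[seed_idx], src_vecs[seed_idx])
aligned_tgt = tgt_vecs @ w


def make_comments(vecs, n_comments, rng):
    """Build toy comments as the mean vector of six words drawn from one class."""
    y = rng.integers(0, 2, size=n_comments)
    feats = np.empty((n_comments, vecs.shape[1]))
    for i, label in enumerate(y):
        pool = np.flatnonzero(labels_w == label)
        feats[i] = vecs[rng.choice(pool, size=6)].mean(axis=0)
    return feats, y


# Fit on source-language comments only, evaluate on aligned target-language comments.
x_train, y_train = make_comments(src_vecs, 400, rng)
x_test, y_test = make_comments(aligned_tgt, 200, rng)

centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(x_test[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = (pred == y_test).mean()
alignment_error = np.linalg.norm(aligned_tgt - src_vecs) / np.linalg.norm(src_vecs)
print(f"alignment error: {alignment_error:.3f}, target accuracy: {accuracy:.2f}")
```

Because the toy target space is an exact rotation of the source space, transfer here is far easier than with real languages. With real embeddings the mapping is only approximate, which is consistent with the uneven results across language pairs reported above.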
Files
Marinšek_Rok_-_Uporaba_medjezičnih_vektorskih_vložitev_za_odkrivanje_sovražnega_govora_v_komenta.pdf
664.1 kB
md5:c1732a35261b1202151d9619d7177cb2