Published November 15, 2019 | Version v1
Thesis Open

Cross-lingual embeddings for hate speech detection in comments

  • 1. University of Ljubljana, Ljubljana, Slovenia

Contributors

  • 1. University of Ljubljana, Ljubljana, Slovenia

Description

 

With the recent explosion of social media content, the amount of online hate speech is increasing, making manual filtering impossible. Automatic hate speech detection requires large amounts of annotated data, which is mostly available only for high-resource languages. We aim to detect hate speech in low-resource languages despite this data scarcity. To that end, we use cross-lingual embeddings to achieve acceptable hate speech detection performance in a target language using data from another language. Our data consists of hate speech comments in English, German, and Croatian. We align fastText word embeddings with the RCSLS method and achieve reasonable performance in 2 out of 6 language pairs. Multilingual BERT improves on this approach, achieving acceptable performance in 3 out of 6 language pairs.
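The idea of mapping word embeddings from two languages into a shared space can be illustrated with a minimal sketch. Note that this is not the RCSLS method used in the thesis: it substitutes the simpler orthogonal Procrustes alignment, runs on random toy vectors instead of fastText embeddings, and all names (`procrustes_align`, `src_vecs`, `tgt_vecs`) are hypothetical. It also assumes that the 6 language pairs are the ordered source-to-target combinations of the three languages.

```python
import itertools

import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    Rows of src and tgt are embeddings of dictionary word pairs
    (row i of src translates to row i of tgt).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Assumption: 6 language pairs = ordered (source, target) combinations.
languages = ["en", "de", "hr"]
pairs = list(itertools.permutations(languages, 2))

# Toy data standing in for real embeddings: the "source" space is the
# "target" space under an unknown rotation.
rng = np.random.default_rng(0)
dim, n_words = 8, 50
tgt_vecs = rng.normal(size=(n_words, dim))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_vecs = tgt_vecs @ hidden_rotation.T

# Learn the mapping from the dictionary and move source vectors
# into the target space.
W = procrustes_align(src_vecs, tgt_vecs)
aligned = src_vecs @ W
```

Once both languages share a space, a classifier trained on comment representations from the source language can in principle be applied to aligned representations of target-language comments, which is the transfer setting the abstract describes.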

Files

Marinšek_Rok_-_Uporaba_medjezičnih_vektorskih_vložitev_za_odkrivanje_sovražnega_govora_v_komenta.pdf

Additional details

Funding

EMBEDDIA – Cross-Lingual Embeddings for Less-Represented Languages in European News Media 825153
European Commission