Cross-lingual embeddings for hate speech detection in comments

doi:10.5281/zenodo.3894645

Published November 15, 2019 | Version v1

Thesis | Open access

Marinšek, Rok¹

1. University of Ljubljana, Ljubljana, Slovenia

Contributors

Supervisor:

Robnik-Šikonja, Marko¹

1. University of Ljubljana, Ljubljana, Slovenia

With the recent explosion of social media content, the amount of online hate speech is increasing, and filtering it manually has become infeasible. Automatic hate speech detection requires large amounts of annotated data, which is mostly available only for high-resource languages. We nevertheless want to detect hate speech in low-resource languages, where such data is scarce. We use cross-lingual embeddings to achieve acceptable hate speech detection performance in a target language using data from another language. Our experiments use hate speech comments in English, German, and Croatian. With fastText word embeddings aligned by the RCSLS method, we achieve reasonable performance in 2 out of 6 language pairs. With Multilingual BERT, we improve on this approach and achieve acceptable performance in 3 out of 6 language pairs.
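As a rough illustration of the alignment step mentioned above: RCSLS learns a mapping between two embedding spaces by optimizing a relaxed form of the CSLS (cross-domain similarity local scaling) retrieval score. The sketch below computes only that CSLS score, in plain Python, for source vectors assumed to be already mapped into the target space. It is not the thesis code; the function names, the toy vectors, and the choice of k = 2 are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_top_k(sims, k):
    """Mean of the k largest similarities (the local-scaling term)."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)

def csls_matrix(src, tgt, k=2):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y) for every source/target pair."""
    cos = [[cosine(x, y) for y in tgt] for x in src]
    # r_T(x): mean similarity of each source vector to its k nearest targets
    r_src = [mean_top_k(row, k) for row in cos]
    # r_S(y): mean similarity of each target vector to its k nearest sources
    r_tgt = [mean_top_k([cos[i][j] for i in range(len(src))], k)
             for j in range(len(tgt))]
    return [[2 * cos[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt))]
            for i in range(len(src))]

def nearest_by_csls(src, tgt, k=2):
    """Index of the best-scoring target vector for each source vector."""
    scores = csls_matrix(src, tgt, k)
    return [max(range(len(tgt)), key=lambda j: row[j]) for row in scores]

# Hypothetical toy vectors: three "mapped" source words, three target words.
mapped_src = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.0]]
tgt = [[0.9, 0.2], [0.0, 1.0], [-0.9, -0.1]]
matches = nearest_by_csls(mapped_src, tgt)
print(matches)  # [0, 1, 2]
```

Subtracting the two neighbourhood terms penalizes vectors that are close to everything, which is why CSLS is generally preferred over plain cosine retrieval when matching words across languages.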

Files

Name	Size	Checksum
Marinšek_Rok_-_Uporaba_medjezičnih_vektorskih_vložitev_za_odkrivanje_sovražnega_govora_v_komenta.pdf	664.1 kB	md5:c1732a35261b1202151d9619d7177cb2

Additional details

Funding: European Commission, grant 825153, EMBEDDIA – Cross-Lingual Embeddings for Less-Represented Languages in European News Media

	All versions	This version
Views	108	108
Downloads	58	58
Data volume	39.8 MB	39.8 MB
