Detecting East Asian Prejudice on Social Media

Bertie Vidgen; Austin Botelho; David Broniatowski; Ella Guest; Matthew Hall; Helen Margetts; Rebekah Tromble; Zeerak Waseem; Scott Hale

doi:10.5281/zenodo.3816667

Published May 8, 2020 | Version v1

Dataset Open

1. The Alan Turing Institute & the Oxford Internet Institute
2. The Oxford Internet Institute
3. The George Washington University
4. The Alan Turing Institute & The University of Manchester
5. The University of Surrey & The Alan Turing Institute
6. The Alan Turing Institute & The Oxford Internet Institute
7. The George Washington University & The Alan Turing Institute
8. University of Sheffield

This repository contains:

A deep learning model which distinguishes between Hostility against East Asia, Criticism of East Asia, Discussion of East Asian prejudice and Neutral content. The F1 score is 0.83.
A detailed annotation codebook used for marking up the tweets.
A labelled dataset with 20,000 entries.
A dataset with all 40,000 annotations, which can be used to investigate annotation processes for abusive content moderation.
A list of thematic hashtag replacements.
Three sets of annotations for the 1,000 most used hashtags in the original database of COVID-19 related tweets. Hashtags were annotated for COVID-19 relevance, East Asian relevance and stance.
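As a rough illustration of how the 40,000-annotation file might be used to study annotation processes, the sketch below computes the share of items on which all annotations agree. The column names `id` and `annotation` are assumptions made for this example and are not taken from the dataset; check the TSV header and the codebook for the real names before using it.

```python
import csv
import io
from collections import defaultdict


def agreement_rate(tsv_file, id_col="id", label_col="annotation"):
    """Return the fraction of items whose annotations all carry the same label.

    id_col and label_col are hypothetical column names; adjust them to
    match the actual header of the TSV file.
    """
    labels_by_item = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        labels_by_item[row[id_col]].append(row[label_col])
    if not labels_by_item:
        return 0.0
    agreed = sum(1 for labels in labels_by_item.values() if len(set(labels)) == 1)
    return agreed / len(labels_by_item)


# Tiny in-memory sample standing in for the real file (invented rows).
sample = (
    "id\tannotation\n"
    "1\tneutral\n"
    "1\tneutral\n"
    "2\thostility\n"
    "2\tcriticism\n"
)
rate = agreement_rate(io.StringIO(sample))
print(rate)
```

To run it on the downloaded file, pass an open file handle instead of the in-memory sample, for example `open("hs_AsianPrejudice_40kdataset_cleaned_anonymized.tsv", newline="", encoding="utf-8")`.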

The outbreak of COVID-19 has transformed societies across the world as governments tackle the health, economic and social costs of the pandemic. It has also raised concerns about the spread of hateful language and prejudice online, especially hostility directed against East Asia. This data repository is for a classifier that detects and categorizes social media posts from Twitter into four classes: Hostility against East Asia, Criticism of East Asia, Meta-discussions of East Asian prejudice and a neutral class. The classifier achieves an F1 score of 0.83 across all four classes. We provide our final model (coded in Python), as well as a new 20,000 tweet training dataset used to make the classifier, two analyses of hashtags associated with East Asian prejudice and the annotation codebook. The classifier can be implemented by other researchers, assisting with both online content moderation processes and further research into the dynamics, prevalence and impact of East Asian prejudice online during this global pandemic.
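The description reports a single F1 score across four classes but does not say how the per-class scores were averaged. The sketch below shows one common choice, macro-averaged F1, purely as an illustration; the class names and the toy labels are invented for the example, and this is not necessarily the averaging the authors used.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical short names for the four classes described above.
classes = ["hostility", "criticism", "discussion", "neutral"]
y_true = ["hostility", "criticism", "discussion", "neutral", "neutral", "hostility"]
y_pred = ["hostility", "criticism", "neutral", "neutral", "neutral", "criticism"]

score = macro_f1(y_true, y_pred, classes)
print(round(score, 4))
```

Macro averaging weights every class equally, so a rare class that the model handles poorly pulls the score down as much as a common one would.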

This work is a collaboration between The Alan Turing Institute and the Oxford Internet Institute. It was funded by the Criminal Justice Theme of the Alan Turing Institute under Wave 1 of The UKRI Strategic Priorities Fund, EPSRC Grant EP/T001569/1.

Files (1.3 GB)

Name	MD5	Size
East-asian-prejudice-model-RoBERTa.zip	57a727fc7b3544fa34bf47c1ef188d27	1.3 GB
hashtags_stanceTowardsCoronaAndEastAsia.zip	ef3e61fe43bff5d9dbd3a4fdcc0924fa	140.5 kB
hs_AsianPrejudice_20kdataset_cleaned_anonymized.tsv	95156c5f82645c80d9a5d3bff6d31299	15.4 MB
hs_AsianPrejudice_40kdataset_cleaned_anonymized.tsv	e10dc78dfb6d0f8c03df76525ca1a66b	24.2 MB
hs_AsianPrejudice_Codebook_vShare.pdf	cded948767c4d2b1cf8993e98ff184a5	425.5 kB
hs_AsianPrejudice_hashtagsThematicReplacements.csv	b692350ad22871c4e5268e41a03e1e0c	36.8 kB
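The MD5 values listed with each file can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `md5_of_file` and the temporary demo file are introduced here for illustration only.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks
    so that large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-contained demo on a throwaway file (not one of the dataset files).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

For a real download, compare the result with the listed value, for example `md5_of_file("hs_AsianPrejudice_Codebook_vShare.pdf") == "cded948767c4d2b1cf8993e98ff184a5"`. A mismatch suggests the file should be downloaded again.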

	All versions	This version
Views	4,958	4,915
Downloads	2,724	2,720
Data volume	596.9 GB	596.8 GB
