Dataset Open Access
Bertie Vidgen; Austin Botelho; David Broniatowski; Ella Guest; Matthew Hall; Helen Margetts; Rebekah Tromble; Zeerak Waseem; Scott Hale
The outbreak of COVID-19 has transformed societies across the world as governments tackle the health, economic and social costs of the pandemic. It has also raised concerns about the spread of hateful language and prejudice online, especially hostility directed against East Asia. This data repository accompanies a classifier that detects and categorizes Twitter posts into four classes: Hostility against East Asia, Criticism of East Asia, Meta-discussions of East Asian prejudice, and a neutral class. The classifier achieves an F1 score of 0.83 across all four classes.

This repository contains our final model (coded in Python), a new 20,000-tweet training dataset used to build the classifier, two analyses of hashtags associated with East Asian prejudice, and the annotation codebook. Other researchers can apply the classifier to assist with online content moderation and to support further research into the dynamics, prevalence and impact of East Asian prejudice online during this global pandemic.
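As a rough illustration of how a single F1 figure can summarize a four-class classifier, the sketch below computes per-class F1 and their unweighted (macro) average in plain Python. The description does not say which averaging scheme produced the reported 0.83, so macro averaging is an assumption here, and the short label strings and toy predictions are hypothetical stand-ins rather than values from the dataset.

```python
# Hypothetical short names for the four classes described above.
LABELS = ["hostility", "criticism", "meta_discussion", "neutral"]


def per_class_f1(y_true, y_pred, labels):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall == 0:
            scores[label] = 0.0
        else:
            scores[label] = 2 * precision * recall / (precision + recall)
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


if __name__ == "__main__":
    # Toy gold labels and predictions, invented for illustration only.
    gold = ["hostility", "criticism", "neutral", "neutral", "meta_discussion"]
    pred = ["hostility", "neutral", "neutral", "neutral", "meta_discussion"]
    print(per_class_f1(gold, pred, LABELS))
    print(macro_f1(gold, pred, LABELS))
```

Macro averaging weights every class equally, which matters when the neutral class is much larger than the others; a micro or weighted average would give a different number on the same predictions.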
This work is a collaboration between The Alan Turing Institute and the Oxford Internet Institute. It was funded by the Criminal Justice Theme of The Alan Turing Institute under Wave 1 of the UKRI Strategic Priorities Fund, EPSRC Grant EP/T001569/1.
Name | Size | MD5 |
---|---|---|
East-asian-prejudice-model-RoBERTa.zip | 1.3 GB | 57a727fc7b3544fa34bf47c1ef188d27 |
hashtags_stanceTowardsCoronaAndEastAsia.zip | 140.5 kB | ef3e61fe43bff5d9dbd3a4fdcc0924fa |
hs_AsianPrejudice_20kdataset_cleaned_anonymized.tsv | 15.4 MB | 95156c5f82645c80d9a5d3bff6d31299 |
hs_AsianPrejudice_40kdataset_cleaned_anonymized.tsv | 24.2 MB | e10dc78dfb6d0f8c03df76525ca1a66b |
hs_AsianPrejudice_Codebook_vShare.pdf | 425.5 kB | cded948767c4d2b1cf8993e98ff184a5 |
hs_AsianPrejudice_hashtagsThematicReplacements.csv | 36.8 kB | b692350ad22871c4e5268e41a03e1e0c |
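Each file in the listing is published with an MD5 digest, which can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the digests are copied from the listing, while the function names and the assumption that files are saved under their original names are illustrative choices, not part of the repository.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "East-asian-prejudice-model-RoBERTa.zip": "57a727fc7b3544fa34bf47c1ef188d27",
    "hashtags_stanceTowardsCoronaAndEastAsia.zip": "ef3e61fe43bff5d9dbd3a4fdcc0924fa",
    "hs_AsianPrejudice_20kdataset_cleaned_anonymized.tsv": "95156c5f82645c80d9a5d3bff6d31299",
    "hs_AsianPrejudice_40kdataset_cleaned_anonymized.tsv": "e10dc78dfb6d0f8c03df76525ca1a66b",
    "hs_AsianPrejudice_Codebook_vShare.pdf": "cded948767c4d2b1cf8993e98ff184a5",
    "hs_AsianPrejudice_hashtagsThematicReplacements.csv": "b692350ad22871c4e5268e41a03e1e0c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the 1.3 GB model archive need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the digest listed for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"No listed digest for file name: {path.name}")
    return md5_of_file(path) == expected
```

For example, `verify_download("hs_AsianPrejudice_Codebook_vShare.pdf")` returns `True` only if the local copy matches the published digest. MD5 is adequate for detecting accidental corruption but is not a safeguard against deliberate tampering.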