Antisemitism on Twitter: A Dataset for Machine Learning and Text Analytics
- 1. Indiana University Bloomington
- 2. Technical University Berlin
Description
# Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset:
The ISCA project compiled this dataset using an annotation portal, in which annotators labeled tweets as either antisemitic or non-antisemitic, among other labels. Please note that the annotation was done with live data, so annotators could see images and context such as threads. The original data was sourced from annotationportal.com.
# Content:
This dataset contains 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021. The tweets are drawn from representative samples of tweets that matched relevant keywords during this period. 1,250 tweets (18%) meet the IHRA definition of antisemitic messages.
By year, the tweets are distributed as follows: 1,499 (22%) from 2019, 3,716 (54%) from 2020, and 1,726 (25%) from 2021. By keyword, 4,605 (66%) contain "Jews," 1,524 (22%) include "Israel," 529 (8%) feature the derogatory term "ZioNazi*," and 283 (4%) use the slur "K---s." Some tweets may contain multiple keywords.
483 out of the 4,605 tweets with the keyword "Jews" (11%) and 203 out of the 1,524 tweets with the keyword "Israel" (13%) were classified as antisemitic. 97 out of the 283 tweets using the antisemitic slur "K---s" (34%) are antisemitic. Interestingly, many tweets featuring the slur "K---s" actually call out its usage. In contrast, the majority of tweets with the derogatory term "ZioNazi*" are antisemitic, with 467 out of 529 (88%) being classified as such.
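The counts reported above can be cross-checked with a few lines of Python. This is only a sketch that restates the numbers given in this description; it does not read the CSV file, and the names `KEYWORD_COUNTS`, `YEAR_COUNTS`, and `share` are illustrative rather than part of the dataset.

```python
# Counts as reported in the description: keyword -> (antisemitic, total).
KEYWORD_COUNTS = {
    "Jews": (483, 4605),
    "Israel": (203, 1524),
    "K---s": (97, 283),
    "ZioNazi*": (467, 529),
}
# Tweets per year, as reported in the description.
YEAR_COUNTS = {2019: 1499, 2020: 3716, 2021: 1726}

total_tweets = sum(total for _, total in KEYWORD_COUNTS.values())
total_antisemitic = sum(pos for pos, _ in KEYWORD_COUNTS.values())


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


for keyword, (pos, total) in KEYWORD_COUNTS.items():
    print(f"{keyword:10s} {pos:5d} / {total:5d} = {share(pos, total):5.1f}% antisemitic")

print(f"all keywords: {total_antisemitic} / {total_tweets}")
print(f"sum over years: {sum(YEAR_COUNTS.values())}")
```

The keyword totals add up to 6,941 and the antisemitic counts add up to 1,250, matching the overall figures given in the Content section.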
File Description:
The dataset is provided as a CSV file, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:
‘TweetID’: Represents the tweet ID.
‘Username’: Represents the username of the account that published the tweet.
‘Text’: Represents the full text of the tweet (not pre-processed).
‘CreateDate’: Represents the date the tweet was created.
‘Biased’: Represents the label assigned by our annotators, indicating whether the tweet is antisemitic or non-antisemitic.
‘Keyword’: Represents the keyword that was used in the query. The keyword can be in the text, including mentioned names, or the username.
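As a minimal sketch of how a file with these columns might be read, the following uses Python's standard `csv` module. The sample rows are invented for illustration. The date format, the `0`/`1` encoding of ‘Biased’, and the helper name `read_tweets` are assumptions, not taken from the dataset. Tweet IDs are kept as strings so that long numeric IDs are not altered by numeric conversion.

```python
import csv
import io

# Column names as listed in the file description.
EXPECTED_COLUMNS = ["TweetID", "Username", "Text", "CreateDate", "Biased", "Keyword"]


def read_tweets(handle):
    """Read all rows from an open CSV handle and check the documented columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)  # every value stays a string, including TweetID


# Invented sample rows; the real value formats may differ.
SAMPLE = """TweetID,Username,Text,CreateDate,Biased,Keyword
1000000000000000001,user_a,"Example text, with a comma",2020-05-01,0,Jews
1000000000000000002,user_b,Another example,2021-02-10,1,Israel
"""

rows = read_tweets(io.StringIO(SAMPLE))
for row in rows:
    print(row["TweetID"], row["Keyword"], row["Biased"])

# For a file on disk (the path and encoding here are assumptions):
# with open("path/to/dataset.csv", newline="", encoding="utf-8") as f:
#     rows = read_tweets(f)
```

Because ‘Text’ is not pre-processed, fields may contain commas or line breaks; opening the file with `newline=""` lets the `csv` module handle quoted multi-line fields.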
Licences
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)
R code is published under the terms of the "MIT" licence (https://opensource.org/licenses/MIT)
Acknowledgements
We are grateful for the support of Indiana University’s Observatory on Social Media (OSoMe) (Davis et al. 2016) and the contributions and annotations of all team members in our Social Media & Hate Research Lab at Indiana University’s Institute for the Study of Contemporary Antisemitism, especially Grace Bland, Elisha S. Breton, Kathryn Cooper, Robin Forstenhäusler, Sophie von Máriássy, Mabel Poindexter, Jenna Solomon, Clara Schilling, and Victor Tschiskale.
This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
Files
GoldStander_Dataset.zip
Files (1.6 MB)
Checksum | Size |
---|---|
md5:31dd1da81f871f0440db7330df38ea07 | 752.4 kB |
md5:43bf8697470d39db411e504bf11130de | 813.9 kB |
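The MD5 checksums listed above can be used to confirm that a download is intact. The following is a sketch using Python's standard `hashlib`; the function names are illustrative, and because the listing does not say which checksum belongs to which file, the check only tests whether a file matches either published value.

```python
import hashlib

# Checksums as listed on the record page; the file each one belongs to is not stated.
PUBLISHED_MD5 = {
    "31dd1da81f871f0440db7330df38ea07",
    "43bf8697470d39db411e504bf11130de",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path):
    """Return True if the file's MD5 equals one of the published checksums."""
    return md5_of_file(path) in PUBLISHED_MD5
```

For example, `matches_published("GoldStander_Dataset.zip")` would return `True` for an unmodified copy, assuming the archive was saved under that name in the working directory.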