Antisemitism on Twitter: A Dataset for Machine Learning and Text Analytics
- 1. Indiana University Bloomington
- 2. Technical University Berlin
Description
Dataset from the Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University:
The Social Media & Hate research lab at the Institute for the Study of Contemporary Antisemitism compiled this dataset using an annotation portal (Jikeli, Soemer, and Karali 2024) in which tweets were labeled as antisemitic or non-antisemitic, among other labels. Annotation was done on live data, including images and context such as threads. All data was annotated by two experts, and all discrepancies were discussed (Jikeli et al. 2023).
Content:
This dataset contains 11311 tweets covering a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and April 2023. The tweets are random samples drawn from all tweets that matched relevant keywords during this time period. 1953 tweets (17%) are antisemitic according to the IHRA definition of antisemitism.
The distribution of tweets by year is as follows: 1499 (13%) from 2019, 3712 (33%) from 2020, 2591 (23%) from 2021, 2644 (23%) from 2022, and 865 (8%) from 2023. 6365 (56%) contain the keyword "Jews," 4134 (37%) include "Israel," 529 (5%) feature the derogatory term "ZioNazi*," and 283 (3%) use the slur "K---s." Some tweets may contain multiple keywords.
725 out of the 6365 tweets with the keyword "Jews" (11%) and 664 out of the 4134 tweets with the keyword "Israel" (16%) were classified as antisemitic. 97 out of the 283 tweets using the antisemitic slur "K---s" (34%) are antisemitic. Interestingly, many tweets featuring the slur "K---s" actually call out its use. In contrast, the majority of tweets using the derogatory term "ZioNazi*" are antisemitic, with 467 out of 529 (88%) being classified as such.
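The percentages above can be rechecked from the raw counts. The following is a minimal sketch that only restates the figures given in this description; the names `KEYWORD_COUNTS`, `YEAR_COUNTS`, `percent`, and `keyword_shares` are illustrative and not part of the dataset.

```python
# Counts copied from the description: keyword -> (antisemitic tweets, all tweets).
KEYWORD_COUNTS = {
    "Jews": (725, 6365),
    "Israel": (664, 4134),
    "K---s": (97, 283),
    "ZioNazi*": (467, 529),
}
# Tweets per year, also copied from the description.
YEAR_COUNTS = {2019: 1499, 2020: 3712, 2021: 2591, 2022: 2644, 2023: 865}
TOTAL_TWEETS = 11311
TOTAL_ANTISEMITIC = 1953


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


def keyword_shares() -> dict:
    """Share of antisemitic tweets per keyword, in whole percent."""
    return {kw: percent(anti, total) for kw, (anti, total) in KEYWORD_COUNTS.items()}


if __name__ == "__main__":
    print("overall:", percent(TOTAL_ANTISEMITIC, TOTAL_TWEETS), "%")
    for keyword, share in keyword_shares().items():
        print(keyword, share, "%")
    for year, count in YEAR_COUNTS.items():
        print(year, percent(count, TOTAL_TWEETS), "%")
```

In this tally the per-keyword counts add up to the dataset totals (11311 tweets, 1953 antisemitic), which suggests that each tweet is counted once, under the keyword recorded for it.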
File Description:
The dataset is provided as a CSV file in which each row represents a single message; messages include replies, quotes, and retweets. The file contains the following columns:
‘ID’: Represents the tweet ID.
‘Username’: Represents the username that posted the tweet.
‘Text’: Represents the full text of the tweet (not pre-processed).
‘CreateDate’: Represents the date on which the tweet was created.
‘Biased’: Represents the label assigned by our annotators, indicating whether the tweet is antisemitic or not.
‘Keyword’: Represents the keyword that was used in the query. The keyword can be in the text, including hashtags, mentioned users, or the username itself.
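As an illustration of the column layout above, here is a minimal sketch that reads the file with Python's standard `csv` module and tallies labels per keyword. It assumes the header row uses exactly the column names listed above and that the file is UTF-8 encoded. The description does not state how the ‘Biased’ values are encoded, so the sketch counts the raw values without interpreting them. The helper names and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
from collections import Counter

EXPECTED_COLUMNS = ["ID", "Username", "Text", "CreateDate", "Biased", "Keyword"]


def read_rows(lines):
    """Parse CSV input (an open file or any iterable of lines) into dicts."""
    reader = csv.DictReader(lines)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


def label_counts_by_keyword(rows):
    """Count the raw 'Biased' values per 'Keyword' without interpreting them."""
    counts = {}
    for row in rows:
        counts.setdefault(row["Keyword"], Counter())[row["Biased"]] += 1
    return counts


# Made-up sample rows; the 0/1 label encoding here is an assumption.
SAMPLE = """ID,Username,Text,CreateDate,Biased,Keyword
1,user_a,"first example, with a comma",2020-01-01,0,Jews
2,user_b,second example,2021-06-15,1,Israel
3,user_c,third example,2022-03-09,0,Jews
"""

if __name__ == "__main__":
    rows = read_rows(SAMPLE.splitlines())
    print(label_counts_by_keyword(rows))
    # For the real file, open it with newline="" so that line breaks
    # inside tweet text are handled by the csv module:
    # with open("GoldStandard2024.csv", newline="", encoding="utf-8") as f:
    #     rows = read_rows(f)
```

`csv.DictReader` keeps every value as a string, so long tweet IDs are not rounded the way they can be when a tool parses them as floating-point numbers.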
Licences
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0).
Acknowledgements
We are grateful for the support of Indiana University’s Observatory on Social Media (OSoMe) (Davis et al. 2016) and the contributions and annotations of all team members in our Social Media & Hate Research Lab at Indiana University’s Institute for the Study of Contemporary Antisemitism, especially Grace Bland, Elisha S. Breton, Kathryn Cooper, Robin Forstenhäusler, Sophie von Máriássy, Mabel Poindexter, Jenna Solomon, Clara Schilling, and Victor Tschiskale.
This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
Files
GoldStandard2024.csv (3.1 MB)
md5:7dc2d215d5ea5a16756dde4609dd50fc
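The MD5 checksum listed for GoldStandard2024.csv can be used to confirm that a download is complete and unmodified. A minimal sketch using Python's standard `hashlib` module follows; the function names are illustrative, and the local path passed to them is whatever location the file was saved to.

```python
import hashlib

# Checksum as listed in this record for GoldStandard2024.csv.
EXPECTED_MD5 = "7dc2d215d5ea5a16756dde4609dd50fc"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("GoldStandard2024.csv")` should return `True` for an intact copy saved under that name in the working directory.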
References
- Jikeli, Gunther, Sameer Karali, Daniel Miehling, and Katharina Soemer (2023): Antisemitic Messages? A Guide to High-Quality Annotation and a Labeled Dataset of Tweets. https://arxiv.org/abs/2304.14599
- Jikeli, Gunther, Katharina Soemer, and Sameer Karali (2024): Annotating live messages on social media. Testing the efficiency of the AnnotHate – live data annotation portal. Journal of Computational Social Science 7, 571–585.