Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A Dataset for Machine Learning and Text Analytics

Jikeli, Gunther; Karali, Sameer; Soemer, Katharina

doi:10.5281/zenodo.10812805

Published March 13, 2023 | Version v2

Dataset Open


1. Indiana University Bloomington

Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset on bias against Asians, Blacks, Jews, Latines, and Muslims

Description

The dataset is a product of a research project at Indiana University on biased messages on Twitter against ethnic and religious minorities. We scraped all live messages with the keywords "Asians, Blacks, Jews, Latinos, and Muslims" from the Twitter archive in 2020, 2021, and 2022.

Random samples of 600 tweets, including retweets, were created for each keyword and year. The samples were annotated in subsamples of 100 tweets by undergraduate students in Professor Gunther Jikeli's class 'Researching White Supremacism and Antisemitism on Social Media' in the fall of 2022 and the fall of 2023. A total of 120 students participated in 2022 and annotated the datasets from 2020 and 2021. In 2023, 134 students participated and annotated the datasets from 2021 and 2022. The annotation was done using the Annotation Portal (Jikeli, Soemer and Karali, 2024). The updated version of our portal, AnnotHate, is now publicly available. Each subsample was annotated by an average of 5.65 students in 2022 and 8.32 students in 2023, with a range of three to ten and three to thirteen students, respectively. Annotation included questions about bias and calling out bias.

Annotators rated each tweet on a bias scale from 1 to 5 (confident not biased, probably not biased, don't know, probably biased, confident biased), using definitions of bias against each ethnic or religious group that can be found in the research reports from 2022 and 2023. If the annotators interpreted a message as biased according to the definition, they were instructed to choose the specific stereotype from the definition that was most applicable. Tweets that denounced bias against a minority were labeled as "calling out bias".

The label was determined by a 75% majority vote. We classified “probably biased” and “confident biased” as biased, and “confident not biased,” “probably not biased,” and “don't know” as not biased.
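The labeling rule above can be sketched in a few lines of Python. This is an illustration only, not the project's actual code: the function names are hypothetical, and it assumes that the 75% threshold refers to the share of a tweet's annotators who chose "probably biased" (4) or "confident biased" (5), and that exactly 75% counts as reaching the threshold.

```python
from typing import Sequence

# Ratings 4 ("probably biased") and 5 ("confident biased") count as biased;
# ratings 1-3 count as not biased, as described above.
BIASED_RATINGS = {4, 5}
VALID_RATINGS = {1, 2, 3, 4, 5}


def is_biased(rating: int) -> bool:
    """Collapse one annotator's 1-5 rating into a binary vote."""
    if rating not in VALID_RATINGS:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return rating in BIASED_RATINGS


def majority_label(ratings: Sequence[int], threshold: float = 0.75) -> int:
    """Return 1 if the share of 'biased' votes reaches the threshold, else 0.

    Assumption: the threshold is inclusive (>=). The dataset description
    does not say how a share of exactly 75% was handled.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    biased_votes = sum(is_biased(r) for r in ratings)
    return int(biased_votes / len(ratings) >= threshold)


print(majority_label([5, 4, 4, 2]))  # three of four annotators voted biased
print(majority_label([5, 4, 3, 1]))  # only two of four voted biased
```

Under these assumptions, a "don't know" rating works against a biased label, because it is counted together with the not-biased votes.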

The stereotypes about the different minorities varied. About a third of all biased tweets were classified as general 'hate' towards the minority. The nature of specific stereotypes varied by group. Asians were blamed for the Covid-19 pandemic, alongside positive but harmful stereotypes about their perceived excessive privilege. Black people were associated with criminal activity and were subjected to views that portrayed them as inferior. Jews were depicted as wielding undue power and were collectively held accountable for the actions of the Israeli government. In addition, some tweets denied the Holocaust. Hispanic people/Latines faced accusations of being undocumented immigrants and "invaders," along with persistent stereotypes of them as lazy, unintelligent, or having too many children. Muslims were often collectively blamed for acts of terrorism and violence, particularly in discussions about Muslims in India.

The annotation results from the two cohorts (Class of 2022 and Class of 2023) will not be merged. They can be identified by the "Cohort" column. While both cohorts annotated the same data from 2021,* their annotation results differ. The class of 2022 identified more tweets as biased for the keywords "Asians, Latinos, and Muslims" than the class of 2023, but nearly all of the tweets identified by the class of 2023 were also identified as biased by the class of 2022. The percentage of biased tweets with the keyword "Blacks" remained nearly the same.

*Due to a sampling error for the keyword "Jews" in 2021, the data are not identical between the two cohorts. The 2022 cohort annotated two samples for the keyword "Jews", one from 2020 and the other from 2021, while the 2023 cohort annotated samples from 2021 and 2022. The 2021 sample for the keyword "Jews" that the 2022 cohort annotated was not representative. It has only 453 tweets from 2021 and 147 from the first eight months of 2022, and it includes some tweets from the query with the keyword "Israel". The 2021 sample for the keyword "Jews" that the 2023 cohort annotated was drawn proportionally for each trimester of 2021.

Content

Cohort 2022

This dataset contains 5880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1 %) are labeled as biased and 5523 (93.9 %) are labeled as not biased. 1365 tweets (23.2 %) are labeled as calling out or denouncing bias.

1180 out of 5880 tweets (20.1 %) contain the keyword "Asians," 590 were posted in 2020 and 590 in 2021. 39 tweets (3.3 %) are biased against Asian people. 370 tweets (31.4 %) call out bias against Asians.

1160 out of 5880 tweets (19.7 %) contain the keyword "Blacks," 578 were posted in 2020 and 582 in 2021. 101 tweets (8.7 %) are biased against Black people. 334 tweets (28.8 %) call out bias against Blacks.

1189 out of 5880 tweets (20.2 %) contain the keyword "Jews," 592 were posted in 2020, 451 in 2021, and, as mentioned above, 146 in 2022. 83 tweets (7 %) are biased against Jewish people. 220 tweets (18.5 %) call out bias against Jews.

1169 out of 5880 tweets (19.9 %) contain the keyword "Latinos," 584 were posted in 2020 and 585 in 2021. 29 tweets (2.5 %) are biased against Latines. 181 tweets (15.5 %) call out bias against Latines.

1182 out of 5880 tweets (20.1 %) contain the keyword "Muslims," 593 were posted in 2020 and 589 in 2021. 105 tweets (8.9 %) are biased against Muslims. 260 tweets (22 %) call out bias against Muslims.

Cohort 2023

This dataset contains 5363 tweets with the keywords "Asians, Blacks, Jews, Latinos, and Muslims" from 2021 and 2022. 261 tweets (4.9 %) are labeled as biased and 5102 tweets (95.1 %) are labeled as not biased. 975 tweets (18.1 %) are labeled as calling out or denouncing bias.

1068 out of 5363 tweets (19.9 %) contain the keyword "Asians," 559 were posted in 2021 and 509 in 2022. 42 tweets (3.9 %) are biased against Asian people. 280 tweets (26.2 %) call out bias against Asians.

1130 out of 5363 tweets (21.1 %) contain the keyword "Blacks," 586 were posted in 2021 and 544 in 2022. 76 tweets (6.7 %) are biased against Black people. 146 tweets (12.9 %) call out bias against Blacks.

971 out of 5363 tweets (18.1 %) contain the keyword "Jews," 460 were posted in 2021 and 511 in 2022. 49 tweets (5 %) are biased against Jewish people. 201 tweets (20.7 %) call out bias against Jews.

1072 out of 5363 tweets (19.9 %) contain the keyword "Latinos," 583 were posted in 2021 and 489 in 2022. 32 tweets (2.9 %) are biased against Latines. 108 tweets (10.1 %) call out bias against Latines.

1122 out of 5363 tweets (20.9 %) contain the keyword "Muslims," 576 were posted in 2021 and 546 in 2022. 62 tweets (5.5 %) are biased against Muslims. 240 tweets (21.3 %) call out bias against Muslims.
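The per-keyword counts listed for both cohorts can be cross-checked against the cohort totals. The following is a minimal sketch using only the numbers stated above; the helper names `column_sums` and `share` are hypothetical. Shares recomputed this way may differ from the printed percentages in the last digit, depending on how the values were rounded.

```python
# (tweets, biased, calling out) per keyword, copied from the counts above.
COHORTS = {
    "2022": {
        "total": (5880, 357, 1365),
        "keywords": {
            "Asians": (1180, 39, 370),
            "Blacks": (1160, 101, 334),
            "Jews": (1189, 83, 220),
            "Latinos": (1169, 29, 181),
            "Muslims": (1182, 105, 260),
        },
    },
    "2023": {
        "total": (5363, 261, 975),
        "keywords": {
            "Asians": (1068, 42, 280),
            "Blacks": (1130, 76, 146),
            "Jews": (971, 49, 201),
            "Latinos": (1072, 32, 108),
            "Muslims": (1122, 62, 240),
        },
    },
}


def column_sums(keywords: dict) -> tuple:
    """Add up tweets, biased tweets, and calling-out tweets over all keywords."""
    return tuple(sum(counts[i] for counts in keywords.values()) for i in range(3))


def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for name, cohort in COHORTS.items():
    consistent = column_sums(cohort["keywords"]) == cohort["total"]
    tweets, biased, _ = cohort["total"]
    print(f"Cohort {name}: sums match totals = {consistent}, "
          f"biased share = {share(biased, tweets)} %")
```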

File Description

The dataset is provided as a CSV file, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:

'TweetID': Represents the tweet ID.

'Username': Represents the username of the account that published the tweet (if it is a retweet, it is the user who retweeted the original tweet).

'Text': Represents the full text of the tweet (not pre-processed).

'CreateDate': Represents the date the tweet was created.

'Biased': Represents the label assigned by our annotators, indicating whether the tweet is biased (1) or not (0).

'Calling_Out': Represents the label assigned by our annotators, indicating whether the tweet is calling out bias against minority groups (1) or not (0).

'Keyword': Represents the keyword that was used in the query. The keyword can be in the text, including mentioned names, or the username.

'Cohort': Represents the year the data was annotated (class of 2022 or class of 2023).
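A file with these columns can be read with Python's standard `csv` module, which keeps every field as text and therefore preserves the full tweet IDs. The sketch below is a hedged illustration: the function names are hypothetical, the two sample rows are invented placeholders rather than real data, and the exact encoding of the 'Cohort' and 'CreateDate' values in the real file is an assumption.

```python
import csv
import io
from collections import Counter


def read_rows(handle) -> list:
    """Read the CSV and convert the two label columns to integers.

    'TweetID' is deliberately left as a string so that no digits are lost.
    """
    rows = []
    for row in csv.DictReader(handle):
        row["Biased"] = int(row["Biased"])
        row["Calling_Out"] = int(row["Calling_Out"])
        rows.append(row)
    return rows


def biased_by_keyword(rows: list) -> dict:
    """Count tweets labeled as biased for each query keyword."""
    counts = Counter()
    for row in rows:
        counts[row["Keyword"]] += row["Biased"]
    return dict(counts)


# Invented placeholder rows that only mimic the documented column layout.
SAMPLE = io.StringIO(
    "TweetID,Username,Text,CreateDate,Biased,Calling_Out,Keyword,Cohort\n"
    "1234567890123456789,user_a,placeholder text,2021-01-01,0,1,Asians,2022\n"
    "1234567890123456790,user_b,another placeholder,2021-02-01,1,0,Jews,2022\n"
)

rows = read_rows(SAMPLE)
print(rows[0]["TweetID"])
print(biased_by_keyword(rows))
```

For the real file, the same function could be called with something like `open("ClassData2022and2023.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and may need to be adjusted.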

Acknowledgements

We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin Ray, Kaleb Samuels, Chloe Sherman, Rachel Weber, Molly Winkeljohn, Ally Wolfgang, Rowan Wolke, Michael Wong, Jane Woods, Kaleb Woodworth, Aurora Young, Sydney Allen, Hundre Askie, Norah Bardol, Olivia Baren, Samuel Barth, Emma Bender, Noam Biron, Kendyl Bond, Graham Brumley, Kennedi Bruns, Leah Burger, Hannah Busche, Morgan Butrum-Griffith, Zoe Catlin, Angeli Cauley, Nathalya Chavez Medrano, Mia Cooper, Suhani Desai, Isabella Flick, Samantha Garcez, Isabella Grady, Macy Hutchinson, Sarah Kirkman, Ella Leitner, Elle Marquardt, Madison Moss, Ethan Nixdorf, Reya Patel, Mickey Racenstein, Kennedy Rehklau, Grace Roggeman, Jack Rossell, Madeline Rubin, Fernando Sanchez, Hayden Sawyer, Diego Scheker, Lily Schwecke, Brooke Scott, Megan Scott, Samantha Secchi, Jolie Segal, Katherine Smith, Constantine Stefanidis, Cami Stetler, Madisyn West, Alivia Yusefzadeh, Tayssir Aminou, Karen Fecht, Luciana Orrego-Hoyos, Hannah Pickett, and Sophia Tracy.

This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

Notes

Please note that if you open the provided file in Excel, you might need to format the column containing the tweet IDs. To see the full tweet IDs in Excel, open an empty Excel file first and then import the data. We recommend the following steps:

1. Open an empty Excel file.
2. Go to "Data".
3. Select "From Text/CSV".
4. Select the CSV file that you want and click "Import".
5. In the pop-up window, click "Transform Data".
6. In the new pop-up window, mark the column with the IDs and right-click on it. Go to "Change Type" and select "Text".
7. In the new pop-up window, select "Replace current".
8. Finally, click "Close & Load" (upper left).
9. Make sure that the column with the IDs is marked as text (to check: right-click on the column, go to "Format cells", and under "Number" select "Text").
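The reason for treating the IDs as text is numeric precision: Excel stores numbers as double-precision floating-point values and keeps at most 15 significant digits, while recent tweet IDs typically have 19 digits. The short Python sketch below shows the same effect with a made-up ID; the function name and the example ID are hypothetical.

```python
def survives_float_round_trip(tweet_id: str) -> bool:
    """Check whether an ID is unchanged after conversion to a 64-bit float."""
    return int(float(tweet_id)) == int(tweet_id)


example_id = "1234567890123456789"  # made-up 19-digit ID, not from the dataset

print(survives_float_round_trip(example_id))   # long ID: digits are lost
print(survives_float_round_trip("123456789"))  # short number: unchanged
print(int(float(example_id)))                  # value after the round trip
```

Any tool that reads the ID column as a number can alter the IDs in this way, so reading the column as text is the safer default.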

Files

ClassData2022and2023.csv (3.5 MB), md5:6c44e5fa9e916ef16c6da92a4c87a354
