A study on real graphs of fake news spreading on Twitter
Description
*** Fake News on Twitter ***
These five datasets are the result of an empirical study of how recent fake news stories spread on Twitter. In particular, we focus on fake news stories that gave rise to a simultaneous spread of truth countering them. The story behind each fake news item is as follows:
1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
The data was collected in two stages, each yielding a new dataset: 1- obtaining the Dataset of Diffusion (DD), which includes information on the fake news/truth tweets and retweets; 2- querying the neighbors of the tweet spreaders, which yields the Dataset of Graph (DG).
DD
DD for each fake news story is an Excel file named FNx_DD, where x is the number of the fake news story. The structure of the Excel file for each dataset is as follows:
- Each row corresponds to one captured tweet/retweet related to the rumor, and each column gives a specific piece of information about that tweet/retweet. From left to right, the columns are:
- User ID (user who has posted the current tweet/retweet)
- Number of tweets/retweets the user had published at the time of posting the current tweet/retweet
- Language of the tweet/retweet
- Number of followers
- Number of followings (friends)
- Date and time of posting the current tweet/retweet
- Number of likes (favorites) the current tweet had received before it was crawled
- Number of times the current tweet had been retweeted before it was crawled
- Whether the current tweet/retweet contains another tweet (for example, when the current tweet is a quote, reply, or retweet)
- The source (OS) of the device from which the current tweet/retweet was posted
- Tweet/Retweet ID
- Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)
- Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)
- Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied to by the current post)
- Frequency of tweet occurrences, that is, the number of times the current tweet is repeated in the dataset (for example, the number of times the tweet appears in the dataset as a retweet posted by others)
- State of the tweet, which takes one of the following values (determined by agreement between the annotators):
- r : The tweet/retweet is a fake news post
- a : The tweet/retweet is a truth post
- q : The tweet/retweet is a question about the fake news that neither confirms nor denies it
- n : The tweet/retweet is not related to the fake news (it matches the queries related to the rumor but does not refer to the given fake news)
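The column order and state codes above can be captured in a small helper. The following is a minimal sketch, assuming the rows have already been read from the Excel file into plain sequences of 16 cells with a spreadsheet library of your choice. The field names in `DD_COLUMNS` and both function names are hypothetical and do not appear in the dataset files. Whether the files carry a header row is not stated here, so skip it first if present.

```python
from collections import Counter

# Hypothetical field names, in the left-to-right column order described above.
DD_COLUMNS = [
    "user_id", "statuses_count", "language", "followers", "followings",
    "created_at", "likes", "retweets", "has_embedded_tweet", "source",
    "tweet_id", "retweet_id", "quote_id", "reply_id", "frequency", "state",
]

# State codes as defined in the description.
STATE_LABELS = {
    "r": "fake news",
    "a": "truth",
    "q": "question",
    "n": "unrelated",
}


def row_to_record(row):
    """Pair one spreadsheet row (a sequence of 16 cells) with the field names."""
    if len(row) != len(DD_COLUMNS):
        raise ValueError(f"expected {len(DD_COLUMNS)} cells, got {len(row)}")
    return dict(zip(DD_COLUMNS, row))


def count_states(rows):
    """Count rows per state label; unknown codes are kept under their raw value."""
    counts = Counter()
    for row in rows:
        code = str(row_to_record(row)["state"]).strip().lower()
        counts[STATE_LABELS.get(code, code)] += 1
    return counts
```

Calling `count_states(rows)` on the rows of one FNx_DD file gives a quick view of how many captured posts are fake news, truth, questions, or unrelated.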
DG
DG for each fake news story contains two files:
- A file in graph format (.graph), which holds the graph information, such as who is linked to whom. (This file is named FNx_DG.graph, where x is the number of the fake news story.)
- A file in JSONL format (.jsonl), which holds the real user IDs of the nodes in the graph file. (This file is named FNx_Labels.jsonl, where x is the number of the fake news story.)
The labels file is needed because, in the graph file, each node is labeled by its order of entry into the graph. For example, if the node with user ID 12345637 is the first node entered into the graph file, its label in the graph is 0 and its real ID (12345637) is at row number 1 of the jsonl file (row number 0 holds the column labels). The remaining node IDs follow in the subsequent rows, one user ID per row. Therefore, to find the user ID of node 200 (labeled 200 in the graph), look at row number 201 of the jsonl file, which is the 202nd line of the file.
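The lookup from a graph label to a row of the labels file can be sketched as follows. This is an illustration only: it assumes, as described above, that the first line of the file holds the column labels and that every following line is one JSON value for one node, in order of entry into the graph. The exact shape of each record is not specified here, so the sketch returns the parsed record as is. The function names are hypothetical.

```python
import json


def load_label_records(lines):
    """Parse a labels file given as an iterable of lines.

    The first line (column labels) is skipped without being parsed.
    Blank lines are ignored; they are assumed to occur only at the end.
    """
    records = []
    for line_number, line in enumerate(lines):
        if line_number == 0:
            continue  # row 0 holds the column labels
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def record_for_node(records, node_label):
    """Return the record of the node with the given graph label.

    With the column-label row already dropped, label 0 maps to index 0,
    which corresponds to row number 1 of the original file.
    """
    return records[node_label]
```

A typical call would open one labels file, for example `with open("FN1_Labels.jsonl") as handle: records = load_label_records(handle)`, and then use `record_for_node(records, 200)` to read the entry for node 200.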
The user IDs of spreaders in DG (those who have a post in DD) also appear in DD, where extra information about them and their tweets/retweets is available. The other user IDs in DG are neighbors of these spreaders and might not exist in DD.
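Separating spreaders from their neighbors therefore amounts to a membership test against the DD user IDs. The sketch below assumes both sets of IDs have already been loaded into Python sequences. The function name is hypothetical, and IDs are compared as strings on the assumption that the two files may store them with different types.

```python
def split_spreaders(dg_user_ids, dd_user_ids):
    """Split DG user IDs into spreaders (present in DD) and neighbors (absent)."""
    # Compare as strings so that 123 and "123" are treated as the same ID.
    dd_ids = {str(user_id) for user_id in dd_user_ids}
    spreaders = [u for u in dg_user_ids if str(u) in dd_ids]
    neighbors = [u for u in dg_user_ids if str(u) not in dd_ids]
    return spreaders, neighbors
```

Both returned lists keep the order of `dg_user_ids`, so positions can still be related to graph labels if needed.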
Files
(606.9 MB)
Checksum | Size |
---|---|
md5:d57d6f039871111e5746e6f461123d90 | 56.8 kB |
md5:7c56c7e9b72e1f1a2391cd82af604d18 | 69.8 MB |
md5:dda1813ccccd07187699e2633f458eba | 14.5 MB |
md5:d6a3220506f2cce316f755d16638c078 | 104.5 kB |
md5:25c6098ab3985a2369fbe49ed1990c94 | 109.7 MB |
md5:536b9fdfb15468507dc1c8008bf1827f | 23.7 MB |
md5:cc6cb1bf5ebdd27010e76cbef80227c5 | 70.8 kB |
md5:79dac761c08c8b2adb7dbd66d4a788e0 | 135.0 MB |
md5:850be8666ea2e787b40b302683f0a216 | 36.6 MB |
md5:08a28cb2500ab2446477980e6f863433 | 77.1 kB |
md5:589807d14caf63835148ecf4bcbc7694 | 47.1 MB |
md5:565e68c7fa12d139d33f017ad10afa65 | 12.1 MB |
md5:32a5c83a4672d5b6d6e975f4f9f74c97 | 124.3 kB |
md5:3adb2ed8ceed78b27602b7ed0e33624b | 137.7 MB |
md5:e853d5f2f19c72c7882884d7a47c2afc | 20.3 MB |
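A downloaded file can be checked against its listed MD5 checksum before use. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and the file is read in chunks because several of the files exceed 100 MB.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum such as 'md5:d57d6f...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(path, "md5:d57d6f039871111e5746e6f461123d90")` returns `True` only if the file at `path` is byte-for-byte the file that checksum was listed for.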