ABOME: A Multi-platform Data Repository of Artificially Boosted Online Media Entities
Description
Motivation
The rise of online media has enabled users to choose various unethical and artificial ways of gaining social growth to boost their credibility (number of followers/retweets/views/likes/subscriptions) within a short time period. In this work, we present ABOME, a novel data repository consisting of datasets collected from multiple platforms for the analysis of blackmarket-driven collusive activities, which are prevalent but often unnoticed in online media. ABOME contains data related to tweets and users on Twitter, YouTube videos, and YouTube channels. We believe ABOME is a unique data repository that one can leverage to identify and analyze blackmarket-based temporal fraudulent activities in online media as well as the underlying network dynamics.
License
Creative Commons License.
Description of the dataset
In this work, we focused on collecting data from credit-based freemium services. We divide the datasets into two parts:
- Historical Data (historical_anon.zip)
This consists of all the data for Twitter and YouTube from blackmarket services, gathered via sequential querying of the websites' URLs between March and June 2019. We collected the metadata of each entity present in the historical data.
Twitter:
We collected the following fields for retweets and followers on Twitter:
user_details
: A JSON object representing a Twitter user.
tweet_details
: A JSON object representing a tweet.
tweet_retweets
: A JSON list of tweet objects representing the most recent 100 retweets of a given tweet.
The user and tweet objects follow the Twitter developer documentation:
- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object
- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
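As a minimal sketch of working with the Twitter portion of the historical data, the snippet below parses a trimmed-down, hypothetical `user_details` record and a `tweet_retweets` list (real files follow the full Twitter API object layout referenced above):

```python
import json

# Hypothetical, trimmed-down user_details record; real files contain the
# full (anonymized) Twitter user object.
user_json = '{"id_str": "100", "followers_count": 42, "statuses_count": 7}'
user = json.loads(user_json)

# tweet_retweets files hold a JSON list of up to 100 retweet objects.
retweets_json = ('[{"id_str": "1", "user": {"id_str": "200"}},'
                 ' {"id_str": "2", "user": {"id_str": "201"}}]')
retweets = json.loads(retweets_json)

# Count distinct users among the collected retweets.
retweeters = {rt["user"]["id_str"] for rt in retweets}
print(user["followers_count"], len(retweeters))  # 42 2
```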
YouTube:
We collected the following fields for YouTube likes and comments:
is_family_friendly:
Whether the video is marked as family friendly or not.
genre:
Genre of the video.
duration:
Duration of the video in ISO 8601 duration format (e.g., PT4M13S denotes 4 minutes and 13 seconds).
description:
Description of the video.
upload_date:
Date that the video was uploaded.
is_paid:
Whether the video is paid or not.
is_unlisted:
The privacy status of the video, i.e., whether the video is unlisted or not. Here, the flag unlisted indicates that the video can only be accessed by people who have a direct link to it.
statistics:
A JSON object containing the number of dislikes, views and likes for the video.
comments:
A list of comments for the video. Each element in the list is a JSON object of the text (the comment text) and time (the time when the comment was posted).
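To illustrate the video fields above, the sketch below parses a hypothetical, trimmed-down video record and converts its ISO 8601 duration to seconds; the helper function and the sample values are illustrative, not part of the dataset:

```python
import json
import re

def iso8601_duration_to_seconds(duration):
    """Convert a simple ISO 8601 duration such as 'PT4M13S' to seconds.
    Handles only the H/M/S time components, which suffices for video
    durations; this helper is illustrative, not part of the dataset."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    h, mins, s = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mins * 60 + s

# Hypothetical record using the field names described above.
video = json.loads('''{
  "is_family_friendly": true,
  "genre": "Music",
  "duration": "PT4M13S",
  "statistics": {"views": 1000, "likes": 50, "dislikes": 2},
  "comments": [{"text": "nice", "time": "2019-04-01T10:00:00Z"}]
}''')

print(iso8601_duration_to_seconds(video["duration"]))  # 253
print(len(video["comments"]))  # 1
```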
We collected the following fields for YouTube channels:
channel_description:
Description of the channel.
hidden_subscriber_count:
Total number of hidden subscribers of the channel.
published_at:
Time when the channel was created. The time is specified in ISO 8601 format (YYYY-MM-DDThh:mm:ss.sZ).
video_count:
Total number of videos uploaded to the channel.
subscriber_count:
Total number of subscribers of the channel.
view_count:
The number of times the channel has been viewed.
kind:
The API resource type (e.g., youtube#channel for YouTube channels).
country:
The country the channel is associated with.
comment_count:
Total number of comments the channel has received.
etag:
The ETag of the channel, an HTTP header value used for cache validation.
The historical data is stored in five directories named according to the type of data inside it. Each directory contains JSON files corresponding to the data described above. 'historical_sample.zip' contains a small sample of the historical dataset.
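A simple way to iterate over this layout is to walk each type directory and load every JSON file inside it. The directory name used below is a guess based on the field names above (check the unzipped archive for the actual names); a tiny mock layout is built first so the sketch is runnable:

```python
import json
import os
import tempfile

# Build a tiny mock of the historical layout: one directory per data
# type, each holding one JSON file per entity.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "tweet_details"))
with open(os.path.join(root, "tweet_details", "1.json"), "w") as f:
    json.dump({"id_str": "1", "retweet_count": 3}, f)

# Walk the layout, grouping loaded records by data type.
records = {}
for dirname in sorted(os.listdir(root)):           # one dir per data type
    dirpath = os.path.join(root, dirname)
    for fname in sorted(os.listdir(dirpath)):      # one JSON file per entity
        with open(os.path.join(dirpath, fname)) as f:
            records.setdefault(dirname, []).append(json.load(f))

print(len(records["tweet_details"]))  # 1
```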
- Time-series Data (time_series_anon.zip)
This consists of time-series data (collected every 8 hours) of Twitter users and tweets collected from the blackmarket services between March and June 2019. We collected the following time-series data for retweets and followers on Twitter:
user_timeline
: This is a JSON list of tweet objects in the user’s timeline, which consists of the tweets posted, retweeted and quoted by the user. The file created at each time interval contains only the new tweets posted by the user during that interval.
user_followers
: This is a JSON file containing the user ids of all the followers of a user that were added or removed from the follower list during each time interval.
user_followees
: This is a JSON file consisting of the user ids of all the users followed by a user, i.e., the followees of a user, that were added or removed from the followee list during each time interval.
tweet_details
: This is a JSON object representing a given tweet, collected after every time interval.
tweet_retweets
: This is a JSON list of tweet objects representing the most recent 100 retweets of a given tweet, collected after every time interval.
The time-series data is stored in directories named according to the timestamp of the collection time. Each directory contains sub-directories corresponding to the data described above. 'time_series_sample.zip' contains a small sample of the time series dataset.
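The per-interval added/removed follower lists are, conceptually, set differences between consecutive 8-hour snapshots. The sketch below derives such deltas from two hypothetical snapshot sets to show the idea:

```python
# Two hypothetical follower snapshots of one user, 8 hours apart.
snapshot_t0 = {"u1", "u2", "u3"}
snapshot_t1 = {"u2", "u3", "u4", "u5"}

added = snapshot_t1 - snapshot_t0    # ids that appeared this interval
removed = snapshot_t0 - snapshot_t1  # ids that disappeared this interval

print(sorted(added), sorted(removed))  # ['u4', 'u5'] ['u1']
```

The same delta logic applies to the `user_followees` files, with followee ids in place of follower ids.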
Data Anonymization
The data is anonymized by removing all Personally Identifiable Information (PII) and generating pseudo-IDs corresponding to the original IDs. A consistent mapping between the original and pseudo-IDs is maintained to preserve the integrity of the data.
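One illustrative pseudonymization scheme with the consistent-mapping property described above is to replace each original ID with a sequential pseudo-ID drawn from a single mapping table, so the same original ID always maps to the same pseudo-ID across all files (the authors' actual anonymization procedure may differ):

```python
import itertools

# Single mapping table shared across all files being anonymized.
counter = itertools.count(1)
mapping = {}

def pseudo_id(original_id):
    """Return a stable pseudo-ID for original_id, assigning a new
    sequential one on first sight."""
    if original_id not in mapping:
        mapping[original_id] = f"user_{next(counter):06d}"
    return mapping[original_id]

print(pseudo_id("real_id_A"))  # user_000001
print(pseudo_id("real_id_B"))  # user_000002
print(pseudo_id("real_id_A"))  # user_000001 (consistent mapping)
```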
Files (2.2 GB)

| MD5 checksum | Size |
|---|---|
| md5:5633d41e4d68d1cc799dc1730e873c0f | 385.9 MB |
| md5:43db8336f5d257a4024fd0c1b3bd7fb2 | 4.6 kB |
| md5:cd3790560c3169a2e4efc5206968d027 | 4.6 kB |
| md5:af0e2596fd4e9e748d5cd9f1a8fb6a51 | 1.8 GB |
| md5:91c2c9f611a5bf889967d4b8a315ecad | 21.5 kB |