ABOME: A Multi-platform Data Repository of Artificially Boosted Online Media Entities

Hridoy Sankar Dutta; Udit Arora; Tanmoy Chakraborty

doi:10.5281/zenodo.3609250

Published January 15, 2021 | Version v1

Dataset Restricted

Motivation

The rise of online media has enabled users to adopt various unethical and artificial ways of gaining social growth, boosting their apparent credibility (number of followers, retweets, views, likes, or subscriptions) within a short time period. In this work, we present ABOME, a novel data repository consisting of datasets collected from multiple platforms for the analysis of blackmarket-driven collusive activities, which are prevalent but often go unnoticed in online media. ABOME contains data on tweets and users from Twitter, and on videos and channels from YouTube. We believe ABOME is a unique data repository that can be leveraged to identify and analyze blackmarket-based temporal fraudulent activities in online media, as well as the associated network dynamics.

License

Creative Commons License.

Description of the dataset

- Historical Data

We collected the metadata of each entity present in the historical data.

Twitter:

We collected the following fields for retweets and followers on Twitter:

user_details: A JSON object representing a Twitter user.

tweet_details: A JSON object representing a tweet.

tweet_retweets: A JSON list of tweet objects representing the most recent 100 retweets of a given tweet.

YouTube:

We collected the following fields for YouTube likes and comments:

is_family_friendly: Whether the video is marked as family friendly or not.

genre: Genre of the video.

duration: Duration of the video in ISO 8601 format (duration type). This format is generally used when the duration denotes the amount of intervening time in a time interval.

description: Description of the video.

upload_date: Date that the video was uploaded.

is_paid: Whether the video is paid or not.

is_unlisted: The privacy status of the video, i.e., whether the video is unlisted or not. Here, the flag unlisted indicates that the video can only be accessed by people who have a direct link to it.

statistics: A JSON object containing the number of dislikes, views and likes for the video.

comments: A list of comments for the video. Each element in the list is a JSON object with two fields: text (the comment text) and time (the time when the comment was posted).
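The duration field is easier to analyze once it is converted from an ISO 8601 string into a number of seconds. The following is a minimal sketch, assuming the values use only day and time components (for example, "PT4M13S" for 4 minutes 13 seconds); the function name parse_duration is illustrative and is not part of the dataset or of any YouTube library.

```python
import re
from datetime import timedelta

# Matches day/time durations such as "PT4M13S", "PT1H2M3S" or "P1DT2H".
# Year and month components are deliberately not handled, because their
# length in seconds is not fixed.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(value: str) -> timedelta:
    """Convert an ISO 8601 day/time duration string into a timedelta."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported ISO 8601 duration: {value!r}")
    # Keep only the components that were actually present in the string.
    parts = {
        name: int(number)
        for name, number in match.groupdict().items()
        if number is not None
    }
    if not parts:
        # Strings such as "P" or "PT" carry no components at all.
        raise ValueError(f"empty ISO 8601 duration: {value!r}")
    return timedelta(**parts)
```

Calling parse_duration("PT4M13S").total_seconds() gives the video length in seconds, which can then be compared across videos.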

We collected the following fields for YouTube channels:

channel_description: Description of the channel.

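hidden_subscriber_count: Whether the subscriber count of the channel is hidden from public view. In the YouTube Data API, the corresponding hiddenSubscriberCount property is a boolean flag, not a count.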

published_at: Time when the channel was created. The time is specified in ISO 8601 format (YYYY-MM-DDThh:mm:ss.sZ).

video_count: Total number of videos uploaded to the channel.

subscriber_count: Total number of subscribers of the channel.

view_count: The number of times the channel has been viewed.

kind: The API resource type (e.g., youtube#channel for YouTube channels).

country: The country the channel is associated with.

comment_count: Total number of comments the channel has received.

etag: The ETag of the channel. An ETag is an HTTP header value used for web cache validation.
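The published_at timestamps can be turned into timezone-aware datetime objects for sorting or for computing channel age. The following is a minimal sketch, assuming the values follow the YYYY-MM-DDThh:mm:ss.sZ pattern described above, with the fractional seconds possibly absent; the function name parse_published_at is illustrative.

```python
from datetime import datetime, timezone


def parse_published_at(value: str) -> datetime:
    """Parse a UTC timestamp such as '2015-03-02T10:20:30.0Z'."""
    # Try the form with fractional seconds first, then the form without.
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        # The trailing 'Z' denotes UTC, so attach that timezone explicitly.
        return parsed.replace(tzinfo=timezone.utc)
    raise ValueError(f"unrecognized timestamp: {value!r}")
```

strptime is used here instead of datetime.fromisoformat because older Python versions do not accept the trailing "Z" in fromisoformat.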

The historical data is stored in five directories, each named according to the type of data it contains. Each directory contains JSON files corresponding to the data described above.

- Time-series Data

We collected the following time-series data for retweets and followers on Twitter:

user_timeline: This is a JSON list of tweet objects in the user’s timeline, which consists of the tweets posted, retweeted and quoted by the user. The file created at each time interval contains only the new tweets posted by the user during that interval.

user_followers: This is a JSON file containing the user ids of all the followers of a user that were added or removed from the follower list during each time interval.

user_followees: This is a JSON file consisting of the user ids of all the users followed by a user, i.e., the followees of a user, that were added or removed from the followee list during each time interval.

tweet_details: This is a JSON object representing a given tweet, collected after every time interval.

tweet_retweets: This is a JSON list of tweet objects representing the most recent 100 retweets of a given tweet, collected after every time interval.

The time-series data is stored in directories named according to the timestamp of the collection time. Each directory contains sub-directories corresponding to the data described above.
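Because user_followers and user_followees record only the changes in each interval, a full follower list at a given time has to be rebuilt by replaying those changes in order. The following is a minimal sketch of that replay. It assumes each interval's change set has been loaded into a dictionary with hypothetical "added" and "removed" keys; the actual key names and file layout are not specified in this description and should be checked against the data.

```python
from typing import Dict, Iterable, List, Sequence, Set


def replay_followers(
    initial: Iterable[str],
    deltas: Sequence[Dict[str, Iterable[str]]],
) -> List[Set[str]]:
    """Rebuild the follower set after each interval from per-interval changes.

    `deltas` must be ordered by collection time. The "added" and "removed"
    keys are assumed names, not confirmed field names from the dataset.
    """
    current = set(initial)
    snapshots = []
    for delta in deltas:
        current |= set(delta.get("added", ()))
        current -= set(delta.get("removed", ()))
        # Store a copy so later intervals do not mutate earlier snapshots.
        snapshots.append(set(current))
    return snapshots
```

A sudden jump in the size of consecutive snapshots is the kind of temporal signal such a reconstruction makes visible.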

Data Anonymization

The data is anonymized by removing all Personally Identifiable Information (PII) and generating pseudo-IDs corresponding to the original IDs. A consistent mapping between the original IDs and the pseudo-IDs is maintained to preserve the integrity of the data.
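The description does not say how the pseudo-IDs were generated. As an illustration of what a consistent mapping means, the following sketch uses a keyed hash (HMAC-SHA256), which is one common approach and not necessarily the authors' procedure. The function name pseudo_id, the key, and the 16-character length are all assumptions made for the example.

```python
import hashlib
import hmac


def pseudo_id(original_id: str, secret_key: bytes, length: int = 16) -> str:
    """Map an original ID to a stable pseudo-ID using a keyed hash.

    The same (ID, key) pair always yields the same pseudo-ID, so references
    between records stay intact, while the original ID cannot be recovered
    without the key. Truncating the digest makes collisions possible in
    principle; a longer `length` lowers that risk.
    """
    digest = hmac.new(
        secret_key, original_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return digest[:length]
```

Because the mapping is deterministic, a user who appears both as a tweet author and in another user's follower list receives the same pseudo-ID in both places, so the follower network keeps its structure after anonymization.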

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

	All versions	This version
Views	1,183	221
Downloads	407	8
Data volume	293.1 GB	90.8 GB
