Published May 21, 2020 | Version v1
Conference paper Open

Linked data2

  • 1. ESI
  • 2. HBKU

Description

This dataset is intended for the task of textual user linkage across social networks. It was produced by the ULSN framework (User Linkage in Social Networks, https://github.com/banyous/quora-twitter-scrapping). The dataset is organized into two folders (Twitter/ and Quora/). Each folder contains about 27K files, one per social account (user).

Each user has one file in each folder, with the same 7-digit filename identifier (e.g. Twitter/1004069.txt and Quora/1004069.txt). Each file contains the user's public posts (tweets or answers) on the corresponding platform (Twitter or Quora).
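Because the two folders share filename identifiers, a user's Twitter and Quora files can be matched by name. The following is a minimal sketch, assuming the archive has been extracted to a local directory containing Twitter/ and Quora/; the function name paired_user_files and the root argument are hypothetical and not part of the ULSN framework.

```python
from pathlib import Path


def paired_user_files(root):
    """Yield (user_id, twitter_path, quora_path) for users found in both folders.

    `root` is assumed to be the directory that contains Twitter/ and Quora/.
    """
    root = Path(root)
    # Index each folder by the filename without its .txt extension.
    twitter = {p.stem: p for p in (root / "Twitter").glob("*.txt")}
    quora = {p.stem: p for p in (root / "Quora").glob("*.txt")}
    # Keep only identifiers present on both platforms.
    for user_id in sorted(twitter.keys() & quora.keys()):
        yield user_id, twitter[user_id], quora[user_id]
```

Iterating over paired_user_files("ULSN_Quora_Twitter") would then give one tuple per linked user, where the directory name is whatever the archive was extracted to.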

The structure of each Twitter file is as follows: the first line contains the user's bio text. If the user has no profile bio, the line is empty. The remaining lines contain the user's tweets. Each line represents a single tweet and has 5 space-separated fields: Tweet ID | Tweet Date (YYYY-MM-DD) | Tweet Time (HH:MM:SS) | GMT Zone (+/-H) | Tweet Text.
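The Twitter layout described above can be read with a short parser. This is only a sketch under stated assumptions: the names Tweet and parse_twitter_file are hypothetical, the tweet text is assumed to be everything after the fourth space (since the text itself contains spaces), and lines with fewer than four fields are skipped as malformed.

```python
from typing import NamedTuple


class Tweet(NamedTuple):
    tweet_id: str
    date: str      # YYYY-MM-DD
    time: str      # HH:MM:SS
    gmt_zone: str  # offset as written in the file, kept as a string
    text: str


def parse_twitter_file(lines):
    """Return (bio, tweets) from an iterable of lines of one Twitter file."""
    lines = [line.rstrip("\n") for line in lines]
    # The first line is the bio; it is empty when the user has none.
    bio = lines[0] if lines else ""
    tweets = []
    for line in lines[1:]:
        # Split on the first four spaces only, so the tweet text stays whole.
        parts = line.split(" ", 4)
        if len(parts) < 4:
            continue  # assumption: skip lines that do not match the layout
        parts += [""] * (5 - len(parts))  # a tweet with empty text
        tweets.append(Tweet(*parts))
    return bio, tweets
```

A file can be passed directly as the iterable, for example with open(path, encoding="utf-8") as f: bio, tweets = parse_twitter_file(f). The UTF-8 encoding is an assumption and is not stated in the description.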

The structure of each Quora file is as follows: the first line contains the user's description text. If the user has no profile description, the line is empty. The remaining lines contain the user's answers. Each line represents a single answer and has 3 tab-separated fields: Answer Date (YYYY-MM-DD) | Question ID | Answer Text.
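The Quora layout can be read in the same spirit. The sketch below assumes that the answer text contains no line breaks and that any extra tabs belong to the answer text; the names Answer and parse_quora_file are hypothetical.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    date: str         # YYYY-MM-DD
    question_id: str
    text: str


def parse_quora_file(lines):
    """Return (description, answers) from an iterable of lines of one Quora file."""
    lines = [line.rstrip("\n") for line in lines]
    # The first line is the profile description; it is empty when absent.
    description = lines[0] if lines else ""
    answers = []
    for line in lines[1:]:
        # Split on the first two tabs only; later tabs stay in the answer text.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # assumption: skip lines that do not match the layout
        answers.append(Answer(*parts))
    return description, answers
```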

All 27,049 collected users have at least 10 tweets and 1 answer.

It is worth noting that the dataset can also be used for other user-related research problems, such as user de-anonymization, expert retrieval, social opinion mining, and sentiment analysis on Quora and Twitter.

Files

ULSN_Quora_Twitter.zip

Files (2.4 GB)

md5:dd4fbeeb3a0b2d8f0a129d2a9f655328