Linked data2
Description
This dataset is intended for textual user linkage across social networks. It was produced by the ULSN framework (User Linkage in Social Networks, https://github.com/banyous/quora-twitter-scrapping). The dataset is organized into two folders (Twitter/ and Quora/). Each folder contains about 27K files, one per social account (user).
Each user has one file in each folder, with the same 7-digit filename identifier (e.g. Twitter/1004069.txt and Quora/1004069.txt). Each file contains the user's public posts (tweets or answers) on the corresponding platform (Twitter or Quora).
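Because the two files of a user share the same identifier, linked accounts can be recovered by intersecting the filenames of the two folders. The following is a minimal sketch of that idea; the function name `paired_user_ids` and the `root` argument (the folder that holds Twitter/ and Quora/ after extraction) are assumptions made for illustration, not part of the dataset or the ULSN framework.

```python
from pathlib import Path


def paired_user_ids(root):
    """Return the sorted identifiers that have a file in both folders.

    `root` is assumed to be the directory containing the Twitter/ and
    Quora/ folders after the archive has been extracted.
    """
    root = Path(root)
    # The identifier is the filename without its .txt extension.
    twitter_ids = {p.stem for p in (root / "Twitter").glob("*.txt")}
    quora_ids = {p.stem for p in (root / "Quora").glob("*.txt")}
    return sorted(twitter_ids & quora_ids)
```

For a complete extraction, the returned list would be expected to contain one entry per collected user.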
The structure of each Twitter file is as follows: the first line contains the text of the user's bio section. If the user has no profile bio, the line is empty. The remaining lines contain the user's tweets. Each line represents a single tweet and has 5 space-separated fields: Tweet Id | Tweet Date (YYYY-MM-DD) | Tweet Time (HH:MM:SS) | GMT Zone (+/-H) | Tweet Text.
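The Twitter layout described above could be parsed roughly as follows. This is a sketch under stated assumptions: the tweet text is taken to be everything after the fourth space (since the text itself contains spaces), lines with fewer than five fields are skipped, and the files are assumed to be UTF-8 encoded. The names `Tweet`, `parse_twitter_lines`, and `read_twitter_file` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    tweet_id: str
    date: str      # YYYY-MM-DD
    time: str      # HH:MM:SS
    gmt_zone: str  # e.g. "+1" or "-5"
    text: str


def parse_twitter_lines(lines):
    """Split the lines of a Twitter file into (bio, list of Tweet)."""
    lines = [ln.rstrip("\r\n") for ln in lines]
    if not lines:
        return "", []
    bio = lines[0]  # empty string when the user has no bio
    tweets = []
    for ln in lines[1:]:
        # Split on the first four spaces only; the rest is the tweet text.
        parts = ln.split(" ", 4)
        if len(parts) < 5:
            continue  # assumption: skip lines that do not have 5 fields
        tweets.append(Tweet(*parts))
    return bio, tweets


def read_twitter_file(path):
    # Assumption: UTF-8 encoding; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return parse_twitter_lines(fh)
```

Fields are kept as strings so that no information is lost; converting the date, time, and zone into a timezone-aware `datetime` is left to the caller.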
The structure of each Quora file is as follows: the first line contains the text of the user's description section. If the user has no profile description, the line is empty. The remaining lines contain the user's answers. Each line represents a single answer and has 3 tab-separated fields: Answer Date (YYYY-MM-DD) | Question ID | Answer text.
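A Quora file can be read in a similar way, splitting on tabs instead of spaces. Again, this is only a sketch: it assumes that any tab after the second one belongs to the answer text, that lines with fewer than three fields should be skipped, and that the files are UTF-8 encoded. `Answer`, `parse_quora_lines`, and `read_quora_file` are hypothetical names.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    date: str         # YYYY-MM-DD
    question_id: str
    text: str


def parse_quora_lines(lines):
    """Split the lines of a Quora file into (description, list of Answer)."""
    lines = [ln.rstrip("\r\n") for ln in lines]
    if not lines:
        return "", []
    description = lines[0]  # empty string when there is no description
    answers = []
    for ln in lines[1:]:
        # Split on the first two tabs only; the rest is the answer text.
        parts = ln.split("\t", 2)
        if len(parts) < 3:
            continue  # assumption: skip lines that do not have 3 fields
        answers.append(Answer(*parts))
    return description, answers


def read_quora_file(path):
    # Assumption: UTF-8 encoding; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return parse_quora_lines(fh)
```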
All 27,049 collected users have at least 10 tweets and 1 answer.
The dataset can also be used for other user-related research problems, such as user de-anonymization, expert retrieval, social opinion mining, and sentiment analysis on Quora and Twitter.
Files
Name | Size |
---|---|
ULSN_Quora_Twitter.zip (md5:dd4fbeeb3a0b2d8f0a129d2a9f655328) | 2.4 GB |
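Since the archive is large, it may be worth checking the download against the MD5 checksum listed for the file before extracting it. The snippet below is one possible way to do so with Python's standard `hashlib` module; the helper name `md5_of_file` and the local path used in the comment are assumptions.

```python
import hashlib

# Checksum as listed for ULSN_Quora_Twitter.zip.
EXPECTED_MD5 = "dd4fbeeb3a0b2d8f0a129d2a9f655328"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (assumes the archive was saved in the current directory):
# md5_of_file("ULSN_Quora_Twitter.zip") == EXPECTED_MD5
```

Reading in chunks keeps memory use constant regardless of the archive size.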