Learning Multiview Representations of Twitter Users

Input views used to learn multiview Twitter user embeddings

Twitter's terms of service prevents sharing of large scale Twitter corpora. Instead, we share the
  1000-dimensional PCA vectors produced for each user's tweet and network views. These embeddings can be used in place of the user data to reproduce our methods and to compare new methods against our work.

One row per user, tab-delimited
First field is Twitter user ID
Next 6 fields at indicator features for whether this view contains data for this specific user
The final 6 fields are views, each containing a 1000-dimensional space-delimited vector

Format
The data file contains these fields in tab separated format:

  UserID
  EgoTweets
  MentionTweets
  FriendTweets
  FollowerTweets
  FriendNetwork
  FollowerNetwork

Vector dimensions are sorted in order of decreasing variance, so evaluating
a 50-dimensional PCA vector means just using the first 50 values in each view.

User IDs for user engagement and friend prediction tasks

Each row in a file corresponds to a single hashtag or celebrity.  The first field is the hashtag users posted or celebrity they follow.  All following entries are the user IDs of everyone who engaged.  The first 10 user IDs were used to compute the query embedding (rank all other user IDs by cosine similarity).  Hashtags are split into development and test, as used in the paper.