Published July 3, 2021 | Version v1
Dataset Open

Twitter Dataset


The WE1S Twitter dataset contains 5,024,756 tweets posted to Twitter between December 6th, 2013 and June 30th, 2019. The dataset is divided into subcollections based on the query terms "humanities", "liberal arts", "stem", "science", and "science-es" (a query for the presence of either "science" or "sciences"). Subcollections can be identified from the value of the metapath property. The number of tweets in each subcollection is as follows:

  • humanities: 1,705,038
  • liberal-arts: 7,663
  • stem: 865,156
  • science: 2,089,985
  • science-es: 356,914
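As a quick consistency check, the subcollection counts listed above can be summed and compared with the stated total of 5,024,756 tweets. The sketch below uses only the figures given in this description; the variable names are illustrative.

```python
# Tweet counts per subcollection, copied from the list above.
subcollection_counts = {
    "humanities": 1_705_038,
    "liberal-arts": 7_663,
    "stem": 865_156,
    "science": 2_089_985,
    "science-es": 356_914,
}

total = sum(subcollection_counts.values())
print(f"{total:,}")  # 5,024,756

# Share of the whole dataset contributed by each subcollection.
for name, count in subcollection_counts.items():
    print(f"{name}: {count / total:.1%}")
```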

The tweets are distributed by year as follows:

  • 2013: 16,335
  • 2014: 862,746
  • 2015: 1,711,823
  • 2016: 947,561
  • 2017: 976,971
  • 2018: 324,133
  • 2019: 185,187

Collectively, the tweets represent the work of 1,886,739 distinct usernames.

Each tweet's mentions, hashtags, and links are recorded, as well as the number of likes and retweets. Unlike most other WE1S datasets, the Twitter dataset does not contain extracted features. Instead, it contains the original text of the tweet (the value of the content property), along with a tidy_tweet property, which contains the text of the tweet after preprocessing. Tweets were preprocessed using a modified form of the WE1S preprocessing algorithm. Details can be found in the WE1S Tweet-Suite repository.

(See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")


The data has been archived in JSONL format (one JSON document per line, delimited by line breaks).
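Because each line is a separate JSON document, the archive can be streamed line by line without loading all 3.7 GB into memory. The following is a minimal sketch, not an official WE1S tool: it relies only on the metapath property named in this description and makes no assumption about the format of its values. The function names, the "<missing>" placeholder, and the sample lines are hypothetical and serve only as illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON document per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_metapath(lines: Iterable[str]) -> Counter:
    """Count tweets per distinct value of the metapath property."""
    return Counter(
        tweet.get("metapath", "<missing>") for tweet in iter_tweets(lines)
    )


def count_file(path: str) -> Counter:
    """Stream a JSONL file from disk and count tweets per metapath value."""
    with open(path, encoding="utf-8") as handle:
        return count_by_metapath(handle)


# Made-up sample lines standing in for the real archive; the metapath
# values here are placeholders, not the dataset's actual values.
sample = [
    '{"metapath": "example-a", "content": "Raw text", "tidy_tweet": "raw text"}',
    '{"metapath": "example-a", "content": "More text", "tidy_tweet": "more text"}',
    '{"metapath": "example-b", "content": "Other", "tidy_tweet": "other"}',
]
for metapath, n in count_by_metapath(sample).most_common():
    print(metapath, n)
```

To run the same count against the downloaded archive, pass its local path to count_file. The resulting counts can then be compared with the subcollection totals listed in this description.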


Files (3.7 GB)