twitter Dataset
Creators
Description
The WE1S twitter dataset contains 5,024,756 tweets posted to Twitter between December 6th, 2013 and June 30th, 2019. The dataset is divided into subcollections based on the query terms "humanities", "liberal arts", "stem", "science", and "science-es" (a query matching either "science" or "sciences"). Subcollections can be identified from the value of each tweet's metapath property. The number of tweets in each subcollection is as follows:
- humanities: 1,705,038
- liberal-arts: 7,663
- stem: 865,156
- science: 2,089,985
- science-es: 356,914
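As a rough illustration of grouping tweets by subcollection, the sketch below tallies records by the label found in their metapath value. It rests on assumptions not stated in this description: that each tweet is available as a Python dict, and that the subcollection label appears as one whole segment of metapath, with segments separated by "/" or ",". The function names and sample records are hypothetical.

```python
import re
from collections import Counter

# Subcollection labels as listed in the description above.
SUBCOLLECTIONS = {"humanities", "liberal-arts", "stem", "science", "science-es"}


def subcollection_of(metapath):
    """Return the subcollection label found in a metapath value, or None.

    Assumption: the label is one whole segment of the metapath, with
    segments separated by "/" or ",". Adjust the split pattern if the
    real values are laid out differently.
    """
    for segment in re.split(r"[/,]", metapath):
        label = segment.strip().lower()
        if label in SUBCOLLECTIONS:
            return label
    return None


def count_by_subcollection(tweets):
    """Tally tweets (dicts with a "metapath" key) per subcollection."""
    return Counter(subcollection_of(t.get("metapath", "")) for t in tweets)


# Hypothetical records, for illustration only.
sample = [
    {"metapath": "Corpus,twitter,humanities"},
    {"metapath": "Corpus,twitter,science-es"},
    {"metapath": "Corpus,twitter,science"},
    {"metapath": "Corpus,twitter,humanities"},
]
counts = count_by_subcollection(sample)
```

Matching whole segments rather than substrings keeps "science" from also matching "science-es".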
The tweets are distributed by year as follows:
- 2013: 16,335
- 2014: 862,746
- 2015: 1,711,823
- 2016: 947,561
- 2017: 976,971
- 2018: 324,133
- 2019: 185,187
Collectively, the tweets represent the work of 1,886,739 distinct usernames.
Each tweet's mentions, hashtags, and links are recorded, as well as the number of likes and retweets. Unlike most other WE1S datasets, the Twitter dataset does not contain extracted features. Instead, it contains the original text of each tweet (the value of the content property), along with a tidy_tweet property, which contains the text of the tweet after preprocessing. Tweets were preprocessed using a modified form of the WE1S preprocessing algorithm. Details can be found in the WE1S Tweet-Suite repository.
(See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")
Notes
Files
Size | Checksum |
---|---|
3.7 GB | md5:49800c9777b0aeb0581c2ba7da98feba |
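Because the download is several gigabytes, it can be worth checking the file against the md5 value listed above before unpacking it. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and the local file path is whatever name the download was saved under (it is not given here).

```python
import hashlib

# md5 value from the file listing above.
EXPECTED_MD5 = "49800c9777b0aeb0581c2ba7da98feba"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks
    so that a multi-gigabyte download is never held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file's md5 equals the published value."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling matches_published_checksum with the path of the downloaded file returns False if the file is incomplete or corrupted, in which case it should be downloaded again.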