Dataset Open Access
Erdal Baran; Dimitar Dimitrov
TweetsCOV19 is a semantically annotated corpus of tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims to capture online discourse about various aspects of the pandemic and its societal impact. Tweet metadata as well as extracted entities, sentiments, hashtags, user mentions, and resolved URLs are exposed in RDF using established RDF/S vocabularies*.
We also provide a tab-separated values (TSV) version of the dataset. Each line contains the features of one tweet instance, separated by a tab character ("\t"). The following list indicates the feature indices:
This dataset consists of 8,151,524 tweets in total, posted by 3,664,518 users, and reflects the societal discourse about COVID-19 on Twitter from October 2019 to April 2020.
To extract the dataset from TweetsKB, we compiled a seed list of 268 COVID-19-related keywords.
* For the sake of privacy, we anonymize user IDs and we do not provide the text of the tweets.
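As an illustration of the TSV layout described above (one tweet per line, features separated by "\t"), the following minimal Python sketch streams the compressed file and splits each line into its features. The function name `iter_tweet_rows` is hypothetical, and the UTF-8 encoding is an assumption, not something this record states. The meaning of each column index is not defined by this sketch.

```python
import gzip


def iter_tweet_rows(path):
    """Yield each line of a gzip-compressed TSV file as a list of string features.

    Assumption: the file is UTF-8 encoded text with one tweet instance per line.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            # Drop only the trailing newline so that empty trailing features survive.
            yield line.rstrip("\n").split("\t")


# Possible usage, assuming the archive has been downloaded to the working directory:
# for features in iter_tweet_rows("TweetsCOV19.tsv.gz"):
#     print(len(features), features[:3])
#     break
```

Reading line by line keeps memory use flat, which matters for a file with several million rows.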
Name | MD5 | Size |
---|---|---|
TweetsCOV19.n3.gz | ac6bc25b9e4f7d285e6907a47c45d9e4 | 1.7 GB |
TweetsCOV19.tsv.gz | 224c76b9f31696b4514ef1d624693ddd | 820.4 MB |
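The MD5 checksums listed for the two archives can be used to confirm that a download completed without corruption. The sketch below is one way to do that in Python. The helper names `md5_of_file` and `verify_download` and the `EXPECTED_MD5` mapping are hypothetical, and the local file paths are assumptions.

```python
import hashlib

# Checksums as listed for the two files in this record.
EXPECTED_MD5 = {
    "TweetsCOV19.n3.gz": "ac6bc25b9e4f7d285e6907a47c45d9e4",
    "TweetsCOV19.tsv.gz": "224c76b9f31696b4514ef1d624693ddd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, file_name):
    """Return True if the file at `path` matches the listed checksum for `file_name`."""
    return md5_of_file(path) == EXPECTED_MD5[file_name]


# Possible usage, assuming the archive sits in the working directory:
# print(verify_download("TweetsCOV19.tsv.gz", "TweetsCOV19.tsv.gz"))
```

Chunked reading avoids loading a multi-gigabyte archive into memory at once.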
Statistic | All versions | This version |
---|---|---|
Views | 3,156 | 3,156 |
Downloads | 3,464 | 3,464 |
Data volume | 4.1 TB | 4.1 TB |
Unique views | 2,804 | 2,804 |
Unique downloads | 1,400 | 1,400 |