6815967
doi
10.5281/zenodo.6815967
oai:zenodo.org:6815967
user-covid-19
user-biohackathon
Tekumalla, Ramya
Georgia State University
Wang, Guanyu
University of Missouri
Yu, Jingyuan
Universitat Autònoma de Barcelona
Liu, Tuo
Carl von Ossietzky Universität Oldenburg
Ding, Yuning
Universität Duisburg-Essen
Artemova, Katya
NRU HSE
Tutubalina, Elena
KFU
Chowell, Gerardo
Georgia State University
A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration
Banda, Juan M.
Georgia State University
url:http://www.panacealab.org/covid19/
url:https://arxiv.org/abs/2004.03688
info:eu-repo/semantics/openAccess
Other (Public Domain)
social media
twitter
nlp
covid-19
covid19
<p><em><strong>Version 122 of the dataset. MAJOR CHANGE NOTE: The dataset files full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split into 1 GB parts using the Linux utility split, so make sure to join the parts before unzipping. We had to make this change because we had major issues uploading files larger than 2 GB (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia, an MDPI journal, and can be accessed here: <a href="https://doi.org/10.3390/epidemiologia2030024">https://doi.org/10.3390/epidemiologia2030024</a>. Please cite this when using the dataset.</strong></em></p>
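<p>The join-then-unzip step described above can be sketched in Python for users without the Linux tools at hand. This is a minimal sketch, not the authors' procedure: the function names (md5_of, join_parts, gunzip) are hypothetical, and it assumes the parts keep the .part-aa, .part-ab, ... suffixes seen in the file list, so that sorting the names restores the original order. The optional checksum mapping is meant to be filled from the md5 values published with each file.</p>

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def join_parts(directory, base_name, expected_md5=None):
    """Concatenate base_name.part-aa, .part-ab, ... into base_name.

    expected_md5 is an optional {part file name: hex digest} mapping
    (an assumption of this sketch); any mismatch raises ValueError
    before the joined file is written.
    """
    directory = Path(directory)
    # Assumption: suffixes (aa, ab, ...) sort lexicographically in split order.
    parts = sorted(directory.glob(base_name + ".part-*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {base_name}")
    for part in parts:
        wanted = (expected_md5 or {}).get(part.name)
        if wanted is not None and md5_of(part) != wanted:
            raise ValueError(f"checksum mismatch for {part.name}")
    joined = directory / base_name
    with open(joined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return joined


def gunzip(path):
    """Decompress a .gz file next to itself and return the new path."""
    path = Path(path)
    target = path.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(path, "rb") as src, open(target, "wb") as out:
        shutil.copyfileobj(src, out)
    return target
```

<p>For example, gunzip(join_parts("downloads", "full_dataset_clean.tsv.gz")) would leave full_dataset_clean.tsv in a (hypothetical) downloads directory. Both steps stream the data, so the files never need to fit in memory.</p>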
<p><strong>Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 <em>we</em> have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.</strong></p>
<p><strong>The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (1,346,993,992 unique tweets), and a cleaned version with no retweets in the full_dataset_clean.tsv file (348,898,436 unique tweets). There are several practical reasons for keeping the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset_clean-statistics.tsv files. For more statistics and some visualizations visit: <a href="http://www.panacealab.org/covid19/">http://www.panacealab.org/covid19/</a> </strong></p>
<p><strong>More details can be found (and will be updated faster) at <a href="https://github.com/thepanacealab/covid19_twitter">https://github.com/thepanacealab/covid19_twitter</a> and in our pre-print about the dataset (<a href="https://arxiv.org/abs/2004.03688">https://arxiv.org/abs/2004.03688</a>).</strong></p>
<p><strong>As always, the tweets distributed here are only tweet identifiers (with date and time added), in accordance with Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. The identifiers need to be hydrated before they can be used.</strong></p>
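<p>Before hydration, the identifiers have to be pulled out of the TSV file and grouped into requests. The following is a minimal sketch of that preparation step only; it does not call any Twitter API or hydration tool. It assumes (this is not stated in the record) that the file has a header row and that the tweet identifier is the first tab-separated column. The helper names and the batch size of 100 are hypothetical; set the size to whatever your hydration tool accepts.</p>

```python
import csv
from itertools import islice


def iter_tweet_ids(lines, id_column=0, has_header=True):
    """Yield tweet identifiers from an iterable of tab-separated lines.

    Assumption: the identifier sits in column `id_column` and the first
    line is a header; adjust both to match the actual file layout.
    """
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if present
    for row in reader:
        # Skip blank lines and rows too short to hold the id column.
        if len(row) > id_column and row[id_column].strip():
            yield row[id_column].strip()


def batched(ids, size=100):
    """Group an iterable of identifiers into lists of at most `size`."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

<p>Passing an open file handle for the decompressed TSV to iter_tweet_ids, and the result to batched, yields one list of identifiers per hydration request. Both helpers are generators, so the file is streamed rather than loaded whole.</p>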
This dataset will be updated at least bi-weekly with additional tweets; see the GitHub repo for these updates.
Release: We have standardized the name of the resource to match our pre-print manuscript and to avoid having to update it every week.
Zenodo
2022-07-10
info:eu-repo/semantics/other
3723939
user-covid-19
user-biohackathon
122
1681733721.314825
1073741824
md5:725e4bd57e45cf42525224ff5fe642e3
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ai
1073741824
md5:0b52698f94ad1aae76c8085a731b4404
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ad
1073741824
md5:222e3a9b3758c119db0a0e75c8f7d582
https://zenodo.org/records/6815967/files/full_dataset_clean.tsv.gz.part-ab
1073741824
md5:b6bd17bf6d19a6231a18dc85f0720ef6
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ae
1073741824
md5:c78e59fc82f504717bc4613700fe39c7
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-aa
17024
md5:aa8d83d4799b42cac82b30a1913d2d79
https://zenodo.org/records/6815967/files/full_dataset-statistics.tsv
16418
md5:ae1815a62d7fd0215a107a05286d4075
https://zenodo.org/records/6815967/files/full_dataset_clean-statistics.tsv
24998
md5:68fc3bb17d9d07a88c307509285f92e3
https://zenodo.org/records/6815967/files/frequent_trigrams.csv
1073741824
md5:72107f23843969a96c1291cc14e57c23
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-af
126732931
md5:55df0496c95c6fcc5646cb63233beea7
https://zenodo.org/records/6815967/files/full_dataset_clean.tsv.gz.part-ad
1073741824
md5:3876d43286e718e86e47fb86ebc9f647
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ag
11378
md5:57b90993765970058b4647214b723748
https://zenodo.org/records/6815967/files/frequent_terms.csv
1073741824
md5:a570c8240acbad68ed295d8253bd1400
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ah
1073741824
md5:fd1ff1613d8788b28b089e056537260e
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ab
1073741824
md5:84691f89d0a9b748b5e7d02e3362992b
https://zenodo.org/records/6815967/files/full_dataset_clean.tsv.gz.part-aa
1073741824
md5:b67f3a9ab518fb275b66eac4abc2e29c
https://zenodo.org/records/6815967/files/full_dataset_clean.tsv.gz.part-ac
320670943
md5:4fb70d66e317d722c242e79d5fba1f8b
https://zenodo.org/records/6815967/files/mentions.zip
17919
md5:648b19fd57327f61e22787dbb5e6ae6e
https://zenodo.org/records/6815967/files/frequent_bigrams.csv
192811774
md5:e2834083b91c6934a169465591bf7f1c
https://zenodo.org/records/6815967/files/hashtags.zip
824415758
md5:0a5efbd25bdf15651fe8578efc1c399e
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ak
1073741824
md5:d81418788e8ff43d193cb57c17d2d968
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-ac
12972102
md5:beb48ae2466026778eb97ba16384f72f
https://zenodo.org/records/6815967/files/emojis.zip
1073741824
md5:9b394695b196566fcdd13f14ab2fd3f0
https://zenodo.org/records/6815967/files/full_dataset.tsv.gz.part-aj
public
http://www.panacealab.org/covid19/
Is continued by
url
https://arxiv.org/abs/2004.03688
Is supplement to
url
10.5281/zenodo.3723939
isVersionOf
doi
Epidemiologia
2
3
315-324
2022-07-10