We are releasing a Twitter dataset connected to our project Digital Narratives of Covid-19 (DHCOVID), which, among other goals, aims to explore over one year (May 2020-2021) the narratives behind data about the coronavirus pandemic.
In this first version, we deliver a Twitter dataset organized as follows:
- Each folder corresponds to daily data (one folder for each day): YEAR-MONTH-DAY
- Every folder contains 9 different plain text files whose names start with "dhcovid", followed by the date (YEAR-MONTH-DAY), the language ("en" for English, "es" for Spanish), and a region abbreviation ("fl", "ar", "mx", "co", "pe", "ec", "es"):
- dhcovid_YEAR-MONTH-DAY_es_fl.txt: Dataset containing tweets in Spanish geolocated in South Florida. Geolocation is determined from tweet coordinates, place, or user information.
- dhcovid_YEAR-MONTH-DAY_en_fl.txt: For English, we gather only tweets that refer to the Miami and South Florida area. Multiple projects are already harvesting English data, and our project is particularly interested in this area because of our home institution (University of Miami) and because we aim to study public conversations from a bilingual (EN/ES) point of view.
- dhcovid_YEAR-MONTH-DAY_es_ar.txt: Dataset containing tweets from Argentina.
- dhcovid_YEAR-MONTH-DAY_es_mx.txt: Dataset containing tweets from Mexico.
- dhcovid_YEAR-MONTH-DAY_es_co.txt: Dataset containing tweets from Colombia.
- dhcovid_YEAR-MONTH-DAY_es_pe.txt: Dataset containing tweets from Perú.
- dhcovid_YEAR-MONTH-DAY_es_ec.txt: Dataset containing tweets from Ecuador.
- dhcovid_YEAR-MONTH-DAY_es_es.txt: Dataset containing tweets from Spain.
- dhcovid_YEAR-MONTH-DAY_es.txt: Dataset containing all tweets in Spanish, regardless of their geolocation.
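The naming scheme above can be expanded programmatically, for example to check that a daily folder is complete. The following is a minimal Python sketch, not part of the project's own tooling: the function name `expected_files` and the `root` parameter are hypothetical, and it assumes dates are zero-padded as in 2020-04-24.

```python
from datetime import date
from pathlib import Path

# Region codes listed in the description; each has a Spanish ("es") file.
SPANISH_REGIONS = ["fl", "ar", "mx", "co", "pe", "ec", "es"]


def expected_files(day: date, root: Path = Path(".")) -> list[Path]:
    """Return the nine file paths expected for one daily folder.

    Hypothetical helper: `root` is wherever the dataset was downloaded.
    """
    stamp = day.isoformat()  # YEAR-MONTH-DAY, zero-padded (assumed)
    folder = root / stamp
    names = [f"dhcovid_{stamp}_es_{region}.txt" for region in SPANISH_REGIONS]
    names.append(f"dhcovid_{stamp}_en_fl.txt")  # English, South Florida only
    names.append(f"dhcovid_{stamp}_es.txt")     # all Spanish tweets
    return [folder / name for name in names]
```

Calling `expected_files(date(2020, 5, 1))` yields nine paths under the folder 2020-05-01, which can then be tested with `Path.exists()`.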
For English, we collect all tweets with the following keywords and hashtags: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing. For Spanish, we search for: covid, coronavirus, pandemia, quarentena, confinamiento, quedateencasa, desescalada, distanciamiento social.
The corpus consists of a list of tweet IDs. To obtain the original tweets, you can use a Twitter hydrator tool, which takes the IDs and downloads all the metadata for you as a CSV file.
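Before hydrating, it can help to load the IDs and split them into batches. The sketch below assumes each file holds one tweet ID per line, which this description does not state explicitly. The helper names `read_tweet_ids` and `batches` and the default batch size of 100 are illustrative choices, not requirements of any particular tool. The hydration step itself is left to whichever tool you use.

```python
from pathlib import Path
from typing import Iterator


def read_tweet_ids(path: Path) -> list[str]:
    """Read tweet IDs, assuming one ID per line; blank lines are skipped."""
    with path.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def batches(ids: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

IDs are kept as strings so that they are passed on unchanged, whatever their length.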
We started collecting this Twitter dataset on April 24th, 2020, and we are adding daily data to our GitHub repository. There is a known problem with the file 2020-04-24/dhcovid_2020-04-24_es.txt: we could not gather its data for technical reasons.
For more information about our project, visit https://covid.dh.miami.edu/
For the most up-to-date datasets and detailed criteria, check our GitHub repository: https://github.com/dh-miami/narratives_covid19/