MAVIS Twitter dataset: A collection of tweets and sentiment analysis in Spanish about vaccines and diseases during the period 2015-2018
Creators
- 1. Universidad Politécnica de Madrid
- 2. Global Medical and Scientific Affairs, MSD España
- 3. Hospital Vithas Xanit Internacional
- 4. HM Nens
- 5. Universidad Rey Juan Carlos
Description
The MAVIS dataset is a knowledge base of Twitter messages published in Spanish between 2015 and 2018, compiled for sentiment analysis of specific vaccines and their related diseases. The diseases and vaccines covered are:
- Invasive meningococcal disease (“EMI” in Spanish): Bexsero, Trumenba, Nimenrix
- Invasive pneumococcal disease (“ENI” in Spanish)
- Influenza
- Hepatitis
- Rotavirus: Rotarix, Rotateq
- Measles (“Sarampión” in Spanish) and MMR (“Triple vírica” in Spanish)
- Sepsis
- Whooping cough (“Tosferina” in Spanish)
- Chickenpox (“Varicela” in Spanish): Varivax, Varilrix; and Shingles (“Zoster” in Spanish)
- Human papillomavirus infection (“VPH” in Spanish): Cervarix, Gardasil
Tweets have been manually classified as having a negative or non-negative sentiment by 5 experts. Moreover, an automatic classification has been performed by 3 different tools: IBM Watson (now Watson Tone Analyzer, https://www.ibm.com/watson/services/tone-analyzer/), Google Cloud Natural Language (https://cloud.google.com/natural-language), and Meaning Cloud (https://www.meaningcloud.com/). IBM Watson and Google Cloud Natural Language returned a numerical sentiment score ranging from -1 to 1, while Meaning Cloud returned a categorical variable with the values ‘P+’, ‘P’, ‘NEU’, ‘N’ and ‘N+’, which were converted to 1, 2, 3, 4 and 5 respectively.
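As an illustration of the encoding described above, the following minimal sketch maps the Meaning Cloud labels to the integers 1 to 5 and checks that the two numerical scores fall within [-1, 1]. The function names and the order of the values in the returned tuple are hypothetical and do not correspond to column names or code shipped with the dataset.

```python
# Mapping stated in the description: Meaning Cloud labels -> integers 1..5.
MEANING_CLOUD_CODES = {"P+": 1, "P": 2, "NEU": 3, "N": 4, "N+": 5}


def encode_meaning_cloud(label):
    """Return the integer code for one of the five Meaning Cloud labels."""
    try:
        return MEANING_CLOUD_CODES[label]
    except KeyError:
        # Any label outside the five listed in the description is rejected.
        raise ValueError("unexpected Meaning Cloud label: %r" % (label,)) from None


def check_score(score):
    """Validate that a numerical sentiment score lies in [-1, 1]."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("sentiment score out of range: %r" % (score,))
    return score


def build_features(watson_score, google_score, meaning_cloud_label):
    """Assemble one tweet's three tool annotations (hypothetical layout)."""
    return (
        check_score(watson_score),
        check_score(google_score),
        encode_meaning_cloud(meaning_cloud_label),
    )
```

For example, `build_features(0.5, -0.25, "NEU")` yields `(0.5, -0.25, 3)`. Note that under this encoding a lower integer means a more positive label, whereas for the two numerical scores a higher value means a more positive sentiment.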
Using the IBM Watson, Google Cloud Natural Language, and Meaning Cloud annotations as input variables and the experts’ classification as the target label, a machine learning metamodel was developed. Tweets were also annotated with the sentiment output given by this classifier.
The data provided include tweet-level information, information about the users who posted the tweets, the keywords mentioned in each tweet, and the annotations that the experts, the tools, and the model assigned to each tweet.
Funding: This dataset was obtained with funding from MSD, Spain, under the MAVIS Study (VEAP ID: 7789).
Studies using this dataset at the time of publication:
- Rodríguez-González et al., “Creating a metamodel based on machine learning to identify the sentiment of vaccine and disease-related messages in Twitter: the MAVIS study” in 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Jul. 2020, p. 6. DOI: 10.1109/CBMS49503.2020.00053
- Rodríguez-González et al., "Identifying Polarity in Tweets from an Imbalanced Dataset about Diseases and Vaccines Using a Meta-Model Based on Machine Learning Techniques" in Applied Sciences, 2020, 10. DOI: 10.3390/app10249019
Files
(623.6 MB)
MD5 checksum | Size |
---|---|
md5:95306ffd9cf321db4deed9f6bd10ceac | 21.9 kB |
md5:c93d1f19cd65a25839028f4f95874f34 | 42.0 kB |
md5:bb046b39676fc61fa397455036bfc1c2 | 487.8 MB |
md5:6f5550e1dd4173704f72c80e94979164 | 32.7 MB |
md5:a04bcd1b02ec499ad475b605e8d00208 | 15.3 MB |
md5:7be7fd0c95e5ff8fd02c2996e5d4fb4d | 87.7 MB |
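The MD5 values listed for the files can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the path passed in is whatever local name the downloaded file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:95306ffd...'."""
    expected = listed[4:] if listed.startswith("md5:") else listed
    return md5_of_file(path) == expected.strip().lower()
```

A call such as `matches_listed_checksum(local_path, "md5:bb046b39676fc61fa397455036bfc1c2")` returns `True` only if the local file's digest equals the listed one.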