Published December 17, 2020 | Version v1
Dataset Open

MAVIS Twitter dataset: A collection of tweets and sentiment analysis in Spanish about vaccines and diseases during the period 2015-2018

  • 1. Universidad Politécnica de Madrid
  • 2. Global Medical and Scientific Affairs, MSD España
  • 3. Hospital Vithas Xanit Internacional
  • 4. HM Nens
  • 5. Universidad Rey Juan Carlos

Description

The MAVIS dataset is a complete knowledge base of Twitter messages published in Spanish between 2015 and 2018, compiled for sentiment analysis of specific vaccines and their related diseases. The diseases and vaccines covered are:

  • Invasive meningococcal disease (“EMI” in Spanish): Bexsero, Trumenba, Nimenrix
  • Invasive pneumococcal disease (“ENI” in Spanish)
  • Influenza
  • Hepatitis
  • Rotavirus: Rotarix, Rotateq
  • Measles (“Sarampión” in Spanish) and MMR (“Triple vírica” in Spanish)
  • Sepsis
  • Whooping cough (“Tosferina” in Spanish)
  • Chickenpox (“Varicela” in Spanish): Varivax, Varilrix; and Shingles (“Zoster” in Spanish)
  • Human papillomavirus infection (“VPH” in Spanish): Cervarix, Gardasil

Tweets have been manually classified as having a negative or non-negative sentiment by 5 experts. Moreover, an automatic classification has been performed by 3 different tools: IBM Watson (now Watson Tone Analyzer, https://www.ibm.com/watson/services/tone-analyzer/), Google Cloud Natural Language (https://cloud.google.com/natural-language), and Meaning Cloud (https://www.meaningcloud.com/). IBM Watson and Google Cloud Natural Language returned a numerical sentiment score ranging from -1 to 1, while Meaning Cloud returned a categorical variable with the values ‘P+’, ‘P’, ‘NEU’, ‘N’ and ‘N+’, which were converted to 1, 2, 3, 4 and 5 respectively.
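The conversion of the Meaning Cloud categories described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `encode_meaningcloud` and `feature_row` are hypothetical, and treating unknown labels as `None` is an assumption, since the description does not say how other values were handled.

```python
# Mapping stated in the dataset description: 'P+' -> 1, ..., 'N+' -> 5.
# Note that the code increases as polarity becomes more negative.
MEANINGCLOUD_TO_INT = {"P+": 1, "P": 2, "NEU": 3, "N": 4, "N+": 5}


def encode_meaningcloud(label):
    """Return the integer code for a Meaning Cloud label, or None if unknown.

    Returning None for labels outside the five listed values is an
    assumption made for this sketch.
    """
    if not isinstance(label, str):
        return None
    return MEANINGCLOUD_TO_INT.get(label.strip().upper())


def feature_row(watson_score, google_score, meaningcloud_label):
    """Bundle the three tool outputs for one tweet into a tuple (hypothetical helper).

    The two numeric scores are expected in [-1, 1], as the description states.
    """
    for name, score in (("watson", watson_score), ("google", google_score)):
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [-1, 1]")
    return (watson_score, google_score, encode_meaningcloud(meaningcloud_label))


if __name__ == "__main__":
    # Illustrative values only; these are not taken from the dataset.
    print(feature_row(0.5, -0.2, "NEU"))
```

Keeping the mapping in a single dictionary makes the ordinal direction explicit: the two numeric scores and the Meaning Cloud code are on different scales, so any model that combines them has to account for that.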

With these variables (the IBM Watson, Google Cloud Natural Language, and Meaning Cloud annotations as inputs, and the experts’ classification as the target label), a machine learning metamodel was developed. Tweets were also annotated with the sentiment predicted by this classifier.

The provided data include intrinsic tweet information, intrinsic information about the users who posted the tweets, the keywords mentioned in each tweet, and the annotations that the experts, the tools, and the model gave to each tweet.

Funding: This dataset was obtained with funding from MSD, Spain under the MAVIS Study (VEAP ID: 7789).

Studies using this dataset at the time of publication:

  • Rodríguez-González et al., “Creating a metamodel based on machine learning to identify the sentiment of vaccine and disease-related messages in Twitter: the MAVIS study” in 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Jul. 2020, p. 6. DOI: 10.1109/CBMS49503.2020.00053
  • Rodríguez-González et al., “Identifying Polarity in Tweets from an Imbalanced Dataset about Diseases and Vaccines Using a Meta-Model Based on Machine Learning Techniques” in Applied Sciences, 2020, 10. DOI: 10.3390/app10249019

Files (623.6 MB)

  • md5:95306ffd9cf321db4deed9f6bd10ceac (21.9 kB)
  • md5:c93d1f19cd65a25839028f4f95874f34 (42.0 kB)
  • md5:bb046b39676fc61fa397455036bfc1c2 (487.8 MB)
  • md5:6f5550e1dd4173704f72c80e94979164 (32.7 MB)
  • md5:a04bcd1b02ec499ad475b605e8d00208 (15.3 MB)
  • md5:7be7fd0c95e5ff8fd02c2996e5d4fb4d (87.7 MB)
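
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the file path in the usage comment are hypothetical, since the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading avoids loading a file of several hundred MB into memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical local path; substitute the file you downloaded):
# matches_listing("downloaded_file", "md5:bb046b39676fc61fa397455036bfc1c2")
```

A mismatch usually indicates an incomplete or corrupted download, in which case downloading the file again is the simplest remedy.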