Published September 19, 2020 | Version v1
Dataset Open

CMU-MisCov19: A Novel Twitter Dataset for Characterizing COVID-19 Misinformation

  • 1. Carnegie Mellon University

Description

From conspiracy theories to fake cures and fake treatments, COVID-19 has become a hotbed for the spread of misinformation online. It is more important than ever to identify methods to debunk and correct false information online. Detecting and characterizing misinformation requires annotated datasets. Most published COVID-19 Twitter datasets are generic, lack annotations or labels, rely on automated annotation through transfer learning or semi-supervised methods, or are not specifically designed for misinformation. The annotated datasets that do exist focus only on "fake news", are small, or cover a narrow range of classes.

Here, we present a novel Twitter misinformation dataset called "CMU-MisCov19" with 4573 annotated tweets over 17 themes around the COVID-19 discourse. We also present our annotation codebook for the different COVID-19 themes on Twitter, along with their descriptions and examples, for the community to use for collecting further annotations. Further details about the dataset, and our analysis based on it, can be found at https://arxiv.org/abs/2008.00791. In adherence to Twitter's terms and conditions, we do not provide the full tweet JSONs; instead, we provide a ".csv" file with the tweet IDs so that the tweets can be rehydrated. We also provide the annotations and the creation date of each tweet so that the results of our analyses can be reproduced.
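Because only tweet IDs are shared, a typical first step before rehydration is to read the IDs from the CSV and group them into batches. The following is a minimal sketch, not part of the released dataset: the column name `status_id` and the helper names are hypothetical (check the actual CSV header), and the batch size of 100 is an assumption based on the per-request limit of Twitter's tweet lookup endpoints, so adjust it for whichever hydration tool you use.

```python
import csv
import io
from typing import Iterator, List

# Hypothetical column name: inspect the header of the released CSV and adjust.
ID_COLUMN = "status_id"


def read_tweet_ids(csv_text: str, id_column: str = ID_COLUMN) -> List[str]:
    """Return the non-empty tweet IDs found in one column of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ids = []
    for row in reader:
        value = (row.get(id_column) or "").strip()
        if value:
            ids.append(value)
    return ids


def chunk_ids(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs (assumed lookup limit)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Tiny made-up sample; the real file has its own columns and IDs.
    sample = "status_id,annotation\n111,example\n222,example\n"
    for batch in chunk_ids(read_tweet_ids(sample), size=100):
        print(",".join(batch))
```

The IDs are kept as strings on purpose: tweet IDs are 64-bit integers, and parsing them as floating-point numbers (as some spreadsheet tools do) can silently change the last digits.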

Note: If, for any reason, you are unable to rehydrate all the tweets, reach out to Shahan Ali Memon (shahan@nyu.edu).

If you use this data, please cite our paper as follows: 

"Shahan Ali Memon and Kathleen M. Carley. Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset, In Proceedings of The 5th International Workshop on Mining Actionable Insights from Social Networks (MAISoN 2020), co-located with CIKM, virtual event due to COVID-19, 2020."

Notes

If you use this dataset, please cite our recently accepted paper, "Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset", at the MAISoN workshop at CIKM 2020 as follows: "Shahan Ali Memon and Kathleen M. Carley. Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset, In Proceedings of The 5th International Workshop on Mining Actionable Insights from Social Networks (MAISoN 2020), co-located with CIKM, virtual event due to COVID-19, 2020." The preprint version of the paper can be found at https://arxiv.org/abs/2008.00791.

Files

CMU_MisCov19_dataset.csv.zip

Files (174.8 kB)

Checksum | Size
md5:1b04c07ba6c63051557c84039327aa61 | 73.4 kB
md5:ed9eb02d8e9c74d1bd09411c383b6596 | 101.4 kB
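
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and because the listing does not say which checksum belongs to which file, the check only tests whether a digest matches either listed value.

```python
import hashlib

# MD5 values copied from the file listing on this page.
LISTED_MD5 = {
    "1b04c07ba6c63051557c84039327aa61",
    "ed9eb02d8e9c74d1bd09411c383b6596",
}


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of an in-memory byte string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(digest: str) -> bool:
    """Check a digest (with or without the 'md5:' prefix) against the listing."""
    value = digest.strip().lower()
    if value.startswith("md5:"):
        value = value[len("md5:"):]
    return value in LISTED_MD5
```

For example, `matches_listing(md5_of_file(path))` should be true for an intact copy of either listed file, where `path` points to wherever you saved the download.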

Additional details

Related works

Is supplement to
Preprint: arXiv:2008.00791 (arXiv)