Published June 10, 2022 | Version 1.0
Dataset Open

CoAID dataset with multiple extracted features (both sparse and dense)

Authors/Creators

  • 1. Laboratoire L3i, Université de La Rochelle

Description

This is a publication of the CoAID dataset, which was originally designed for fake news detection. We repurpose it here for event tracking in press documents.

Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset." arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.

In this dataset, we provide multiple features extracted from the text itself. Please note that, for copyright reasons, the text itself is missing from the CSV files published here. You can download the original dataset and manually add the missing texts from the original publications.
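Re-attaching the missing texts can be sketched as a simple join on a document identifier. This is a minimal sketch only: the column names `id`, `label` and `text` are hypothetical and are not taken from the published files, so they should be adjusted to the actual CSV headers.

```python
import csv
import io


def attach_texts(features_csv, texts_by_id, id_column="id", text_column="text"):
    """Return the rows of a features CSV with a text column filled in.

    `id_column` and `text_column` are hypothetical names chosen for this
    sketch; adapt them to the real headers of the downloaded files.
    """
    rows = []
    for row in csv.DictReader(features_csv):
        # Rows whose text could not be recovered keep an empty string.
        row[text_column] = texts_by_id.get(row[id_column], "")
        rows.append(row)
    return rows


# Tiny in-memory stand-in for one of the published CSV files.
sample = io.StringIO("id,label\n1,real\n2,fake\n")
# Texts collected by hand from the original CoAID release, keyed by id.
texts = {"1": "Text recovered from the original CoAID release."}
rows = attach_texts(sample, texts)
```

Keeping an empty string for unmatched rows makes it easy to count how many texts are still missing after the join.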

Features are extracted using:

- A corpus of reference articles in multiple languages for TF-IDF weighting. (features_news) [1]

- A corpus of tweets reporting news for TF-IDF weighting. (features_tweets) [1]

- An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use) [3]

- An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet) [4]
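To illustrate what TF-IDF weighting against a reference corpus means for the sparse features, here is a minimal sketch. It assumes a naive whitespace tokenizer and a common smoothed IDF formula; it is not the exact pipeline used to build features_news or features_tweets, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer, used only to keep the sketch short.
    return text.lower().split()


def fit_idf(reference_docs):
    """Compute smoothed IDF values from a reference corpus."""
    n_docs = len(reference_docs)
    doc_freq = Counter()
    for doc in reference_docs:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    # Weight given to terms never seen in the reference corpus (df = 0).
    unseen_idf = math.log(1 + n_docs) + 1.0
    return idf, unseen_idf


def tfidf(text, idf, unseen_idf):
    """Return a sparse {term: weight} vector for one document."""
    counts = Counter(tokenize(text))
    return {t: tf * idf.get(t, unseen_idf) for t, tf in counts.items()}


# Toy reference corpus standing in for the news or tweet resources of [1].
reference = ["covid vaccine news", "covid cases rise", "covid vaccine trial"]
idf, unseen_idf = fit_idf(reference)
weights = tfidf("covid trial trial", idf, unseen_idf)
```

In this toy example, "covid" occurs in every reference document and therefore gets a lower weight than the rarer "trial". Swapping the reference corpus (news versus tweets) changes the IDF values and hence the resulting feature vectors.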

References:

[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406

[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.

[3]: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1

[4]: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2

Files

coaid_test.csv

Files (2.3 GB)

MD5 checksum                          Size
27fc86170218cbd867ead535511c2a7c      2.3 MB
deaf5d138257f783415bc61a10748b98      397.2 MB
dd09cc2ba7f7d25264cad2d18ca55333      29.2 MB
7ad6d677a573164951d0c538c23878d7      27.9 MB
1abde16683cbc8254011ba181e45d9a4      253.5 MB
53669b76584fc15f403adaccd78721b2      53.0 MB
669186246face0e9684da3750f9152c5      890.3 MB
38eeef88bf85a79786b2bffc79458234      55.9 MB
53669b76584fc15f403adaccd78721b2      53.0 MB
07f3e2f6adbca95950753751d66c75ce      569.5 MB
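After downloading, each file can be checked against the md5 values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the file name in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for the larger files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local file name):
# matches_checksum("downloaded_file.csv", "md5:27fc86170218cbd867ead535511c2a7c")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.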

Additional details

Funding

European Commission
NewsEye: A Digital Investigator for Historical Newspapers (grant no. 770299)