Published November 24, 2017 | Version v1
Dataset Open

Credibility Corpus with several datasets (Twitter, Web database) in French and English

Authors/Creators

  • 1. Université Paris Est

Description

Please cite this dataset as:

Nicolas Turenne. The rumour spectrum. PLoS ONE, Public Library of Science, 2018, 13 (1), pp.e0189080.1-27. ⟨10.1371/journal.pone.0189080⟩⟨hal-01691934⟩

 

These datasets are intended for analyzing information credibility in general (rumor and disinformation in English and French documents) as it occurs on the social web. Targeted databases about rumors, hoaxes and disinformation were used to collect clear cases of misinformation. Selected topics (with keywords) were used to build corpora from the microblogging platform Twitter, a major source of rumors and disinformation.

The collection comprises:

  • 1 corpus of texts from web databases about rumors and disinformation.
  • 4 corpora from Twitter about specific rumors (2 in English, 2 in French).
  • 4 corpora from Twitter built at random (2 in English, 2 in French).
  • 4 corpora from Twitter about specific events (2 in English, 2 in French).

Size of the different corpora:

  • Social Web Rumorous corpus: 1,612

  • French Hollande Rumorous corpus (Twitter): 371
  • French Lemon Rumorous corpus (Twitter): 270
  • English Pin Rumorous corpus (Twitter): 679
  • English Swine Rumorous corpus (Twitter): 1,024

  • French 1st Random corpus (Twitter): 1,000
  • French 2nd Random corpus (Twitter): 1,000
  • English 3rd Random corpus (Twitter): 1,000
  • English 4th Random corpus (Twitter): 1,000

  • French Rihanna Event corpus (Twitter): 543
  • English Rihanna Event corpus (Twitter): 1,000
  • French Euro2016 Event corpus (Twitter): 1,000
  • English Euro2016 Event corpus (Twitter): 1,000

A matrix links the tweets with the 50 most frequent words.

Text data:

  • _id: message id
  • body text: string text data

Matrix data:

  • 52 columns: the first column is the id, the second column is the rumor indicator (1 or -1), and each remaining column is a word, with value 1 if the message contains the word and 0 if it does not.
  • 11,102 lines (each line is a message).
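The row layout described above can be checked mechanically. The following is a minimal sketch in Python, assuming each matrix line is a delimited text record. The delimiter (tab here) and the names `parse_matrix_row` and `MatrixRow` are illustrative assumptions and are not specified by the dataset description.

```python
from typing import List, NamedTuple

# 52 columns = 1 id column + 1 rumor indicator column + 50 word columns.
N_WORD_COLUMNS = 50


class MatrixRow(NamedTuple):
    message_id: str        # first column
    rumor_indicator: int   # second column, 1 or -1
    word_flags: List[int]  # remaining columns, each 0 or 1


def parse_matrix_row(line: str, sep: str = "\t") -> MatrixRow:
    """Parse one matrix line; `sep` is an assumed delimiter, adjust as needed."""
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != N_WORD_COLUMNS + 2:
        raise ValueError(f"expected 52 columns, got {len(fields)}")

    rumor_indicator = int(fields[1])
    if rumor_indicator not in (1, -1):
        raise ValueError(f"rumor indicator must be 1 or -1, got {rumor_indicator}")

    word_flags = [int(value) for value in fields[2:]]
    if any(flag not in (0, 1) for flag in word_flags):
        raise ValueError("word columns must contain only 0 or 1")

    return MatrixRow(fields[0], rumor_indicator, word_flags)
```

The rumor indicator is kept as the raw integer because the description does not state which of the two values marks a rumor.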

  • Hidalgo corpus: line range 1:75
  • Lemon corpus: line range 76:467
  • Pin rumor: line range 468:656
  • swine: line range 657:1311
  • random messages: line range 1312:11103
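The line ranges above can be turned into a small lookup that labels each matrix line with its corpus. This is a sketch only: it assumes line numbers are 1-based with both bounds inclusive, and the names `CORPUS_RANGES`, `corpus_for_line` and `range_sizes` are hypothetical.

```python
from typing import Dict, List, Tuple

# (corpus name, first line, last line), copied from the listed ranges.
# Assumption: 1-based line numbers, both bounds inclusive.
CORPUS_RANGES: List[Tuple[str, int, int]] = [
    ("Hidalgo", 1, 75),
    ("Lemon", 76, 467),
    ("Pin", 468, 656),
    ("swine", 657, 1311),
    ("random", 1312, 11103),
]


def corpus_for_line(line_number: int) -> str:
    """Return the corpus name whose range contains the given line number."""
    for name, first, last in CORPUS_RANGES:
        if first <= line_number <= last:
            return name
    raise ValueError(f"line {line_number} is outside all listed ranges")


def range_sizes() -> Dict[str, int]:
    """Number of lines covered by each range under the inclusive reading."""
    return {name: last - first + 1 for name, first, last in CORPUS_RANGES}
```

Under the inclusive reading, the Pin range covers 189 lines, which agrees with the 189 lines reported for the sample matrix.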

Sample contains: French Pin Rumorous corpus (Twitter): 679

Matrix data:

  • 52 columns: the first column is the id, the second column is the rumor indicator (1 or -1), and each remaining column is a word, with value 1 if the message contains the word and 0 if it does not.
  • 189 lines (each line is a message).

Files

Files (1.1 MB)

  • md5:1c23574795a0befaa4e57ccc5cf0952b (77.1 kB)
  • md5:7dbb12f305f59ccb8856582512e31691 (212.3 kB)
  • md5:8263b7bcfcdcf7d9646a710ef7745f19 (680.4 kB)
  • md5:6eff3c7447407fc082035ee3400f2e06 (102.4 kB)
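A downloaded file can be checked against the MD5 values listed above. The sketch below uses Python's standard `hashlib` module. The function names `md5_of_file` and `matches_checksum` are illustrative, and any file path passed to them is the reader's own, since the file names are not shown in this record.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:1c23...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

MD5 is used here only because it is the checksum the record publishes. It detects corrupted downloads but is not suitable as a security guarantee.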

Additional details

References

  • Nicolas Turenne. The rumour spectrum. PLoS ONE, Public Library of Science, 2018, 13 (1), pp.e0189080.1-27. ⟨10.1371/journal.pone.0189080⟩. ⟨hal-01691934⟩