Augmented dataset of rumours and non-rumours for rumour detection

Sooji Han; Jie Gao; Fabio Ciravegna

doi:10.5281/zenodo.3269768

Published July 5, 2019 | Version 2.0

Dataset Open


1. University of Sheffield

This data set contains a collection of Twitter rumours and non-rumours posted during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo Attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash.

The data set augments the PHEME dataset of rumours and non-rumours and is built from two data sets: the PHEME data [2] (downloaded via https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078) and the CrisisLexT26 data [3] (downloaded via https://github.com/sajao/CrisisLex/tree/master/data/CrisisLexT26/2013_Boston_bombings).

PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aug-rnr-data_full.tar.bz2) contains augmented data for all six events.

aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:

* 2013 Boston marathon bombings: 392 rumours and 784 non-rumours

* 2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours

* 2014 Sydney siege: 1,764 rumours and 3,530 non-rumours

* 2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours

* 2014 Ferguson unrest: 737 rumours and 1,476 non-rumours

* 2015 Germanwings plane crash: 502 rumours and 604 non-rumours

aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:

* 2013 Boston marathon bombings: 323 rumours and 645 non-rumours

* 2014 Ottawa shooting: 713 rumours and 1,420 non-rumours

* 2014 Sydney siege: 1,134 rumours and 2,262 non-rumours

* 2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours

* 2014 Ferguson unrest: 471 rumours and 949 non-rumours

* 2015 Germanwings plane crash: 375 rumours and 402 non-rumours
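The class balance differs between events, which matters when choosing evaluation metrics or sampling strategies. As a small illustration, the sketch below computes the share of rumours per event from the counts listed above for the filtered archive. The dictionary and function names are hypothetical and are not part of the dataset.

```python
# Counts copied from the statistics listed for aug-rnr-data_filtered.tar.bz2:
# (rumours, non-rumours) per event.
FILTERED_COUNTS = {
    "2013 Boston marathon bombings": (323, 645),
    "2014 Ottawa shooting": (713, 1420),
    "2014 Sydney siege": (1134, 2262),
    "2015 Charlie Hebdo Attack": (812, 1673),
    "2014 Ferguson unrest": (471, 949),
    "2015 Germanwings plane crash": (375, 402),
}


def rumour_share(rumours: int, non_rumours: int) -> float:
    """Return the fraction of labelled items that are rumours."""
    return rumours / (rumours + non_rumours)


# Totals across all six events.
total_rumours = sum(r for r, _ in FILTERED_COUNTS.values())
total_non_rumours = sum(n for _, n in FILTERED_COUNTS.values())

for event, (r, n) in FILTERED_COUNTS.items():
    print(f"{event}: {r + n} items, {rumour_share(r, n):.1%} rumours")

print(f"All events: {rumour_share(total_rumours, total_non_rumours):.1%} rumours")
```

Most events have roughly one rumour for every two non-rumours, while the Germanwings event is close to balanced.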

The data structure follows the format of the PHEME data [2]. Each event has a directory with two subfolders, rumours and non-rumours. Each of these contains one folder per source tweet, named with its tweet ID. Within that folder, the 'source-tweet' directory holds the source tweet itself, and the 'reactions' directory holds the set of tweets responding to that source tweet. Each folder also contains 'aug_complete.csv' and 'reference.csv'.
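The layout described above can be traversed with a few lines of Python. The following is a minimal sketch, not an official loader: the `Thread` type and `iter_threads` function are hypothetical names, and the folder names are taken from the description, so they may need adjusting if the extracted archive uses different ones. The sketch only lists files and does not assume any particular file format inside 'source-tweet' or 'reactions'.

```python
from pathlib import Path
from typing import Iterator, List, NamedTuple


class Thread(NamedTuple):
    """One source tweet together with the files that respond to it."""
    label: str                 # "rumours" or "non-rumours"
    tweet_id: str              # name of the per-tweet folder
    source_files: List[Path]   # files under the source-tweet directory
    reaction_files: List[Path] # files under the reactions directory


def _files(directory: Path) -> List[Path]:
    """Return the regular files in a directory, or [] if it is missing."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())


def iter_threads(
    event_dir: Path,
    source_name: str = "source-tweet",   # assumed from the description
    reactions_name: str = "reactions",   # assumed from the description
) -> Iterator[Thread]:
    """Yield one Thread per tweet-ID folder under an event directory."""
    for label in ("rumours", "non-rumours"):
        label_dir = event_dir / label
        if not label_dir.is_dir():
            continue
        for thread_dir in sorted(label_dir.iterdir()):
            # Skip plain files such as aug_complete.csv and reference.csv.
            if not thread_dir.is_dir():
                continue
            yield Thread(
                label=label,
                tweet_id=thread_dir.name,
                source_files=_files(thread_dir / source_name),
                reaction_files=_files(thread_dir / reactions_name),
            )
```

Calling `list(iter_threads(Path("path/to/event")))` on an extracted event directory would return one entry per source tweet; counting entries by `label` gives a quick check against the published statistics.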

The 'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of the augmented tweets before deduplication and before the removal of tweets without context (i.e., replies).

The 'reference.csv' file contains the manually annotated reference tweets [2, 3].
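Because the exact column names and column order of the two CSV files are not stated here, it is worth inspecting the first few rows before writing a parser. The sketch below does only that; `preview_csv` is a hypothetical helper, and UTF-8 encoding is an assumption.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import List


def preview_csv(path: Path, n_rows: int = 5) -> List[List[str]]:
    """Return the first n_rows of a CSV file as lists of strings.

    No header row or column order is assumed; the caller inspects the
    result to find out how the file is organised.
    """
    # newline="" lets the csv module handle line breaks inside quoted
    # fields, which can occur in tweet text.
    with path.open(newline="", encoding="utf-8") as handle:
        return list(islice(csv.reader(handle), n_rows))
```

Using the `csv` module rather than splitting on commas matters for these files, since tweet text can itself contain commas, quotes, and line breaks.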

If you use our augmented data (PHEME-Aug v2.0), please also cite:

[1] Han, S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August, 2019.

==============================================================================================

[2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. COLING.

[3] Olteanu, A., Vieweg, S., & Castillo, C. (2015, February). What to expect when the unexpected happens: Social media communications across crises. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 994-1009). ACM.

Files (258.7 MB)

Name	MD5	Size
aug-rnr-data.tar.bz2	05017d1d69563e01c372828aff5ccd80	650.4 kB
aug-rnr-data_filtered.tar.bz2	6c7b358e5ee6c68fe3403ab4faf60e01	183.3 MB
aug-rnr-data_full.tar.bz2	7f0c0dc3ca426f3e0d13427c004da00e	74.8 MB
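
The MD5 values in the file listing can be used to check that a download is complete before extracting it. The following is a minimal sketch using only the Python standard library; the `md5_of` and `verify` helpers are hypothetical names, and the checksums are copied from the listing above.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "aug-rnr-data.tar.bz2": "05017d1d69563e01c372828aff5ccd80",
    "aug-rnr-data_filtered.tar.bz2": "6c7b358e5ee6c68fe3403ab4faf60e01",
    "aug-rnr-data_full.tar.bz2": "7f0c0dc3ca426f3e0d13427c004da00e",
}


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat for large archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Compare a downloaded archive against its listed checksum.

    Raises KeyError if the file name is not one of the listed archives.
    """
    return md5_of(path) == EXPECTED_MD5[path.name]
```

Once `verify` returns `True`, the archive can be unpacked with the standard `tarfile` module, for example by opening it in `"r:bz2"` mode.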

Additional details

Han, S., Gao, J., Ciravegna, F. (2019). "Data Augmentation for Rumor Detection Using Context-Sensitive Neural Language Model With Large-Scale Credibility Corpus", Seventh International Conference on Learning Representations (ICLR) LLD, May 2019, New Orleans, Louisiana, US
