
This folder contains the dataset for the paper:
Soumi Dutta, Vibhash Chandra, Kanav Mehra, Asit Kumar Das, Tanmoy Chakraborty, Saptarshi Ghosh. Ensemble Algorithms for Microblog Summarization. IEEE Intelligent Systems (Special Issue on Summarization of Things), vol. 33, no. 3, pp. 4--14, May/June 2018. 


* This folder contains, apart from this README file, the following:

+ Folder "input_datasets" containing four text files, giving the four tweet datasets used in the paper:
(1) hblast_input_data.txt - 1413 distinct tweets related to Bomb blasts in Hyderabad, India
(2) hagupit_input_data.txt - 1461 distinct tweets related to Typhoon Hagupit in the Philippines
(3) uflood_input_data.txt - 2069 distinct tweets related to Floods in Uttaranchal state of India
(4) sandyhook_input_data.txt - 2080 distinct tweets related to the Sandy Hook elementary school shooting in the USA


+ As specified in the paper, three human annotators were asked to independently summarize each of the four tweet datasets. The folder "gold-standard-summaries" contains three sub-folders: "annotator1", "annotator2", and "annotator3". The sub-folder "annotator1" contains four text files, giving the gold standard (extractive) summaries of the four datasets generated by Annotator 1. Similarly, the sub-folders "annotator2" and "annotator3" contain the gold standard summaries generated by Annotator 2 and Annotator 3, respectively. The dataset that a particular gold standard summary file corresponds to should be evident from the name of the file.
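
A minimal Python sketch for reading this layout is given here for convenience; it is not part of the original release. It relies on the format described in this README (one tweet per line, each line ending in a full stop). The function names read_tweets and load_gold_summaries are hypothetical, the UTF-8 encoding is an assumption, and the gold standard file names are discovered from the folder rather than assumed.

```python
from pathlib import Path


def read_tweets(path, strip_stop=True):
    """Return the tweets stored in one dataset or summary file (one per line)."""
    tweets = []
    # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            text = line.strip()
            if strip_stop and text.endswith("."):
                text = text[:-1].rstrip()  # drop the terminating full stop
            if text:  # skip blank lines
                tweets.append(text)
    return tweets


def load_gold_summaries(root):
    """Map each annotator folder name to {file name: list of tweets}."""
    gold_dir = Path(root) / "gold-standard-summaries"
    summaries = {}
    for annotator_dir in sorted(gold_dir.glob("annotator*")):
        if annotator_dir.is_dir():
            summaries[annotator_dir.name] = {
                f.name: read_tweets(f)
                for f in sorted(annotator_dir.iterdir())
                if f.is_file()
            }
    return summaries
```

For example, read_tweets("input_datasets/hblast_input_data.txt") would be expected to return 1413 tweets, going by the counts listed above. Pass strip_stop=False to keep each line exactly as stored.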


Please note the following:

* Every file in the folders mentioned above contains the text of one tweet per line. Each line is terminated by a full stop (.), which may be ignored if desired.

* The annotators were initially asked to write an extractive summary of 30 tweets for each dataset. In other words, they were asked to select 30 tweets covering all the important information in a particular dataset. However, in some cases, the annotators felt that 30 tweets were not sufficient for this, and they were allowed to select a slightly higher number of tweets. Hence the gold standard summaries described above may contain varying numbers of tweets.


* If you wish to use the dataset, kindly cite the following paper: 
Soumi Dutta, Vibhash Chandra, Kanav Mehra, Asit Kumar Das, Tanmoy Chakraborty, Saptarshi Ghosh. Ensemble Algorithms for Microblog Summarization. IEEE Intelligent Systems (Special Issue on Summarization of Things), vol. 33, no. 3, pp. 4--14, May/June 2018. 


* For any further queries, contact Saptarshi Ghosh (saptarshi [dot] ghosh [at] gmail [dot] com)


