Going Beyond Content Richness: Verified Information Aware Summarization of Crisis-Related Microblogs

High-impact catastrophic events (bomb attacks, shootings) trigger the posting of large volumes of information on social media platforms such as Twitter. Recent works have proposed content-aware systems for summarizing this information, thereby facilitating post-disaster services. However, a significant proportion of the posted content is unverified, which restricts the practical usage of existing summarization systems. In this paper, we address the novel task of generating verified summaries of information posted on Twitter during disasters. We first jointly learn representations of the content classes and expression classes of tweets posted during disasters using a novel LDA-based generative model. These representations of content & expression classes are used in conjunction with pre-disaster user behavior and temporal signals (replies) for training a Tree-LSTM based tweet-verification model. The model infers tweet verification probabilities which are used, alongside the information content of tweets, in an Integer Linear Programming (ILP) framework for generating the desired verified summaries. The summaries are fine-tuned using the class information of the tweets obtained from the LDA-based generative model. Extensive experiments performed on a publicly available labeled dataset of man-made disasters demonstrate the effectiveness of our tweet-verification (3-13% gain over baselines) and summarization (12-48% gain in verified content proportion, 8-13% gain in ROUGE score over the state-of-the-art) systems. We make implementations of our various modules available online.


INTRODUCTION
Over the past decade, social networking platforms such as Twitter have become important sources of real-time information, especially during high-impact catastrophic events such as bomb attacks and shootings. Recent research has shown the potential of utilizing this abundant, accessible data for facilitating post-disaster services [35,40]. Researchers have proposed robust systems for increasing situational awareness during disasters, typically by generating event summaries of posts published on Twitter; these systems focus on maximizing situational content in the summaries [4,28,34,35]. The effectiveness of these systems, however, is severely restricted by the significant proportion of false & unverified tweets posted on Twitter alongside true and trustworthy facts [5,15]; the situation is worse in the case of man-made disasters due to their easily exploitable psycho-social impacts on the masses (panic, stress, mental trauma). Furthermore, unverified tweets (which may subsequently turn out to be false) are sometimes unintentionally propagated by popular personalities (politicians, celebrities), resulting in noteworthy attention among the masses [21].
The summary of the situation during a crisis event is required in real time so that the respective authorities and stakeholders can take immediate action -that is, while various pieces of information are still emerging and many of the tweets are unsubstantiated. A verification-aware summarization system thus has to make a judgment on the authenticity of individual tweets with limited secondary data to verify them. Considering the hardness of the problem, none of the existing summarization systems [4,19,28,29,34,35] explicitly attempt to minimize the unverified information in the summary; they concentrate only on increasing the richness of content. In this paper, we propose a novel but simple pipeline to generate verified summaries; we compute the probability of a tweet being verified (we term it the verification score) and then jointly exploit the information content & verification score for generating summaries.
For computing the verification score, we train a Tree-LSTM based architecture which can elegantly model the phenomenon of a tweet being published and several replies/counter-replies being posted as a reaction. The model takes as input the user's pre-disaster behavior, information about the content class of a tweet (and its replies), and the manner in which the tweet has been expressed -this information is efficiently encapsulated using a novel LDA-based generative process. Note that the task of computing a verification score for each tweet has some similarities with fake-news / fake-event detection [25,33,43]; however, there are certain important differences. Fake-news detection systems usually predict the credibility of a very specific piece of news which is being discussed by a large number of users; such systems thus have abundant signals to work with (typically 500-1000 tweets per news item). In contrast, due to the wide variety of news developing during a disaster, the signals for many of the tweets are inadequate, mainly due to limited discussions surrounding them. Therefore, in our model, we put emphasis on exploiting the linguistic and behavioral dynamics of tweets/users for determining the authenticity of tweets ( §4). This helps us perform much better than state-of-the-art fake-news detection algorithms; our model beats such baselines by 3-13% in terms of F1-score on a publicly available and expert-curated labeled dataset of four man-made disaster events [49] ( §6). Further, we use an Integer Linear Programming (ILP) framework for generating a summary of a crisis from the tweets posted during the event. We use both the information content & verification scores of tweets as optimization parameters to generate a high-quality summary ( §5). We perform a detailed, careful study of the output generated at various steps, which helps us iteratively fix the weights of various components of our framework.
The generated summaries of the four man-made disasters have an exceptionally high proportion of verified content (12-48% gain over the state-of-the-art) while still maintaining high ROUGE scores & content richness ( §7). Qualitatively analyzing these summaries helps us understand the robustness of our framework; the robustness is also evident in a case study of the 2019 Sri Lankan Attacks, which we examine in §7.4.

RELATED WORKS
Detecting whether a piece of disaster-related information is true, false or unverified is a relatively easy task if performed a long time after the disaster; the supporting news articles and related webpages can directly provide this knowledge. Unfortunately, these articles are unavailable while disaster-related tweets are still emerging, which makes the task of tweet verification challenging. In this section, we discuss works on the tweet verification task and the systems summarizing crisis-related tweets. Handling unverified information: The potential of social media platforms -to be used as a source of creating and spreading unverified information -has triggered a lot of work on its analysis, detection and verification [9,12]. Most unverified-tweet detection research is focused on designing hand-crafted features from tweets such as tweet and user features [6], locations [45], and multimedia [38]. Other approaches use public opinion, belief identification [24], regular expressions [48], temporal patterns of tweets [22,26], and misinformation cascades [11,14,27]. Deep learning models (RNNs) have also been explored to capture verification signals [8]. More recently, researchers have analyzed unverified information posted during disasters [1,42,46,47]. Zeng et al. [47] proposed a classifier to predict the stances (affirmation/denial) toward unverified messages posted during disasters. Affirmative tweets have to wait longer to get retweeted by other users than denial tweets do [46]. Starbird et al. [37] analyzed the role of journalists in posting/correcting unverified messages during crisis events. The focus of our work is on creating a disaster-specific tweet verification model and using it for generating verified summaries of tweets posted during man-made disasters. Our analyses show that, in the context of disasters, static Twitter attributes are not helpful for the tweet verification task ( §4.2).
Moreover, the work of Zeng et al. [47] suggests that during disasters, the number of affirmative replies to a false tweet is greater than the number of denial replies, making replies alone insufficient for the verification task. Hence, additional signals are required; we make use of the content classes of man-made disasters and the ways in which tweets are expressed. Note that neither of these signals has been explored by prior works, which have been limited to using standard word embeddings (e.g., word2vec) that do not provide control over differentiating between the two signals. Our work is different from the tasks of fake-news detection [25,33,43] and fact-checking [41,44], both in context and in the challenges posed. Fake-news detection systems usually work on a group of tweets related to a particular piece of potentially false news (e.g., a celebrity getting married); the number of tweets for each event is usually large, which provides these systems with many signals to work with. A wide variety of news and a number of subevents [36] develop during a disaster, and the signals for many of them are inadequate. Nevertheless, we have evaluated state-of-the-art fake-news detection methods [25,33] and shown that they do not adapt well to the scenario of crisis-related tweets (Table 4). Likewise, fact-checking systems operate on public verbal statements of politicians & celebrities, which are outside the scope of our work. Tweet summarization: Several efforts have been made by researchers for generating summaries of large tweet streams [7,18,20]. For the task of disaster-specific tweet-stream summarization, researchers have worked on maximizing situational [35], actionable [30], salient [19] and sub-event [36] content in their summaries. However, all these systems work on an inherent assumption that the posts being used by them are verified and trustworthy.
In this paper, for the first time, we make use of the verified/unverified knowledge of posts in a disaster-specific tweet-stream summarization setting. The existing summarization systems optimize only on content proportion; the proposed verification-summarization system attempts to simultaneously optimize on content & verified proportions, allowing end users to retrieve verified & content-rich summaries of tweet streams.

DATASET
We use the dataset created by Zubiaga et al. [49] and focus on the tweets from the following four human-triggered disaster events: (i) Charlie Hebdo Shooting (Jan'15): Shootings involving the killing of 11 people in the offices of the French newspaper, Charlie Hebdo, in Paris, (ii) Germanwings Plane Crash (Mar'15): Deliberate crashing of a passenger plane by a co-pilot in the French Alps, (iii) Ottawa Shooting (Oct'14): Shootings which occurred at Parliament Hill in Ottawa, and (iv) Sydney Siege (Dec'14): 10 customers and 8 employees of a Lindt chocolate cafe held hostage in Sydney. The dataset consists of highly-retweeted tweets from these disaster events. We divide the tweets in the dataset into two parts: source tweets and replies to the source tweets -relevant tags available in the dataset allow us to do so. Each source tweet in the dataset was labeled either verified or unverified by a team of journalists as per the scheme developed in [50]; the annotation scheme required the journalists to analyze tweets and their replies on certain parameters (supporting evidence, certainty). We note that the entire annotation process is complex; it is time-consuming and requires domain expertise, and thus it was not prudent for us to create a new dataset. Rather, we reused this dataset, which should allow the research community to easily compare against our results. We consider the annotations of Zubiaga et al. as the gold standard. Note that this gold standard was developed while the disaster events were either still in progress or had just finished. In §7.3, we relook at a part of the unverified tweets to check if some of them have later been verified. Table 1 contains the statistics.

TWEET VERIFICATION MODULE
Our tweet verification module infers a verification score corresponding to a given tweet. It is driven by the following three hypotheses: (a) tweets posted during disasters differ significantly in the way they are expressed across the verified and unverified categories, with the actual quantum of difference varying according to the type of content they convey; (b) a user's pre-disaster tweeting behavior is indicative of her verified-information-posting tendency; and (c) the set of replies to a tweet provides us with temporal signals valuable for decoding the truth value of the tweet.
We first learn a latent space jointly capturing the type of content posted during disasters and the way in which that content is expressed ( §4.1). We next capture the pre-disaster behavior of users ( §4.2). Finally, the tweet and its set of replies are modeled using Tree-based LSTMs for learning the tweet verification task ( §4.3).

Content-Expression Topic Model
Earlier studies [16,17] have identified content classes for tweets posted during natural disasters (earthquake, flood) -'infrastructure damage', 'shelter & service', etc. Similar to natural disasters, we expect the man-made catastrophic events to attract tweets belonging to a finite number of content classes. Moreover, in the context of tweet verification, the way in which the content is presented and communicated is likely to determine the authenticity of the content. For example, if a tweet is tentatively structured, it is indicative of it being unverified. Figure 1 shows the distribution of a few tentative and definite words across verified & unverified classes. The figure shows that unverified tweets are more prone to use tentative words while verified tweets are more prone to use definite words (Differences are statistically significant -Welch's t-test, p < 0.05).
In order to exploit these differences between verified and unverified tweets in content and their means of expression, we learn a unique latent space which jointly captures both these aspects; for learning this latent space, we design a novel LDA-based generative model -the Content-Expression Topic Model (CETM). We assume that the major topics of interest during disasters revolve around a set of content words [35]. Content words constitute the terms which convey the key information present in a tweet. They comprise -(i) Nouns: e.g., police, terrorists; (ii) Verbs: e.g., killed, injured; and (iii) Numerals. The set of content words in the case of disasters (similar across disasters & limited in number) differs in its characteristics from that of generic events (varying across events & growing linearly), which makes generalization easier [35]. Furthermore, we also want these topics to capture expression words present in the tweets. Expression words comprise terms which depict the following psycho-linguistic characteristics -(i) Tentativeness: e.g., probably, reported, maybe; (ii) Certainty: e.g., confirmed, assure, must; (iii) Negation: e.g., can't, isn't, neither; and (iv) Enquiring: e.g., how, what, why. They enable us to model the manner in which both the source tweets (tentative or certain) and the replies (enquiring or denying) are phrased. We present the details of CETM next.
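The content/expression split described above can be sketched in a few lines. The paper relies on Twitie POS tags and LIWC lexicons; the tiny word lists and POS labels below are illustrative stand-ins, not the actual resources.

```python
# Toy stand-ins for LIWC categories and Twitie POS tags (hypothetical).
CONTENT_POS = {"NOUN", "VERB", "NUM"}        # nouns, verbs, numerals
EXPRESSION_LEXICON = {
    "tentative": {"probably", "reported", "maybe"},
    "certainty": {"confirmed", "assure", "must"},
    "negation":  {"can't", "isn't", "neither"},
    "enquiring": {"how", "what", "why"},
}

def split_words(tagged_tokens):
    """tagged_tokens: list of (token, pos) pairs for one tweet."""
    content, expression = [], []
    for tok, pos in tagged_tokens:
        low = tok.lower()
        if any(low in words for words in EXPRESSION_LEXICON.values()):
            expression.append(low)               # psycho-linguistic term
        elif pos in CONTENT_POS:
            content.append(low)                  # key information term
    return content, expression

tweet = [("Police", "NOUN"), ("probably", "ADV"), ("killed", "VERB"),
         ("2", "NUM"), ("suspects", "NOUN")]
content, expression = split_words(tweet)
```

With this input, `content` holds the nouns/verb/numeral and `expression` the single tentative marker, mirroring the two word sets CETM models.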
4.1.1 Generative process of CETM. Let T be the set of tweets, C_v be the content-word set, E_v be the expression-word set and J be the set of tweet categories (Tweet/Reply). We define each tweet as a combination of its content words, expression words, and tweet category. While the motivation for using content words and expression words follows directly from our observations, using the tweet category helps us distinguish source tweets from replies. Let K^(c) be the set of content topics (describing content classes) and K^(e) be the set of expression topics (describing the communication characteristics -expression & tweet category). A user who wants to post a tweet first chooses a content topic k_c, and then selects an expression topic k_e under k_c to determine her communication mechanism for the content. The user then generates the set of content words of the tweet from k_c, and the set of expression words (along with the tweet category) from k_e.
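The generative story above can be simulated forward as a sanity check. Everything here is illustrative -the vocabularies, topic counts and distributions are hypothetical, and the paper's actual parameters are inferred from data via Gibbs sampling rather than specified by hand.

```python
import random

random.seed(0)
K_C, K_E = 3, 2                                   # toy topic counts
content_vocab = ["police", "killed", "hostage", "plane"]
expr_vocab = ["probably", "confirmed", "why"]

def sample_tweet(theta_c, theta_e_given_c, phi_c, phi_e, n_c=3, n_e=1):
    # 1) choose a content topic, 2) choose an expression topic under it,
    # 3) emit content words from k_c and expression words from k_e.
    k_c = random.choices(range(K_C), weights=theta_c)[0]
    k_e = random.choices(range(K_E), weights=theta_e_given_c[k_c])[0]
    words = (random.choices(content_vocab, weights=phi_c[k_c], k=n_c)
             + random.choices(expr_vocab, weights=phi_e[k_e], k=n_e))
    return k_c, k_e, words

theta_c = [0.5, 0.3, 0.2]                         # content-topic prior
theta_e_given_c = [[0.8, 0.2]] * K_C              # expression topic | content topic
phi_c = [[0.4, 0.3, 0.2, 0.1]] * K_C              # word dist. per content topic
phi_e = [[0.5, 0.3, 0.2]] * K_E                   # word dist. per expression topic
k_c, k_e, words = sample_tweet(theta_c, theta_e_given_c, phi_c, phi_e)
```

The hierarchical choice (expression topic conditioned on content topic) is what lets CETM tie an expression style to the kind of content being conveyed.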

Inferring CETM's parameters:
We use a collapsed Gibbs sampling approach for inferring the model parameters. For each tweet t_i, the sampler draws a content-topic assignment k_c with likelihood proportional to the topic's current popularity and the probability of t_i's content words under k_c, and then an expression-topic assignment k_e (under k_c) with likelihood proportional to the probability of t_i's expression words and tweet category under k_e; here n_{k_c} and n_{k_e}, the counts of tweets currently assigned to topics k_c and k_e respectively, drive the popularity terms. We check if CETM is able to learn distinctive topics encapsulating the desired content classes and their ways of expression. We use Twitie [3] for POS-tagging tweets and utilize the POS tags for extracting content words. We obtain word lists of the four psycho-linguistic characteristics using LIWC [31]. The words of a tweet are searched in these lists for extracting expression words. We initialize the Dirichlet priors using the well-established strategy of [10]. We set the number of content topics (|K^(c)|) to 30 and the number of expression topics (|K^(e)|) to 10, the combination which obtains the lowest perplexity value. We investigate the topics learned by CETM: we check the top words in each content topic & expression topic and try to assign them a content class and an expression class (Table 2). We observe 4 major content classes -Affected Individuals, Investigations, Affected Regions, and Event-Specific. Table 2 contains the top words in a few sample topics with their respective classes. Note that this work is one of the first to provide insights into the classes prevalent during man-made catastrophes, which are much different from natural-disaster classes. We derive a representation of each tweet in the latent space learned by CETM; the representation contains the probability of that tweet belonging to the content topics (describing content classes) and the expression topics (describing expression classes). This representation is used as a feature in our tweet verification model ( §4.3).
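The tweet representation described above is, in essence, a concatenation of the tweet's content-topic and expression-topic probability vectors. A minimal sketch, assuming the |K^(c)| = 30 and |K^(e)| = 10 setting chosen above:

```python
def tweet_representation(p_content, p_expression):
    """Concatenate a tweet's topic-probability vectors into one feature.

    p_content: probabilities over the 30 content topics (sums to 1).
    p_expression: probabilities over the 10 expression topics (sums to 1).
    """
    assert abs(sum(p_content) - 1.0) < 1e-6
    assert abs(sum(p_expression) - 1.0) < 1e-6
    return list(p_content) + list(p_expression)

# Uniform toy distributions; real values come from the fitted CETM.
rep = tweet_representation([1 / 30] * 30, [1 / 10] * 10)
```

The resulting 40-dimensional vector is what gets fed (together with the regularity scores of §4.2) into the Tree-LSTM units of §4.3.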

Incorporating Pre-Disaster User Behavior
We next investigate user attributes relevant for tweet verification. 4.2.1 Static user attributes. Previous research has mostly focused on certain Twitter-specific user attributes as an integral part of tweet verification models [6,49]. We check the distribution of four heavily used attributes (follower count, ratio of follower and following counts, age of user, i.e., time elapsed since the user joined Twitter, and status count, i.e., total number of tweets posted) over users who posted verified and unverified tweets respectively (henceforth referred to as verified and unverified users). We perform the two-sample Kolmogorov-Smirnov test (ks2stat) to check whether the difference between verified and unverified users is statistically significant. We obtained ks2stat scores of 0.044, 0.028, 0.046, and 0.090 (p-value > 0.1) for the above-mentioned four user features respectively. The differences are clearly not statistically significant, and these static user features are unlikely to contribute much to model efficiency, especially in the case of disasters.
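The ks2stat value reported above is the supremum gap between the two empirical CDFs. A from-scratch sketch of that statistic, on synthetic follower counts (the real test would be run on the dataset's user attributes, e.g. via `scipy.stats.ks_2samp`):

```python
import bisect

def ks_2samp_stat(a, b):
    """Two-sample KS statistic D = sup_t |F_a(t) - F_b(t)|."""
    a, b = sorted(a), sorted(b)
    def ecdf(xs, t):
        # fraction of the sorted sample xs that is <= t
        return bisect.bisect_right(xs, t) / len(xs)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in sorted(set(a) | set(b)))

# Synthetic follower counts for illustration only.
verified_followers = [120, 340, 560, 800, 1500]
unverified_followers = [100, 330, 570, 790, 1400]
d = ks_2samp_stat(verified_followers, unverified_followers)
```

A small D (here 0.2 on near-identical toy samples) with a large p-value is exactly the pattern the four static attributes exhibit, motivating the behavioral features that follow.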

Pre-Disaster behavior of users.
We next inspect the behavior of users as observed on Twitter just before the time of the disaster. For this, we first extract all the tweets posted by a user in the two-month time range leading up to the disaster using Twitter's Advanced Search functionality 2. Similar to our tweet analysis, we check the degree of a user's tentativeness & certainty in that time range based on the tweets extracted (Figure 2). We observe that a larger percentage of unverified users use tentative words before the disaster as compared against verified users. On the other hand, a larger percentage of users who predominantly post verified tweets during disasters make use of definite words. This analysis indicates that a user's pre-disaster behavior is an important determining factor behind her posting unverified information during disasters, and we incorporate this behavior in our model. We analyze the degree of a user's psycho-linguistic characteristics (Tentativeness, Certainty, Negation, & Enquiring) by computing how frequently the user uses a word belonging to each of these four classes. We obtain word lists of these four classes using LIWC [31] (same as §4.1). Let W_{u,i} be the set of words in the tweets posted by user u, i days before the disaster. We define r^(c)_u, the regularity score of user u in class c (with word set W^(c)), as: r^(c)_u = |{i : W_{u,i} ∩ W^(c) ≠ ∅}|, i.e., the number of days on which user u posted a word belonging to class c in the two-month time range. For each user u, we use her regularity scores across the 4 psycho-linguistic classes (r^(c)_u) to account for her pre-disaster behavior. These 4 scores act as 4 features which are used, along with representations in CETM's latent space, in our tweet verification model. Please note that the topic model (CETM) developed in the last section cannot be used here, as the tweets posted in the examined two-month time range may not be disaster-related.
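The regularity score is a day-count: how many of the pre-disaster days saw the user post at least one word of the class. A minimal sketch, with a toy stand-in for the LIWC tentativeness list:

```python
# Toy stand-in for the LIWC "tentative" word list (hypothetical).
TENTATIVE = {"probably", "maybe", "perhaps"}

def regularity_score(daily_word_sets, class_words):
    """r_u^(c): number of days on which the user posted >=1 word of class c.

    daily_word_sets: one set of words per day in the pre-disaster window.
    """
    return sum(1 for day in daily_word_sets if day & class_words)

# Four days of (toy) pre-disaster activity for one user.
days = [{"probably", "rain"}, {"sunny"}, {"maybe", "match"}, set()]
r = regularity_score(days, TENTATIVE)
```

Computing this for each of the four psycho-linguistic classes yields the 4-dimensional behavior feature appended to the CETM representation.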

Regularity score vis-a-vis verified-tweet-posting tendency.
We examine if the regularity score, obtained using pre-disaster tweeting behavior of users, represents their verified and unverified tweet posting tendencies. We find the distributions of regularity scores of users who posted unverified and verified tweets respectively to be significantly distinct (two-sample KS test statistic values of 0.2539 and 0.2731 (p-value < 0.001) for tentativeness and certainty classes respectively). The tweet verification model is described next.

Tweet verification using Tree-LSTMs
A source tweet which is to be verified, along with its set of replies, forms a tree-like structure, and we aim to preserve this structure in our model for effectively capturing the underlying nature of Twitter (generalizable to most social networks). For modeling tree-structured network topologies, Tai et al. [39] introduced the Tree-LSTM, an extension of the basic LSTM architecture in which each LSTM unit incorporates information from multiple child units. An example Tree-LSTM network for the tree structure of replies is shown in Figure 3a. Here, each node is an LSTM unit which takes as input: (i) the representation of the tweet in the generated latent space plus the regularity scores of the user who posted it (x_i), and (ii) the hidden states of its child nodes. It uses them to update its own input gate, output gate, forget gate, hidden state and memory cell values.
In our tweet verification task, we work under the simplified assumption that only the source tweets (and not replies) need to be verified. Thus, we wish to derive the probabilities of a source tweet s being verified and unverified. With Tree-LSTMs, this corresponds to computing a verification score for the root node of each tree. At each root node s, we use a softmax classifier to obtain a probability distribution over the verified and unverified classes, given the input {x}_s observed in the tree with root node s. The classifier takes the hidden state h_s at the root node s as input and computes the probabilities as: p̂_s = softmax(W^(s) h_s + b^(s)), ŷ_s = argmax_y p̂_s(y), where W^(s) and b^(s) are the weight matrix and bias values of the final tweet verification network, ŷ_s is the predicted label (verified/unverified), and V^(s) = p̂_s(verified) is the verification score of the tweet with root node s. The cost function is the negative log-likelihood of the true labels y^(s) at each root node, defined for m training instances as: J(θ) = -(1/m) Σ_{k=1}^{m} log p̂(y^(k)) + (λ/2) ||θ||², where λ is an L2 regularization hyperparameter. The height of the tree affects the time complexity of training (back-propagating to the leaves) and testing (a forward pass from leaves to root). We inspect the average height of the trees formed, described by the levels at which replies are present. Figure 3b shows the CDF of reply levels. As can be observed, 80-90% of the replies are at level ≤ 5, which acts as a limiting factor on these complexities.
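The child-sum update each node performs can be shown with a deliberately scalar sketch of the Tai et al. cell: real units are vector-valued with learned weight matrices, and the weights below are arbitrary placeholders, not trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tree_lstm_node(x, children, W):
    """One scalar child-sum Tree-LSTM unit.

    children: list of (h_k, c_k) pairs from child nodes (replies).
    W: dict of scalar weights standing in for the learned matrices.
    """
    h_sum = sum(h for h, _ in children)                      # sum child states
    i = sigmoid(W["wi"] * x + W["ui"] * h_sum + W["bi"])     # input gate
    o = sigmoid(W["wo"] * x + W["uo"] * h_sum + W["bo"])     # output gate
    u = math.tanh(W["wu"] * x + W["uu"] * h_sum + W["bu"])   # candidate
    # one forget gate per child, applied to that child's memory cell
    c = i * u + sum(sigmoid(W["wf"] * x + W["uf"] * h_k + W["bf"]) * c_k
                    for h_k, c_k in children)
    h = o * math.tanh(c)
    return h, c

W = dict(wi=1.0, ui=0.5, bi=0.0, wo=1.0, uo=0.5, bo=0.0,
         wu=1.0, uu=0.5, bu=0.0, wf=1.0, uf=0.5, bf=0.0)
leaf1 = tree_lstm_node(0.2, [], W)        # a reply (leaf node)
leaf2 = tree_lstm_node(-0.1, [], W)       # another reply
root = tree_lstm_node(0.5, [leaf1, leaf2], W)  # source tweet (root)
```

The root's hidden state `root[0]` plays the role of h_s above: it is what the softmax classifier consumes to produce the verification score.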

Figure 3: (a) Example of a Tree-LSTM network with the source tweet as root; (b) CDF of reply levels.

VERIFIED SUMMARIZATION OF TWEETS
We describe our disaster-specific verified tweet-stream summary generation system next.

Filtering Non-Situational Content
Prior works [32,35] have shown that information posted on Twitter during disasters can be divided into two major classes -(i) Situational (information which provides updates about the current situation), and (ii) Non-Situational (sympathy or opinions of people). During disasters, end users like humanitarian organizations (OCHA, RedCross, etc.) and government agencies are largely interested in situational updates; thus, in the context of crisis-specific summarization, summaries of only the tweets belonging to the situational class are of primary importance. We classify all the tweets in our dataset into situational and non-situational classes and remove the tweets belonging to the non-situational class, using the situational tweet classifier developed by Rudra et al. [35] 3. Table 3 shows the statistics of situational and non-situational content present in the verified and unverified tweets 4. Out of the 1739 unverified tweets, 1425 (around 82%) are situational. This means that 82% of the unverified tweets would be part of the input streams provided to existing summarization frameworks and thus might get inadvertently included in the summary. We try to minimize this unverified content by making use of tweet verification scores in our summarization framework, which we describe next.

Proposed Summarization Framework
The current state-of-the-art in real-time, unsupervised, disaster-specific extractive summarization of tweet streams is the Integer Linear Programming (ILP) based system proposed by Rudra et al. [35], which tries to maximize the coverage of content words (nouns, numerals, main verbs, locations) in the summary (COWTS). A summary of L words is achieved by optimizing an ILP objective function, whereby the highest-scoring tweets are returned as the output of summarization. Moreover, duplicate tweets are removed from the summarization framework and the weights of the content words are multiplied by binary indicators; this, in turn, brings diversity to the summary by capturing different content words. In this paper, we increase the proportion of verified tweets contained in the summary by utilizing the verification scores obtained through Tree-LSTMs ( §4.3) in the objective function of COWTS: we multiply the indicator variable of each tweet by its verification score. The modified objective function for generating a summary of L words from n tweets having a total of m content words is:

max: Σ_{i=1}^{n} γ_v · V(i) · x_i + Σ_{j=1}^{m} Score(j) · y_j    (8)

subject to the constraints

Σ_{i=1}^{n} x_i · Length(i) ≤ L;  Σ_{i∈T_j} x_i ≥ y_j ∀j;  Σ_{j∈C_i} y_j ≥ |C_i| · x_i ∀i    (9)

where x_i is the indicator variable of tweet i (1 if tweet i should be included in the summary, 0 otherwise), y_j is the indicator variable for content word j, V(i) is the verification score of tweet i, Score(j) is the tf-idf score of content word j normalized between 0 and 1, Length(i) is the number of words in tweet i, T_j is the set of tweets in which content word j is present, and C_i is the set of content words present in tweet i. γ_v is a hyperparameter which controls the degree of verified content desired in the final summary. The objective function accounts both for the likelihood of a tweet being verified (via V(i)) and for the number of important content words in the tweet (via Score(j)); γ_v allows the system to trade off between these two factors. The three constraints ensure consistency w.r.t. the desired length of the summary and the inclusion or exclusion of tweets & content words.
We term this new model VERISUMM. We use GUROBI Optimizer [13] to solve the ILP. After solving, the set of tweets i with x i = 1, represent the summary.
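On a tiny instance, the selection VERISUMM makes can be illustrated by brute force. The paper solves the ILP with GUROBI; the enumeration below is only a didactic stand-in, and the exact placement of γ_v in the objective is our reading of the text. All tweet lengths, scores and verification values are invented.

```python
from itertools import product

# Toy instance: 3 tweets, word budget L, toy tf-idf scores.
tweets = [
    {"len": 5, "V": 0.9, "content": {"police", "killed"}},
    {"len": 6, "V": 0.3, "content": {"hostage", "police"}},
    {"len": 4, "V": 0.8, "content": {"hostage"}},
]
score = {"police": 0.6, "killed": 0.9, "hostage": 0.7}
L, gamma_v = 10, 1.0

best, best_val = None, float("-inf")
for x in product([0, 1], repeat=len(tweets)):        # all 2^n selections
    chosen = [t for t, xi in zip(tweets, x) if xi]
    if sum(t["len"] for t in chosen) > L:
        continue                                     # length constraint
    covered = set().union(*(t["content"] for t in chosen)) if chosen else set()
    val = (gamma_v * sum(t["V"] for t in chosen)     # verification term
           + sum(score[w] for w in covered))         # content coverage term
    if val > best_val:
        best, best_val = x, val
```

The optimum drops the low-verification middle tweet even though it fits alongside the third one: the two high-V tweets cover all three content words within the budget.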

Class-Regularized Verified Summaries
In §4.1.3, we discovered four content classes of tweets posted during man-made disasters -Affected Individuals, Investigations, Affected Regions, and Event-Specific. We may utilize the distribution of verified and unverified information over these content classes for improving the quality of the summaries. We devise a class-based verification-requirement regularizer which takes the class-level insights into account in the summary generation process. Let c_i denote content class i, and let N^V_{c_i} & N^U_{c_i} denote the number of verified & unverified tweets in content class i. Then, we compute the verification probability of content class i, α_{c_i}, as:

α_{c_i} = N^V_{c_i} / (N^V_{c_i} + N^U_{c_i})

Next, we compute the class-level verification requirement quotient, β_{c_i}, as:

β_{c_i} = min_j (α_{c_j}) / α_{c_i}

where the min is taken over the 4 content classes. The verification requirement quotient is inversely proportional to the amount of verified content in each content class; its value will be high for classes whose tweets are more prone to being unverified. We use this requirement quotient, β_{c_j}, as a regularizer for γ_v; the regularizer gives data-driven control over γ_v. For the classes where the number of verified tweets is high, the value of β_{c_i} will be low and thus will decrease the value of γ_v. Similarly, for the classes where the number of unverified tweets is higher, β_{c_i} will be high, thus increasing the value of γ_v (and consequently increasing the verified content proportion in the final summary). Using β_{c_i}, we modify the objective function of summary generation (Equation 8) as follows:

max: Σ_{i=1}^{n} γ_v · β_{c_j} · V(i) · x_i + Σ_{j=1}^{m} Score(j) · y_j    (10)

where c_j is the content class of tweet i. The constraints remain the same as in Equation 9. We call this class-regularized verified summarization system VERISUMM++. We infer the α_{c_i} and β_{c_i} values for the four content classes in §7.2.
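The two quotients reduce to a few lines once class-wise counts are available. A minimal sketch, with invented counts (the actual values per event are inferred in §7.2); β here follows our reading of the text, normalizing each class's α against the minimum α:

```python
# class: (verified count, unverified count) - illustrative numbers only
counts = {
    "Affected Individuals": (80, 20),
    "Investigations":       (40, 60),
    "Affected Regions":     (70, 30),
    "Event-Specific":       (50, 50),
}

# alpha: verified fraction per content class
alpha = {c: v / (v + u) for c, (v, u) in counts.items()}

# beta: verification requirement quotient, inversely proportional to alpha
a_min = min(alpha.values())
beta = {c: a_min / a for c, a in alpha.items()}
```

With these toy counts, the least-verified class (Investigations) gets β = 1.0, so γ_v is applied at full strength there, while verified-rich classes see γ_v scaled down.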

EVALUATING VERIFICATION MODULE
In this section, we evaluate the performance of our Tree-LSTM based tweet verification module. We also present certain statistics about different classes which we obtain using CETM.

Verified Tweet Detection Task
For training the Tree-LSTM model, we use a 128-dimensional single hidden layer at each LSTM unit, learning_rate = 0.05, and batch_size = 50. We train the model for 500 epochs. We compare our model against three state-of-the-art models for verified-tweet / fake-news detection: (i) CRF [49]: employs Conditional Random Fields to learn from sequential tweet-user representations (word2vec for tweets & static Twitter attributes for users); (ii) RNN [25]: uses tweet clustering followed by an RNN, with tf-idf tweet representations; and (iii) CSI [33]: integrates temporal patterns of tweets and user representations computed from the user engagement graph. Moreover, we also evaluate the utility of the different components of our model using the following variants: (i) LDA-TL: LDA [2] instead of CETM for tweet representations & Tree-LSTMs for modeling replies; (ii) CETM-RNN: tweet representations using CETM but a vanilla RNN for modeling replies; (iii) CETM-TL: tweet representations using CETM & Tree-LSTMs for modeling replies; (iv) CETM-RS-TL: user regularity scores along with CETM tweet representations as input features & Tree-LSTMs for modeling replies. Table 4 shows the accuracy and F1-score values obtained by the baselines and our models on the four events. Each dataset is tested by training on all the other datasets, emulating the real-world scenario where supervision for the ongoing disaster is limited. CETM-based variants perform better than all the baselines (5-13% in terms of accuracy and 3-13% in terms of F1-score); they also perform better than the fake-news detection systems (similar gains). RNN performs better than CSI as it also uses tweet clustering as part of its model. LDA-TL performs significantly worse than all our remaining variants and a few baselines, showing that the Tree-LSTM alone cannot model the tweet verification task and needs rich, tuned tweet representations as input.
This is expected, as we do not use a large dataset for training, which hinders the automated learning of fruitful hidden representations. CETM-TL performs considerably better than CETM-RNN, indicating that modeling the inherent tree-like structure formed by replies is important. CETM-RS-TL performs the best in most cases, which shows the utility of a user's pre-disaster behavior in the tweet verification task.

Time Needed for Efficient Detection
Detection of unverified information at an early stage is very important, especially during disasters, so as to prevent its rapid propagation. As our system relies on replies, we may need to wait a significant amount of time before efficient detection is possible. We find that, for most of the tweets in our dataset, 60% of the replies are posted within 1 hour and 90% within 5 hours of the source tweet. This is reasonable considering that most disaster applications, including summarization, take snapshots of the Twitter stream at intervals of one hour [35]. We reassert the same by testing our system after setting a deadline on the detection algorithm, where all the replies to the source tweet subsequent to the deadline are considered unavailable. Table 5 shows the average F1-score of our system variants with increasing deadlines. CRF and RNN have no mechanism to handle the replies, and hence have the same performance throughout. Performance is marginally lower at a deadline of 0 hours, i.e., when no replies are available. CETM-RS-TL has magnified gains over the other variants at a deadline of 0 hours as compared to its gains at higher deadline values. This indicates that users' pre-disaster behavior is critical when other signals (such as replies) are missing.
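The deadline experiment amounts to truncating each reply tree by timestamp. A minimal sketch (the `(timestamp, text)` reply schema is an assumption for illustration):

```python
from datetime import datetime, timedelta

def replies_before_deadline(source_time, replies, deadline_hours):
    """Keep only replies posted within `deadline_hours` of the source tweet;
    replies after the deadline are treated as unavailable.
    `replies` is a list of (timestamp, text) pairs (illustrative schema)."""
    if deadline_hours == 0:
        return []  # deadline of 0 hours: no replies available at all
    cutoff = source_time + timedelta(hours=deadline_hours)
    return [(t, txt) for (t, txt) in replies if t <= cutoff]

# Toy check: replies at 10, 45, 90, and 400 minutes after the source tweet.
src = datetime(2019, 4, 21, 9, 0)
replies = [(src + timedelta(minutes=m), "reply") for m in (10, 45, 90, 400)]
assert len(replies_before_deadline(src, replies, 1)) == 2
assert len(replies_before_deadline(src, replies, 5)) == 3
assert replies_before_deadline(src, replies, 0) == []
```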
The performance of all the variants reaches near-saturation at a deadline of 1 hour; some local variations remain, which we attribute to noise in the replies.

Content-Class Identification: Distribution of Verified and Unverified Tweets
Using CETM, the tweets can be classified into the four content-classes; we use the following approach for identifying them. We manually mark each content-topic with a content-class (CETM identifies 30 content topics), and the content-class of a tweet is identified by its most probable content-topic. This classification of tweets into content-classes helps us analyze the distribution of verified and unverified tweets over these classes. Table 6 shows the class-wise distribution. The observations from Table 6 indicate that the type of content helps in determining the likelihood of tweet verification. These observations are also used to generate high-quality summaries, which we discuss in §7.2.
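The class-assignment step reduces to an argmax over a tweet's content-topic distribution followed by a manual topic-to-class lookup. A minimal sketch (the toy 4-topic mapping is illustrative; the actual model has 30 content topics):

```python
def content_class(topic_dist, topic_to_class):
    """Assign a tweet to a content-class via its most probable content-topic.
    topic_dist: the tweet's distribution over content topics (from CETM).
    topic_to_class: the manual topic -> content-class marking."""
    top_topic = max(range(len(topic_dist)), key=lambda k: topic_dist[k])
    return topic_to_class[top_topic]

# Toy example with 4 topics already marked with the four content-classes.
mapping = {0: "Affected Individuals", 1: "Affected Regions",
           2: "Investigations", 3: "Event-Specific"}
assert content_class([0.1, 0.2, 0.6, 0.1], mapping) == "Investigations"
assert content_class([0.7, 0.1, 0.1, 0.1], mapping) == "Affected Individuals"
```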

Analyzing the Errors
We study the errors committed by our verification module. We find some interesting patterns in how the error varies with content-classes (last row of Table 6). The error tendency is higher for the Affected Regions & Event-Specific classes and lower for the Affected Individuals & Investigations classes. Most of the tweets belonging to Affected Individuals report disaster statistics (# of victims); the low error rate in this class suggests that our model is robust to small variations in reported numbers (e.g., 50 v/s 51 injured). A large number of speculations revolve around Investigations (43.5% in our dataset); a random sample indicates that speculations in Investigations receive more denials than those in other classes, which helps our model in detecting them. The tweets belonging to Affected Regions & Event-Specific contain a lot of proper nouns (e.g., hospital names), which might be one of the reasons for the poorer performance; we will work on course-correction for these classes based on future data.

EVALUATING SUMMARIZATION SYSTEM
We next describe the summarization baselines, evaluation metrics and discuss the performance of VERISUMM & VERISUMM++.

Performance of VERISUMM
7.1.1 Gold standard summaries. We create a gold-standard summary of 250 words for each event. We employ 3 volunteers working in the domain of disaster management. They individually prepare extractive summaries of the events. To generate the gold-standard summary from the 3 summaries, we first include the tweets included in the summaries of all 3 volunteers, followed by the ones included by at least 2, until we achieve a summary of 250 words.
7.1.2 Quality of the summaries generated. We use the following three baselines for the summarization task: (i) APSAL [19]: summarization using sentence-salience prediction and an affinity-propagation-based clustering approach, (ii) TSum4act [29]: summarization of actionable and informative tweets, (iii) COWTS [35]: ILP-based summarization maximizing content words. All three baselines are disaster-specific. For each event, we generate summaries of length 250 words using our models as well as the baselines. We evaluate the quality of the generated summaries based on the three criteria described below: (1) Verified content proportion: We first compute the proportion of verified tweets in the summaries generated by VERISUMM (our model) at different γ v values (1, 2, 5, & 10) (refer to Eq. 8) and compare it against the 3 baselines. Table 7 shows the variation of the verified proportion for all the datasets. VERISUMM consistently generates summaries which contain significantly more verified content than the baselines (12%-48% gain over the best-performing baseline). The verified content proportion increases on using higher γ v values.
(2) ROUGE-1 w.r.t. Ground-Truth summaries: We use the standard ROUGE [23] metric to measure the overlap of the summaries generated by the respective models with the ground-truth summaries (both 250 words). Due to the informal nature of tweets, we measure the F-score of only the ROUGE-1 variant. We compare the scores obtained by VERISUMM (for different γ v values) with the baselines. Table 7 contains the ROUGE-1 F-scores of the different models. We obtain significant gains over the best-performing baselines (7.6%-13.5%). The gains decrease for higher values of γ v, with γ v = 10 performing slightly worse than most baselines.
(3) Richness of the summaries: Finally, we check if the verification framework has any effect on the richness of the generated summary. We compute richness as the ratio of the number of content words to the total number of words in the summary (where content words are as defined in §4.1). Table 7 contains the richness values for the variants of VERISUMM as well as the baselines. We achieve richness values on par with the baselines (-1.9% to 6.3% gain).
7.1.3 Discussion on the results. The above results clearly indicate that VERISUMM outperforms the current state-of-the-art disaster-specific summarization systems in terms of generating verified content. Furthermore, at the same time, it is able to maintain high scores on the other important quality measures (richness & ROUGE score). The hyperparameter γ v acts as a trade-off between these three quality metrics (a high γ v implies higher verified content but lower richness & ROUGE scores). We observe that a balance between these three metrics can be maintained by choosing γ v close to 5, at which VERISUMM generates summaries which are highly verified and superior to the state-of-the-art in terms of richness & ROUGE scores. In §7.2, we take insights from the discovered content-classes in order to improve ROUGE-1 scores & richness while maintaining similar verified proportions.
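The role of γ v can be seen on a toy instance. The actual system solves an ILP (Eq. 8); the following brute-force stand-in over invented scores only illustrates how γ v trades content-word richness against verified content, and is not the paper's formulation.

```python
from itertools import combinations

def summarize(tweets, gamma_v, budget):
    """Brute-force stand-in for an ILP-style selection: pick the subset of
    tweets maximizing (content score + gamma_v * verification probability)
    subject to a word budget. `tweets` holds (word_count, content_score,
    verif_prob) triples (illustrative schema)."""
    best, best_val = (), float("-inf")
    for r in range(len(tweets) + 1):
        for subset in combinations(range(len(tweets)), r):
            if sum(tweets[i][0] for i in subset) > budget:
                continue  # exceeds the summary length budget
            val = sum(tweets[i][1] + gamma_v * tweets[i][2] for i in subset)
            if val > best_val:
                best, best_val = subset, val
    return set(best)

# Toy instance: (word count, content score, verification probability).
tweets = [(10, 5.0, 0.2), (10, 3.0, 0.9), (10, 4.0, 0.8)]
assert summarize(tweets, gamma_v=0, budget=20) == {0, 2}   # richest content
assert summarize(tweets, gamma_v=10, budget=20) == {1, 2}  # most verified
```

With γ v = 0 the selection maximizes content alone; raising γ v swaps the rich-but-unverified tweet for a verified one, mirroring the trade-off observed in Table 7.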

Improvements using Class-Level Insights: Evaluating VERISUMM++
We now evaluate the effect of incorporating class-level insights, as described in §5.3, on the summaries. We infer the α_{c_i} and β_{c_i} values for the four content classes using the distribution of verified and unverified content presented in Table 6. We report the results of VERISUMM++ (refer Eq. (5)) in Table 7, which shows the verified proportion, ROUGE-1, and richness values. VERISUMM++ in almost all cases improves the verified proportion, but more importantly helps us attain improved ROUGE-1 scores (0%-8% gain) and richness values (0%-3% gain).

Eventual Verified Proportion
Table 1 reports the proportion of verified tweets with respect to the gold-standard dataset; the dataset, however, is limited by the time of its generation, as the unverified tweets might have become verified over time. Hence, we manually tried to re-label the unverified tweets. One way [27] of figuring out the current labels is by using rumor-debunking sites such as snopes.com. However, their coverage of disaster-related tweets is not high, especially for the ones used in our study. Hence, we explore a list of 10 news sources (BBC, New York Times, CNN, Guardian, The Washington Post, ABC News, Sky News, Fox News, 9News for Sydney, & CBC for Ottawa) to label the unverified tweets. We collect articles, related to the unverified information, posted by any one of the 10 news sources. If at least one article confirms the information, by either reporting first-hand experiences of users or by verifying it after having reported it as unconfirmed, we change the label of the tweet to verified. Note that small variations in reported numbers are ignored (e.g., 11 casualties v/s 12 casualties). However, if any of the articles refute the posted information, we do not change the label. Also, if all the articles put forward their claims regarding the information only tentatively ('unconfirmed', 'not verified', 'as per unknown sources'), the label is not changed.
After getting the eventual labels of the unverified tweets in the summaries, we can compute the eventual verified proportions. The eventual verified proportion of a summary is the number of tweets in the summary that were verified (per the gold standard), plus the tweets that were initially unverified but got re-labeled as verified, divided by the total number of tweets in the summary. Table 8 reports the eventual verified proportions of our model variants and the baselines. We perform 5%-21% better than the baselines in eventual verified proportion. More significantly, we find that VERISUMM (1) outperforms the closest baseline (COWTS) in the proportion of eventually verified tweets, although it was behind when results were computed only on the gold-standard data (see the third and fourth rows of Table 7). This implies that even though a significant proportion of the summary tweets generated by some variants of VERISUMM (1) may be unverified as per the initial analysis (done during or just after the disaster), the probability of those tweets being ultimately found authentic is higher for VERISUMM than for COWTS. Moreover, the proportions for γ v = 5 and γ v = 10 are statistically indistinguishable. This reiterates that the summarization model performs best for γ v close to 5.
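The eventual verified proportion defined above can be sketched directly (the per-tweet tag names are illustrative):

```python
def eventual_verified_proportion(summary_tags):
    """Eventual verified proportion of a summary: tweets verified in the gold
    standard, plus initially-unverified tweets later re-labeled as verified,
    over the total number of tweets. Each tweet carries one of the
    illustrative tags 'verified', 'relabeled', or 'unverified'."""
    good = sum(1 for tag in summary_tags if tag in ("verified", "relabeled"))
    return good / len(summary_tags)

# A 4-tweet summary with one re-labeled and one still-unverified tweet.
tags = ["verified", "verified", "relabeled", "unverified"]
assert eventual_verified_proportion(tags) == 0.75
```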

Representative Summaries
We summarize the difference in output between our final system, VERISUMM++, and the most competitive baseline, COWTS, by illustrating (part of) the summaries produced by them. In general, VERISUMM++ captures tweets having high verification scores compared to COWTS. Specifically, we observe the following four patterns in these summaries, as highlighted in Figure 4: (T1) a number of verified tweets are captured by both VERISUMM++ and COWTS, (T2) some unverified tweets are retrieved by both systems, (T3) there are unverified tweets present in the summary of COWTS which are not shortlisted by VERISUMM++; they are replaced by suitable verified tweets in VERISUMM++, (T4) some tweets that are initially unverified in the VERISUMM++ summary eventually get verified (when re-annotated as per §7.3); most of the unverified tweets in the COWTS summary continue to stay unverified.
Case Study of the 2019 Sri Lankan Easter Attacks: To further analyze the robustness of our verification-summarization framework, we present an interesting case study of the recent Sri Lankan attacks. Immediately following the attacks, a Sri Lankan minister tweeted that a foreign intelligence report predicting the attacks had been communicated to some officials a few days before the attack. Some people on Twitter questioned the authenticity of this tweet, while others started speculating about the names of Sri Lankan officials who were aware/unaware of this report; the names included the President & the Prime Minister. Both the original tweet by the minister and the subsequent speculative tweets were initially unverified, as there was no supporting data to authenticate them. Two days later, both the President & the PM denied being informed about the report but confirmed that the report was known to a few security officers. This meant that the basic content of the minister's tweet was true (unverified initially, eventually true) but most of the subsequent tweets were false; the ideal summary would not include the latter.
We generate and analyze summaries of the 2019 Sri Lankan attacks using VERISUMM++ & COWTS. The unverified information related to the speculations around the intelligence report is part of COWTS's summary but not of VERISUMM++'s.

CONCLUSION & FUTURE CHALLENGES
To the best of our knowledge, this is the first work on generating verified summaries of tweet streams during disasters. The simple but novel content-expression topic model (CETM), which simultaneously incorporates a tweet's content and its mode of expression for creating tweet representations, is at the core of the innovation.
In the process, we discovered four content classes of information posted during man-made disasters. The tweet representations and pre-disaster user behavior (regularity scores) were used to train a Tree-LSTM model with the objective of inferring tweet-verification probabilities. The verification scores, along with the information content and class information of the tweets, were used in an ILP framework for generating the desired verified summaries. As expected, our summaries contained an exceptionally high proportion of verified information; more interestingly, the summaries also had better ROUGE-1 scores and richness than the state-of-the-art. Also, the proportion of eventually verified tweets included in the summary (tweets that could not be verified at the time of usage) is much higher than for competing techniques. We believe the technique developed in this paper has wider implications and usage. It can potentially be used during natural disasters and epidemics, and can be personalized according to stakeholders' requirements; we will explore these possibilities as immediate future work.