Vector Space Modeling Based Evaluation of Automatically Generated Text Summaries

Evaluating automatically generated summaries is not an effortless task. Despite the significant advances made in this context during the last two decades, it remains a challenging research problem. In this paper, we present VSMbM, a new metric for the evaluation of automatically generated text summaries. VSMbM is based on vector space modelling. It gives insights into the extent to which retention and fidelity are met in the generated summaries. Three variants of the proposed metric, namely PCA-VSMbM, ISOMAP-VSMbM and tSNE-VSMbM, are tested and compared to Recall-Oriented Understudy for Gisting Evaluation (ROUGE), a standard metric used to evaluate automatically generated summaries. Experiments conducted on the Timeline17 dataset show that VSMbM scores are highly correlated to the state-of-the-art ROUGE ones.


INTRODUCTION

Automatic Text Summarization
Abstracts are commonplace, and their use has been adopted in the daily running of affairs. According to [1], paper abstracts, book reviews, headlines on TV news, movie trailers and shopping guidelines on online stores are some examples of the summaries that we interact with on a daily basis. A summary has commonly been defined as 'a text produced from one or more texts with an intention of passing on key information from the original script and is usually less than the original version' [2]. Notwithstanding the use of the word 'text', summaries also apply to other forms of media, including audio, hypertext and video. The special case of automatic text summarization (ATS) refers to the process of creating a short, accurate, and fluent summary from a longer source text [3].
Following developments in technology, huge amounts of text resources are available at anyone's discretion. This calls for automatic text summarization, so that users can access only the relevant information they are looking for. [4] argues that automatic summarization still has issues worth addressing despite having been around for more than five decades, and identifies six main justifications for why we need automatic text summarization. The first reason is that summaries reduce the amount of time that one would have spent reading a longer document; they make it possible to consume content faster and more effectively. Second, automatic summaries make the selection process easier when researching documents. Automatic summarization can also make the process of indexing text more effective. Next, these approaches make it possible to prepare summaries that are less biased than those prepared by humans. Furthermore, summaries generated automatically can contain personalized information, which can be a useful addition to question-answering systems. Lastly, we need these processes to increase the number of texts that can be processed by commercial abstract services.
There is no single scientific classification or arrangement of summary types. Indeed, the arrangement of summary types changes depending on the angle of perception. [1] introduced nine parameters to characterize the various classifications of summaries. One is a parameter based on the relationship to the source: in this case, a summary can be considered as either an extract or an abstract. Extractive summarization implies that the most significant parts of the original text are combined together without any modification to the selected text. On the other hand, abstractive summarization implies that the significant issues in the original text are paraphrased and presented in a grammatical way, to produce a summary that is more coherent. Additionally, considering the readership parameter, the summarization process can lead to the production of generic summaries, if it depends only on the original documents, or of query-driven summaries, which focus on getting information that is related to a query. Then there is a span parameter that categorizes the summarization process into single-document or multi-document summarization. Language is one parameter that is considered very important. It is divided into monolingual summarization, which summarizes documents presented in one language, and multilingual or cross-lingual summarization, which summarizes texts presented in more than one language.
[5] have pointed out key challenges associated with the evaluation of automatically generated summaries, which remains an open subject in text summarization. In the next two sections, we give a short state of the art of the most relevant evaluation protocols proposed for automatic text summarization, and we present the key features which make the originality of our work.

Related Work
Evaluating automatically generated summaries is not an effortless task. In the last two decades, significant advances have been made in this research field, and various evaluation measures have been proposed. SUMMAC [6], DUC (Document Understanding Conference) [7] and TAC (Text Analysis Conference) [8] are the main evaluation campaigns led since 1996. Note that the evaluation process can be led either in reference to some ideal models or without reference [9]. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the most used metric for the evaluation of automatically generated abstracts. Summaries are compared to a reference or a set of references (human-produced summaries) [10]. Note that there are several variants of the ROUGE metric: 1) ROUGE-N [11]: it captures the overlap of N-grams between the system and reference summaries, 2) ROUGE-L [12]: it gives statistics about the Longest Common Subsequence (LCS), 3) ROUGE-W: a set of weighted LCS-based statistics that favors consecutive LCSes, and 4) ROUGE-S [10]: a set of skip-bigram (any pair of words in their sentence order) based co-occurrence statistics. COVERAGE is another metric which has been used in DUC evaluations; it gives an idea of the extent to which a peer summary conveys the same information as a model summary [14]. RESPONSIVENESS has also been used in the focus-based summarization tasks of the DUC and TAC evaluation campaigns [14]; it ranks summaries on a 5-point scale indicating how well the summary satisfies a set of needed information criteria. The pyramid evaluation approach uses Summarization Content Units (SCUs) to calculate a set of weighted scores [15]: a summary containing units with higher weights will be assigned a high pyramid score, and an SCU has a higher weight if it appears frequently in human-generated summaries. FRESA is another metric [16]; it is the state-of-the-art technique for evaluating automatically generated summaries without using a set of human-produced reference summaries, and it computes a variety of divergences among probability distributions. Recently, [17] proposed a new implementation of the ROUGE protocol without human-built model summaries. The new summary evaluation model (ASHuR) extracts the most informative sentences of the original text based on a set of criteria: the frequency of concepts, the presence of cue-words, sentence length, etc. The extracted set of sentences is then considered as the model summary. [18] gives an overview of challenging issues related to summary evaluation.
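To make the N-gram overlap idea concrete, the following minimal Python sketch computes ROUGE-N recall for a candidate summary against a single reference. The function names and toy sentences are purely illustrative; full ROUGE implementations add stemming, stop-word handling and multi-reference support.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Example: unigram overlap between a system summary and one reference
print(rouge_n_recall("the oil spill polluted the gulf",
                     "the oil spill severely polluted the gulf coast"))  # 0.75
```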

Originality of our work
Automatically generated summaries should satisfy three criteria: 1) Retention: a measure of how much the generated summary reports the salient topics present in the original text; 2) Fidelity: does the summary accurately reflect the author's point of view? and 3) Coherence: to which extent is the generated extract semantically meaningful? Most of the metrics described in the above subsection only focus on the overlap of N-grams between the original text and the generated summary. In other words, they reflect the coverage ratio, but they give no insights into the extent to which fidelity is met. For example, if a long source text contains six concepts and a given summary focuses on the last four most important ones, it will be assigned a higher score than another summary focusing on the two most important concepts present in the original text. In this case retention is met; however, this is not the case for the fidelity criterion.
In this paper, we present a new vector space modelling based metric for the evaluation of automatic text summaries. The proposed protocol gives insights into the extent to which both retention and fidelity are met. We assume that fidelity is met if we assign higher weights to text units related to the most important concepts reported in the original text. The next section describes the technical and mathematical details of the proposed metric. The third one describes the conducted experiments and the obtained results. Conclusions and future work are presented in the fourth section.

VECTOR SPACE MODELLING BASED METRIC (VSMBM) FOR THE EVALUATION OF AUTOMATICALLY GENERATED TEXT SUMMARIES
From a computational point of view, the main idea is to project the original text onto a lower-dimensional space that captures the essence of the concepts present in it. Unitary vectors of the latter space are used to compute the three variants of the proposed VSMbM metric, namely PCA-VSMbM, ISOMAP-VSMbM and tSNE-VSMbM. Mathematical and implementation details of the proposed metric are expanded in the coming three subsections.

The PCA-VSMbM
First, the source text is segmented into m sentences. Then a dictionary of all nouns is constructed and filtered in order to remove all generic nouns. The text is then represented by an m × z matrix, where m is the number of sentences and z is the number of unique tokens. Next, the conceptual space is constructed; it will be used later to compute the PCA-VSMbM metric.
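As an illustration of this preprocessing step, the sketch below builds the m × z sentence/term matrix with scikit-learn. A naive split on full stops stands in for a proper sentence segmenter, and English stop-word removal approximates the paper's filtering of generic nouns; the toy text is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def sentence_term_matrix(text):
    """Segment a source text and build the m x z tf-idf matrix described above.

    A naive split on full stops stands in for a real sentence segmenter, and
    English stop-word removal approximates the filtering of generic nouns.
    """
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    vectorizer = TfidfVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(sentences).toarray()  # shape: (m, z)
    return sentences, matrix

sentences, M = sentence_term_matrix(
    "The oil spill polluted the gulf. Cleanup crews worked for months. "
    "BP faced heavy fines."
)
print(M.shape)  # (m sentences, z unique tokens)
```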

Construction of the conceptual space
Each sentence $s_i$ is represented by a column vector $\zeta_i$ of $z$ components, where each component is the tf-idf weight of a given word. Afterwards, the mean concept vector is computed as follows:

$$\bar{\zeta} = \frac{1}{m}\sum_{i=1}^{m} \zeta_i \qquad (1)$$

Note that each $\zeta_i$ should be normalized to get rid of redundant information. This is performed by subtracting the mean concept:

$$\phi_i = \zeta_i - \bar{\zeta} \qquad (2)$$

In the next step, the covariance matrix is computed as follows:

$$C = \frac{1}{m}\sum_{i=1}^{m} \phi_i \phi_i^{T} = AA^{T} \qquad (3)$$

where $A = [\phi_1, \dots, \phi_m]$. Note that $C$ in (3) is a $z \times z$ matrix and $A$ is a $z \times m$ matrix. The eigenconcepts are the eigenvectors of the covariance matrix. They are obtained by performing a singular value decomposition of $A$:

$$A = U \Sigma V^{T} \qquad (4)$$

where the dimensions of the matrices $U$, $\Sigma$ and $V$ are respectively $z \times z$, $z \times m$ and $m \times m$. Also, $U$ and $V$ are orthogonal ($U^{T}U = UU^{T} = I_z$ and $V^{T}V = VV^{T} = I_m$). In addition, the columns of $U$ are eigenvectors of $AA^{T}$, the columns of $V$ are eigenvectors of $A^{T}A$, and the squares of the singular values $\sigma_i$ of $A$ are the eigenvalues $\lambda_i$ of $AA^{T}$ and $A^{T}A$. Note that $m < z$, so the eigenvalues $\lambda_i$ of $AA^{T}$ are equal to zero when $i > m$ and their associated eigenvectors are not necessary. Matrices $U$ and $\Sigma$ can therefore be truncated, and the dimensions of $U$, $\Sigma$ and $V$ in (4) become respectively $z \times m$, $m \times m$ and $m \times m$. Next, the conceptual space is constructed from the $k$ eigenvectors associated with the $k$ highest eigenvalues:

$$\Xi_k = [u_1, \dots, u_k] \qquad (5)$$

Each sentence projected onto the conceptual space is represented as a linear combination of the $k$ eigenconcepts:

$$\Theta_i = \Xi_k^{T} \phi_i \qquad (6)$$

$\Theta_i$ is a vector providing the coordinates of the projected sentence in the conceptual space.
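A compact NumPy sketch of equations (1)-(6) follows. The variable names mirror the notation above; the choice of $k$ is left to the caller, and this is a sketch of the construction rather than the authors' exact implementation.

```python
import numpy as np

def conceptual_space(M, k):
    """Build the PCA conceptual space from an m x z sentence/tf-idf matrix M.

    Returns the k eigenconcepts Xi (equation 5), the projected sentences
    Theta (equation 6) and the mean concept, following equations (1)-(6).
    """
    mean_concept = M.mean(axis=0)                     # equation (1)
    A = (M - mean_concept).T                          # z x m matrix of phi_i, eq. (2)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # truncated SVD, eq. (4)
    Xi = U[:, :k]                                     # conceptual space, eq. (5)
    Theta = Xi.T @ A                                  # k x m projected sentences, eq. (6)
    return Xi, Theta, mean_concept
```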

Computation of the PCA-VSMbM score
The goal here is to find out to which extent the sentences selected to be part of the generated summary express the main concepts of the original text. Thus, each vector $\zeta_j$ representing a given sentence $s_j$ is normalized by subtracting the mean concept: $\phi_j = \zeta_j - \bar{\zeta}$. Then it is projected onto the newly constructed conceptual space:

$$\Theta_j = \Xi_k^{T} \phi_j \qquad (7)$$

Next, the Euclidean distance between a given concept $c$ and any projected sentence is defined and computed as follows:

$$d_c(\Theta_j) = \lVert \Theta_c - \Theta_j \rVert \qquad (8)$$

Next, the Retention-Fidelity matrix is constructed as follows. First, we fix a window size $w$ (set to 4 in the following example). The first line of the matrix gives the indices of the four sentences having the smallest distances to the vector encoding the first most important concept. The second line gives the same information for the second most important concept, and so on. Also, the order of a given sentence in each window depends on its distance to the given concept. For instance, the first sentence is the best one to encode the first most important concept, while the 8th sentence is the last best one to encode the same concept in a window of four sentences.
Next, the Retention score of each sentence projected in the conceptual space is defined as follows: it is equal to the number of times the sentence occurs in a window of size $w$ when taking into consideration the $k$ most important concepts. The main intuition behind it is that a given sentence having a high Retention score should encode as many as possible of the $k$ most important concepts expressed in the original text:

$$R(s_j) = \sum_{i=1}^{k} o_{ij} \qquad (9)$$
where $o_{ij} = 1$ if the sentence $s_j$ occurs in the $i$-th window, and 0 otherwise. Now, the PCA-VSMbM score is defined, as shown in the tenth equation, as the averaged sum of the retention coefficients of the summary sentences. Note that every retention coefficient is weighted according to the sentence's position in a given window of size $w$:

$$\text{PCA-VSMbM} = \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{k} \omega(r_{ij})\, o_{ij} \qquad (10)$$

where $n$ is the number of sentences extracted to construct the summary, $o_{ij} = 1$ if sentence $s_j$ occurs in the $i$-th window and 0 otherwise, $r_{ij}$ is the rank of $s_j$ in the $i$-th window, and $\omega(\cdot)$ is a weighting function that decreases with the rank. The main intuition behind this is that the single units (sentences) of a given summary whose PCA-VSMbM score is high should encode the most important concepts expressed in the original text. So, they should have minimal distances $d_c(\Theta_j) = \lVert \Theta_c - \Theta_j \rVert$ in equation 8. In other words, the PCA-VSMbM score gives insights into the extent to which the extracted sentences encode the concepts present in the original text, while taking into consideration the importance degree of each concept.
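The sketch below assembles the per-concept windows of equation (8) and a VSMbM-style score. Since the exact weighting function $\omega$ is not fully specified above, a reciprocal-rank weight is assumed purely for illustration, and the concepts are taken to be the unitary vectors of the conceptual space.

```python
import numpy as np

def vsmbm_score(Theta, summary_idx, k, w):
    """Hedged sketch of the VSMbM score of equation (10).

    Theta: k x m matrix of sentences projected on the conceptual space
    (rows assumed sorted by decreasing eigenvalue, i.e. concept importance).
    summary_idx: indices of the n sentences selected for the summary.
    For each of the k most important concepts, the window keeps the w closest
    sentences; a reciprocal-rank weight stands in for the unspecified omega.
    """
    score = 0.0
    for i in range(k):
        concept = np.zeros(Theta.shape[0])
        concept[i] = 1.0                        # unitary vector of concept i
        dists = np.linalg.norm(Theta - concept[:, None], axis=0)  # eq. (8)
        window = np.argsort(dists)[:w]          # i-th line of the R-F matrix
        for j in summary_idx:
            hits = np.where(window == j)[0]
            if hits.size:                       # o_ij = 1, rank r_ij = hits[0] + 1
                score += 1.0 / (hits[0] + 1)
    return score / len(summary_idx)
```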

The ISOMAP-VSMbM
In ISOMAP-VSMbM, we rather use the geodesic distance. The ISOMAP-VSMbM approach consists in constructing a k-nearest-neighbor graph on the $m$ data points, each one representing a sentence in the original space. Then, we compute the shortest path between all pairs of points as an estimation of the geodesic distances $d_{ij}$. Finally, we compute the decomposition of

$$\tau = -\frac{1}{2}\, H D^{(2)} H$$

in order to construct the conceptual space $\Xi_k$ previously defined in equation 5, where $D^{(2)}$ is the matrix of squared geodesic distances, $H = I_m - \frac{1}{m} e e^{T}$ is a centering matrix, and $e = [1, 1, \dots, 1]^{T}$ is an $m \times 1$ matrix. Note that the decomposition of $\tau$ is not always possible, in the sense that there is no guarantee that $\tau$ is a positive semidefinite matrix. We deal with this case by finding the closest positive semidefinite matrix to $\tau$, which we then decompose. Next, we proceed the same way we proceeded previously with PCA-VSMbM. ISOMAP-VSMbM is defined like PCA-VSMbM in equation 10.
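The following sketch illustrates this geodesic variant with scikit-learn and SciPy. It assumes a connected neighborhood graph, and it repairs non-positive-semidefiniteness by clipping negative eigenvalues, which is one simple way of taking the closest positive semidefinite matrix.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap_embedding(M, n_neighbors, k):
    """Geodesic variant of the conceptual space construction (hedged sketch).

    M: m x z matrix of sentence vectors. Builds the k-NN graph, estimates
    geodesic distances via shortest paths, double-centers the squared
    distances (tau = -1/2 H D2 H) and eigendecomposes tau.
    """
    graph = kneighbors_graph(M, n_neighbors, mode='distance')
    D = shortest_path(graph, directed=False)     # geodesic distance estimates
    m = M.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    tau = -0.5 * H @ (D ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(tau)
    eigvals = np.clip(eigvals, 0.0, None)        # enforce positive semidefiniteness
    order = np.argsort(eigvals)[::-1][:k]
    # m x k classical-MDS embedding of the sentences (the projected vectors)
    return eigvecs[:, order] * np.sqrt(eigvals[order])
```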

The tSNE-VSMbM
At the beginning, we proceed the same way as for PCA-VSMbM to construct the set of feature vectors $\zeta_1, \zeta_2, \dots, \zeta_m$ describing the sentences of the text to summarize. Then, we construct a feature matrix whose lines are made up of the $\zeta_i$ feature vectors ($1 \le i \le m$). The columns of the feature matrix are $x_1, x_2, \dots, x_z$, where $x_j$ is a word feature vector and $z$ is the number of unique words used in the text. tSNE-VSMbM first computes probabilities $p_{ij}$ that are proportional to the similarity of words $x_i$ and $x_j$ for $i \ne j$, as follows:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2z}$$

Note that the similarity of word $x_j$ to word $x_i$ is the conditional probability $p_{j|i}$ that word $x_j$ would be among word $x_i$'s neighbors if neighbors were chosen based on their probability density under a Gaussian distribution centered at $x_i$ [19].
Moreover, the probabilities with $i = j$ are set to zero ($p_{ii} = 0$). The bisection approach is used to set the bandwidth $\sigma_i$ of the Gaussian kernels, so that the perplexity of the conditional distribution equals a predefined perplexity:

$$\mathrm{Perp}(P_i) = 2^{-\sum_j p_{j|i} \log_2 p_{j|i}}$$

Therefore, the bandwidth is adapted to the density of the word feature vectors: smaller values $\sigma_i$ of the Gaussian kernels are used in denser parts of the word feature vector space.
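A minimal sketch of this bisection search for a single word vector is given below, assuming the squared distances to all other words are precomputed; the tolerance and iteration cap are illustrative choices.

```python
import numpy as np

def bandwidth_for_perplexity(sq_dists, target_perplexity, tol=1e-5):
    """Bisection search for the Gaussian bandwidth sigma_i of one word vector.

    sq_dists: squared distances from word x_i to every *other* word. Returns
    the sigma whose conditional distribution p_{j|i} matches the target
    perplexity Perp(P_i) = 2^H(P_i).
    """
    lo, hi = 1e-10, 1e10
    for _ in range(100):
        sigma = (lo + hi) / 2.0
        p = np.exp(-sq_dists / (2.0 * sigma ** 2))
        p /= p.sum()
        perplexity = 2.0 ** -(p * np.log2(p + 1e-12)).sum()
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma   # distribution too flat: shrink the bandwidth
        else:
            lo = sigma   # distribution too peaked: widen the bandwidth
    return sigma
```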
Note that the Gaussian kernel is highly sensitive to dimensionality, since it uses the Euclidean distance. This means that the $p_{ij}$ would asymptotically converge to a constant when we deal with long texts; in other words, they become similar. Thus, a power transform based on the intrinsic dimension of each word feature vector is used to adjust the distances [19].
tSNE-VSMbM is based on the t-distributed stochastic neighbor embedding technique to construct the conceptual space of equation 5. The latter approach constructs a $d$-dimensional map $y_1, y_2, \dots, y_z$ (with $y_i \in \mathbb{R}^d$) that reflects the similarities $p_{ij}$ as faithfully as possible, by measuring similarities $q_{ij}$ between two word feature vectors $y_i$ and $y_j$ in the map, for $i \ne j$, as follows:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k} \sum_{l \ne k} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

For $i = j$, $q_{ij} = 0$. In order to allow dissimilar word feature vectors to be modeled far apart in the map, a Cauchy distribution (a Student t-distribution with one degree of freedom) is used to measure similarities between the low-dimensional word feature vectors. Thus, the locations of the word feature vectors $y_i$ in the map are obtained by minimizing the Kullback-Leibler divergence of the distribution $Q$ from the distribution $P$:

$$KL(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The gradient descent approach is used to minimize the above Kullback-Leibler divergence. The result of this optimization is a map that reflects the similarities between the high-dimensional word feature vectors. The constructed $y_i$ vectors are then set as the unitary vectors of the conceptual space $\Xi_k$ of equation 5.
Once the conceptual space is constructed, we proceed the same way we proceeded previously with PCA-VSMbM. tSNE-VSMbM is defined like PCA-VSMbM in equation 10.
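In practice, the map construction described above can be delegated to an off-the-shelf t-SNE implementation. The sketch below uses scikit-learn's TSNE in place of the authors' own implementation, with illustrative default parameters.

```python
from sklearn.manifold import TSNE

def tsne_map(word_vectors, dim=2, perplexity=30.0):
    """Map word feature vectors to a low-dimensional space with t-SNE.

    word_vectors: z x m array, one row per unique word (the columns x_j of
    the feature matrix). The resulting y_i map vectors are then used as the
    unitary vectors of the conceptual space, as described above.
    """
    tsne = TSNE(n_components=dim, perplexity=perplexity, init='random')
    return tsne.fit_transform(word_vectors)  # z x dim map of the words
```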

Dataset
The Timeline17 dataset is used for the experiments [20]. It consists of 17 manually created timelines and their associated news articles. They mainly belong to 9 broad topics: BP Oil Spill, Michael Jackson Death (Dr. Murray Trial), Haiti Earthquake, H1N1 (Influenza), Financial Crisis, Syrian Crisis, Libyan War, Iraq War and Egyptian Protest. The original articles come from news agencies such as BBC, Guardian, CNN, Fox News and NBC News. The contents of these news articles are in plain text file format and have been noise-filtered.

Results and discussion
In order to evaluate the proposed metric, we compute Pearson's correlation between the VSMbM and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores. Note that Pearson's correlation coefficient measures the statistical correlation between two signals. Thus, we assume that all the scores computed with a given evaluation approach constitute a signal. Then, we compare the obtained averaged ROUGE-1 and PCA/ISOMAP/tSNE-VSMbM scores when using both human-made and automatically generated summaries [21] [22]. The results of the experiments described above are reported in Table 1 and Table 2.
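For illustration, this correlation computation can be done with SciPy as sketched below; the score values here are made up and do not come from the paper's experiments.

```python
from scipy.stats import pearsonr

# Hypothetical per-document scores from the two metrics on the same summaries
rouge_scores = [0.41, 0.35, 0.52, 0.47, 0.39]
vsmbm_scores = [0.38, 0.31, 0.55, 0.44, 0.42]

r, p_value = pearsonr(rouge_scores, vsmbm_scores)
print(f"Pearson's r = {r:.3f} (p = {p_value:.3f})")
```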