Extractive multi document summarization using harmony search algorithm

ABSTRACT


INTRODUCTION
Information overload is one of the most common problems arising from the rapid growth of information on the World Wide Web [1]. Text summarization (TS) offers a remedy to this problem. TS, the process of producing a summary from a single document or a collection of documents without losing their main ideas, aims to deliver the useful information in the sources to users [2]. A summary offers a helpful guide for directing attention to information, for deciding whether a document is useful, and for saving users' time [3]. Based on the number of documents to be summarized, TS can be classified as single-document summarization (SDS) or multi-document summarization (MDS). In SDS a single document is condensed into a shorter one, while in MDS a set of related documents on the same topic is condensed into one summary [4]. Although some techniques can be applied to both, MDS is more complicated than SDS because of information overload and a high degree of redundancy. The redundancy occurs because the summarized documents deal with similar topics and share the same ideas. As a result, reducing redundancy can lead to a high-quality summary [5].
A summary can be created by either extraction or abstraction, according to the function to be performed [6]. Extractive summarization selects textual units such as sentences and passages directly from the original documents. Abstractive summarization, in contrast, depends on natural language processing (NLP) techniques: it requires a deep analysis of the documents' sentences and paragraphs, and several changes must be made to the selected sentences. In extractive summarization, no modification is applied to the sentences included in the resulting summary. Therefore, abstractive summarization is more time-consuming and more difficult than extractive summarization [7]. Moreover, summarization can be classified as either generic or query-based. Generic summarization generates a summary that includes the essential content of the documents; its restriction is that no topic or query is available to guide the summarization procedure. In query-based summarization, a summary is created according to the user's query, and the documents are searched for content matching that query [8]. This paper proposes a new model for extractive generic MDS based on the harmony search algorithm (HSA) that improves coverage, diversity, and readability. The model was evaluated experimentally on the TAC-2011 dataset, and the ROUGE package was applied to measure its performance.

RELATED WORKS
Although text summarization has drawn attention mainly since the expansion of information on the Internet, the earliest work dates back to 1958 [9]. Since then, a variety of summarization techniques have been proposed and assessed. For example, some researchers [10, 11] applied sentence clustering to text summarization successfully. The basic idea behind the cluster-based approach for MDS is that sentences with high degrees of similarity are grouped into one cluster, and then one sentence is selected from each cluster to be included in the generated summary. Sentence selection depends on choosing the sentences closest to the centroid of each cluster [12]. Graph-based approaches, which assume that a sentence's importance increases with its similarity to other sentences in the document, are also widely used in MDS. The process begins by representing each sentence as a node in a graph, with the cosine similarity between sentences serving as the edge weight between nodes [13].
PageRank [14] or TextRank [15] is then applied to score the sentences, and sentences with high scores are included in the final summary. Other researchers focus on machine learning approaches, which have been commonly used in the field of TS. These approaches categorize sentences into two classes, summary sentences and non-summary sentences, and require dividing the dataset into labeled training and testing portions. Machine learning approaches applied to TS include naïve Bayes [16], neural networks [17], decision trees [18], and support vector machines [19]. Many researchers have also investigated optimization approaches. Optimization techniques such as differential evolution (DE) [20], particle swarm optimization (PSO) [21], and the genetic algorithm (GA) [22] have been used for TS. These techniques are based on multiple agents in a population that search for candidate solutions, which are treated as points in the search space. In [23], the authors applied a bee colony algorithm to MDS. The bees act as agents searching for nectar in flowers, where each food source represents a candidate solution and there is a single bee for every food source. The objective is for the bees to collect as much food as possible; when a food source is abandoned, its bee becomes a scout and looks for another one. The bees search neighboring areas and select the best candidate. When moving to a neighbor, a sentence is deleted randomly from the current summary and another sentence is added so that the length limit is not violated.

PROPOSED FRAMEWORK
In this paper, a new approach for MDS is proposed. It is composed of four main steps: first, preprocessing is performed; second, the similarity measure is computed; third, the summary quality factors are applied; and finally, harmony search is carried out. These four steps are described as follows.

Preprocessing
There are four steps for preparing the data:
-Sentence segmentation: each document is divided into sentences based on the dot between them.
-Tokenization: the process of separating sentences into terms.
-Stop word removal: removing redundant and frequently repeated terms that do not offer the information required to recognize the important sense of the document.
-Stemming: the process of reducing each word to its root.
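As an illustration, the four preprocessing steps can be sketched in Python. The stop-word list and the suffix-stripping stemmer below are toy stand-ins for illustration only; a real system would use a full stop-word list and an established stemmer such as Porter's:

```python
import re

# Toy stop-word list; illustrative only, not a complete list.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def naive_stem(term):
    # Very rough suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(document):
    # Sentence segmentation: split on the dot between sentences.
    sentences = [s.strip() for s in re.split(r"\.\s*", document) if s.strip()]
    processed = []
    for sentence in sentences:
        # Tokenization: separate each sentence into terms.
        terms = re.findall(r"[a-z]+", sentence.lower())
        # Stop-word removal, then stemming.
        terms = [naive_stem(t) for t in terms if t not in STOP_WORDS]
        processed.append(terms)
    return processed
```

For example, `preprocess("The cats are playing. Dogs played in the garden.")` yields two term lists, one per segmented sentence.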


Extractive multi document summarization using harmony search algorithm (Zuhair Hussein Ali)

Similarity measure
Similarity measures play a significant role in the field of text mining [20]. To compute the similarity between sentences, each sentence must be represented as a vector. The best-known representation scheme for text units is the vector space model (VSM). Let T = {t1, t2, ..., tp} represent the distinct terms that exist in the document collection D, where p is the number of distinct terms in D. In the VSM, every sentence Si is represented using these terms as a vector in p-dimensional space, Si = {wi,1, wi,2, ..., wi,p}, for i = 1 to n, where n is the number of sentences. Each element in the vector represents a term within the given sentence and is assigned a weight using term frequency-inverse sentence frequency (TF-ISF), as shown in (1) [24]:

wi,k = TFi,k × log(n / nk) (1)

where TFi,k is the term frequency, i.e., the number of times term tk appears in sentence Si; n is the number of sentences in D; and nk is the number of sentences in which term tk appears. The weight wi,k of term tk is zero if the term does not appear in sentence Si.
The VSM entails a high-dimensional feature space that affects the performance of TS. Depending on the number of terms in each sentence, the vector dimension p is very large and contains numerous null elements, which is a major disadvantage of VSM. The center of the document collection O can be calculated as the average of the weights wi,k of each term tk over all sentences Si in the document collection, as shown in (2) [25]:

ok = (1/n) Σi=1..n wi,k (2)
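A minimal sketch of the TF-ISF weighting of (1) and the collection center of (2), assuming sentences have already been preprocessed into term lists; the function names are illustrative, not from the paper:

```python
import math
from collections import Counter

def tf_isf_vectors(sentences):
    """Weight each term with TF-ISF as in (1): w_ik = TF_ik * log(n / n_k)."""
    n = len(sentences)  # number of sentences in the collection
    vocab = sorted({t for s in sentences for t in s})
    # n_k: number of sentences in which term t_k appears.
    sent_freq = {t: sum(1 for s in sentences if t in s) for t in vocab}
    vectors = []
    for s in sentences:
        tf = Counter(s)  # TF_ik is zero for absent terms, as required
        vectors.append([tf[t] * math.log(n / sent_freq[t]) for t in vocab])
    return vocab, vectors

def centroid(vectors):
    """Center of the collection as in (2): the average weight per term."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]
```

Terms that occur in every sentence receive weight zero (log 1 = 0), and the vectors are typically sparse, reflecting the dimensionality drawback noted above.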

Summary quality factors
In this section, the important factors for summary quality are presented: coverage, diversity, and readability. Each factor plays an important role in the summarization process. These factors are described below.

Coverage
The goal of TS is to cover the main content of the summarized documents by choosing a subset S ⊆ D that covers as many conceptual sentences as possible. Summary coverage can be calculated by measuring the cosine similarity between the center of the document collection O and each sentence Si, as shown in (3):

cos(Si, O) = (Si · O) / (|Si| |O|) (3)

The similarity between the center of the document collection and each sentence determines the importance of the sentence and whether it is included in the generated summary [26].
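The cosine-based coverage measure can be sketched as follows; `coverage` here simply sums the similarity of each candidate sentence vector to the collection center, which is one plausible reading of (3) rather than the paper's exact formulation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def coverage(summary_vectors, center):
    # Total similarity of the candidate summary's sentences
    # to the center of the document collection.
    return sum(cosine(v, center) for v in summary_vectors)
```

Sentences closer to the center contribute more, so a summary built from central sentences scores higher coverage.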

Diversity
A summary with high diversity among its sentences can be considered a good summary because it avoids the information redundancy that occurs in most summarization models, especially in MDS. Thus, to achieve an adequate summary, the sentences should be highly diverse. Summary diversity is computed from the total pairwise similarity between the summary sentences; a good summary is associated with lower total similarity values, which ensure minimum information redundancy. The formulation for computing sentence diversity is shown in (4) [27].

Readability
Readability is an important factor of a document summary; it indicates that each sentence in the summary is highly related to the next sentence in the summary. The readability Rs of a summary s with length S can be formulated as shown in (5) and (6), respectively [28].
The objective function is to maximize the three factors, coverage, diversity, and readability, as shown in (7).
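Since (4)-(7) are not reproduced in this extract, the following is only a hedged sketch of an objective combining the three factors: coverage as similarity to the center, diversity as a penalty on pairwise similarity, and readability as similarity between adjacent sentences. The weights and the exact combination are assumptions, not the paper's formulas:

```python
import math

def cosine(u, v):
    # Cosine similarity between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fitness(vectors, center, w_cov=1.0, w_div=1.0, w_read=1.0):
    # Hedged sketch of an objective in the spirit of (7): reward coverage
    # and readability, penalise redundancy to encourage diversity.
    cov = sum(cosine(v, center) for v in vectors)
    redundancy = sum(cosine(vectors[i], vectors[j])
                     for i in range(len(vectors))
                     for j in range(i + 1, len(vectors)))
    read = sum(cosine(vectors[i], vectors[i + 1])
               for i in range(len(vectors) - 1))
    return w_cov * cov - w_div * redundancy + w_read * read
```

A candidate summary whose sentences are central, mutually dissimilar, and consecutively related scores highest under this sketch.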

Harmony search based MDS
The harmony search algorithm (HSA) is a meta-heuristic algorithm developed by Z. W. Geem et al. in 2001 [29]. Compared with other meta-heuristic algorithms, HSA requires fewer mathematical operations and can easily be applied to many optimization problems. The algorithm searches for a global solution as specified by the objective function. The decision variables whose values determine the objective function are analogous to the tones of musical instruments that decide aesthetic quality; thus, HSA works similarly to a musician looking for the best harmony [30].
The harmony vector values are stored in the harmony memory (HM) matrix, where each row [x1i, x2i, ..., xni] is a candidate solution. The HM is initialized with random variables. Two further parameters must be initialized: the harmony memory considering rate (HMCR) and the pitch adjusting rate (PAR). These parameters govern harmony memory consideration (HMC) and pitch adjusting (PA), respectively. The HMCR plays an important role in selecting a value from memory, while PA is important for both exploitation and exploration: exploitation is used to find optimal solutions, whereas exploration is used to avoid local minima [31]. The following algorithm shows how HSA is used for text summarization:
-Step 1: collect a set of multiple documents D = {D1, D2, ..., DN}, where each Di represents an individual document.
-Step 2: apply the preprocessing steps to each Di.
-Step 3: for each Di, calculate the coverage as shown in (3).
-Step 4: for each Di, calculate the diversity as shown in (4).
-Step 5: for each Di, calculate the readability as shown in (6).
-Step 6: initialize the HM with random solutions, and initialize HMCR and PAR.
-Step 7: sort all solutions in the HM and rank them according to (7).
-Step 8: improvise a new solution from the HM as follows:
a. if rand(0, 1) < HMCR, choose a new value from the HM; otherwise choose a value randomly.
b. if rand(0, 1) < PAR, choose a value adjacent to the selected value, depending on the bandwidth.
-Step 9: if the new solution is better than the worst stored one (based on (7)), update the HM with the new solution; otherwise discard the new solution.
-Step 10: check the stopping condition: if the result is in a stable state, stop; otherwise go to Step 7.
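The improvisation loop can be sketched as follows. A harmony is a list of sentence indices, and the HMCR/PAR values, memory size, and stopping rule are illustrative defaults rather than the paper's settings:

```python
import random

def harmony_search(n_sentences, summary_len, score, hm_size=10,
                   hmcr=0.9, par=0.3, iterations=200, seed=1):
    """Sketch of HSA for extractive summarization.

    A harmony is a list of `summary_len` sentence indices; `score` maps
    a harmony to its objective value (higher is better). Duplicate
    indices may occur in this simplified sketch.
    """
    rng = random.Random(seed)
    # Initialise the harmony memory with random solutions.
    hm = [rng.sample(range(n_sentences), summary_len) for _ in range(hm_size)]
    for _ in range(iterations):
        new = []
        for pos in range(summary_len):
            if rng.random() < hmcr:
                # Memory consideration: reuse a value stored in the HM.
                value = rng.choice(hm)[pos]
                if rng.random() < par:
                    # Pitch adjustment: move to an adjacent sentence index.
                    value = (value + rng.choice([-1, 1])) % n_sentences
            else:
                # Random selection outside the memory.
                value = rng.randrange(n_sentences)
            new.append(value)
        # Replace the worst stored harmony if the new one is better.
        worst = min(range(hm_size), key=lambda i: score(hm[i]))
        if score(new) > score(hm[worst]):
            hm[worst] = new
    return max(hm, key=score)
```

With `score` set to the summary objective, the loop gradually replaces weak candidate summaries with improvised ones that keep the best-scoring sentence combinations.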

DATASET AND EVALUATION METRICS
The TAC-2011 dataset was used to test the system performance. The dataset covers seven languages (English, Arabic, Greek, Czech, French, Hindi, and Hebrew), with 10 topics of 10 documents each per language [32]. The proposed model deals with the English language only. Recall-oriented understudy for gisting evaluation (ROUGE) [33] was used to evaluate the proposed system. The output of the ROUGE package is three numbers that represent precision, recall, and F-score, formulated as follows.
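The three ROUGE outputs can be illustrated with a simple clipped n-gram overlap; this is a simplification of the actual ROUGE package, not its implementation:

```python
from collections import Counter

def rouge_n(system, reference, n=1):
    """Illustrative n-gram overlap metrics in the spirit of ROUGE-N.

    Returns (precision, recall, f_score) from the clipped n-gram
    overlap between a system summary and a reference summary.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    sys_ngrams = ngrams(system)
    ref_ngrams = ngrams(reference)
    # Counter intersection clips each n-gram's count to the smaller side.
    overlap = sum((sys_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(sys_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    f_score = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f_score
```

Precision divides the overlap by the system summary's size, recall divides it by the reference's size, and the F-score is their harmonic mean.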

RESULTS AND DISCUSSION
The ROUGE-1 and ROUGE-2 metrics were used to measure the performance of the summaries. These metrics correlate well with human judgment. Summary performance is measured by computing the overlap between the system summaries and the human summaries: ROUGE-1 computes unigram overlap, while ROUGE-2 computes bigram overlap. The results are compared with those of [12], which included the peer summaries in the TAC-2011 dataset. Tables 1 to 4 show the results of the proposed model and [12] using ROUGE-1 and ROUGE-2, respectively.
As seen from Tables 1 and 2, compared with the results of [12] using ROUGE-1, the recall and F-score of the proposed model are higher, while the precision is lower. The F-score adjudicates between recall and precision by considering them both. Precision is computed by dividing the overlap between the system summary and the ideal summary by the length of the system summary, whereas recall is computed by dividing the same overlap by the length of the ideal summary. Thus, increasing the number of words in the system summary decreases precision, while decreasing it decreases recall. The length of each ideal summary is between 240 and 250 words, while the length of each generated summary exceeds 250 words. The reason is the creation mechanism, which adds whole sentences to the summary without changing their length, so the summary can exceed 250 words, especially when the last added sentence is long. Tables 3 and 4 show the results of the proposed model using ROUGE-2. The efficiency of the proposed model is most evident under ROUGE-2, which correlates more closely with human judgment than ROUGE-1. The averages of the three metrics, recall, precision, and F-score, are better than those of [12]. This is because of the good definitions of coverage, diversity, and readability and the good performance of HSA in choosing the most suitable sentences for the final summary.

CONCLUSION
The need for effective MDS approaches to extract significant information from a document collection has become a necessity. This paper used an HSA-based MDS model to create generic extractive summaries. The summarizer was evaluated on the TAC-2011 benchmark dataset, and the ROUGE package was applied to measure its performance. The proposed model is based on three important factors in MDS: coverage, diversity, and readability. Good results were obtained from the proposed model. A limitation of the method is controlling the HMCR and PAR parameters, which require special treatment.


ISSN: 1693-6930 TELKOMNIKA Telecommun Comput El Control, Vol. 19, No. 1, February 2021: 89-95



Table 2. Average precision, recall and F-score using ROUGE-1

Table 4. Average precision, recall and F-score using ROUGE-2