Cluster-Based Information Retrieval by using (K-means)- Hierarchical Parallel Genetic Algorithms Approach

Cluster-based information retrieval is one of the information retrieval (IR) tools that organize web documents, extract their features, and categorize them according to their similarity. Unlike traditional approaches, cluster-based IR is fast in processing large document datasets. To improve the quality of retrieved documents, increase the efficiency of IR, and reduce the irrelevant documents returned by a user search, this paper proposes a (K-means)-Hierarchical Parallel Genetic Algorithms approach (HPGA) that combines the K-means clustering algorithm with a hybrid parallel genetic (PG) algorithm built from the multi-deme and master/slave PG models. K-means is used to cluster the population into k subpopulations; the clusters most relevant to the query are then processed in parallel by the two levels of genetic parallelism. Irrelevant documents are thus excluded from the subpopulations, which improves the quality of the results. Three common datasets (NLP, CISI, and CACM) are used to compute the average recall, precision, and F-measure. Finally, we compared the precision values on the three datasets with Genetic-IR (GA-IR) and Classic-IR. The precision improvements of the proposed approach over GA-IR were 45% on CACM, 27% on CISI, and 25% on NLP, while the improvements over Classic-IR were 47% on CACM, 28% on CISI, and 34% on NLP.


INTRODUCTION
In recent years, information has become overloaded because of the rapid growth of the web. To deal with this information, a web document information retrieval task is used to retrieve the documents most relevant to a user query [1,2]. Information retrieval needs to scan all documents found in a database, score each one according to its degree of relevance to the user query, and then rank the results and present them to the user [3,4]. Thus, information retrieval requires a long runtime to scan all documents. Cluster analysis plays a basic role in improving information retrieval performance by reducing the search time and keeping irrelevant results out of the retrieved documents. The idea behind web document clustering is to assign a dataset of web documents to a set of clusters according to the degree of similarity among them. It therefore becomes easy for search engines to query within a single cluster if each web page is assigned to a group of similar pages [5,6].
An efficient clustering algorithm and genetic algorithm should represent a document as structured data using a document representation model. The most common model used in document representation is the Vector Space Model (VSM) [7]. In addition, the degree of similarity between two documents or clusters should be measured using one of the similarity measures [1].
Hierarchical and partition algorithms are the major kinds of clustering algorithms in use [8]. A hierarchical clustering algorithm generates a tree of clusters (groups) using one of two methods. The first, known as the agglomerative method, starts with each object in its own cluster and then repeatedly merges the two most similar clusters. The second, known as the divisive method, starts from the whole dataset as one cluster and then splits it into clusters at each stage [9,10]. A partition clustering algorithm uses a single step to divide the collection of documents into a predefined number of groups [11]. The most widely used partition clustering algorithm is the K-means algorithm [12]. It is an unsupervised learning algorithm that starts by selecting K centroids, one per cluster. The similarity between each document and the centroids is then calculated, and each document is assigned to the closest centroid; the centroids are updated and the assignment is repeated multiple times [13].
In the present paper, K-means clustering with two levels of parallel genetic algorithms is used for information retrieval: a multi-deme parallel genetic algorithm as the first level and a master/slave parallel genetic algorithm as the second level. The idea behind using the K-means clustering algorithm is to group a set of documents into clusters according to their similarity with a query; the HPGA algorithm then searches only the most relevant clusters, to reduce the search time and to provide optimal search results. Next, in each subpopulation, fitness evaluation is parallelized, with hybrid selection and two-chromosome crossover as the genetic operators. Individuals then migrate among subpopulations, and the HPGA steps are repeated n times until the optimal results are obtained.
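The K-means loop described above (pick K centroids, assign each document to the closest one, update the centroids, repeat) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of Euclidean distance as the closeness measure, the fixed random seed, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Distance between two equal-length vectors (assumed closeness measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Start from k distinct documents chosen at random as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its closest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids


# Toy document vectors: two documents near (1, 0), two near (0, 1).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = k_means(docs, k=2)
```

On this toy input the first two documents end up in one cluster and the last two in the other, whichever pair of documents is drawn as the initial centroids.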

TERM FREQUENCY -INVERSE DOCUMENT FREQUENCY (TF-IDF)
Datasets in most clustering algorithms are represented by a set of vectors, V = {V1, V2, V3, …, Vn}, where Vi is the feature vector of one object. Term Frequency (TF) is a simple and effective term selection method: similar words are used in documents that belong to the same subject, so term frequency can be a reasonable indicator of a subject. TF is the occurrence frequency of a term in the document, as shown in equation (1). On the other hand, some terms should be removed, such as the words in the English stop list, because the occurrence of these words is not relevant to identifying the subject of the document [14].

$$\mathrm{TF}(j, i) = \text{frequency of the } i\text{-th term in document } j \qquad (1)$$

TF alone is not effective for measuring how informative a term is across a set of documents. Thus, Inverse Document Frequency (IDF) is used. IDF reflects how widely a term occurs across the set of documents, as shown in equation (2).

$$\mathrm{IDF}(t_i) = \log \frac{|D|}{|D_{t_i}|} \qquad (2)$$

where |D| is the number of documents and |D_{t_i}| is the number of documents that contain the term t_i.

To determine the weight of each term t_i in each document d_j, TF and IDF are combined by multiplying their values; TF-IDF is given by equation (3) [15].

$$\mathrm{TF\text{-}IDF}(t_i, d_j) = \mathrm{TF}(j, i) \times \mathrm{IDF}(t_i) \qquad (3)$$

In document clustering, terms with higher TF-IDF values are better for clustering.
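As a small worked illustration of the TF-IDF weighting, the sketch below computes the weights for three toy documents. The function name `tf_idf`, the natural logarithm, and the sample tokens are assumptions for the example; the paper does not state the logarithm base.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dictionary per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return weights


docs = [
    ["genetic", "algorithm", "search"],
    ["cluster", "search"],
    ["genetic", "cluster", "cluster"],
]
w = tf_idf(docs)
```

Here "algorithm" occurs in only one of the three documents, so its weight in the first document is 1 × log(3/1), while "cluster" occurs twice in the third document and in two documents overall, giving 2 × log(3/2).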

GENETIC ALGORITHM
The genetic algorithm (GA) is a probabilistic meta-heuristic search algorithm inspired by natural genetics [16,17]. GA gives good solutions in many fields. Figure 1 demonstrates the flowchart of the genetic algorithm steps. The basic operations of a genetic algorithm are [18,19]:
1. Generate random solutions, which are called a population.
2. Determine a fitness value to evaluate each solution.
3. Select the best solutions according to their fitness.
4. Produce a new population by genetic operators (crossover and mutation).
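The four basic operations can be sketched as a short loop. This is a generic toy example, not the retrieval GA of this paper: the bit-string chromosomes, the size-2 tournament selection, the one-point crossover, the parameter values, and the fitness function (the number of 1 bits) are all assumptions chosen to keep the example self-contained.

```python
import random


def run_ga(fitness, length=12, pop_size=20, generations=40,
           mutation_rate=0.05, seed=1):
    rng = random.Random(seed)
    # 1. Generate a random population of bit-string solutions.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # 2. Determine the fitness value of each solution.
        scores = [fitness(individual) for individual in population]

        # 3. Select parents: the fitter of two randomly drawn individuals.
        def pick():
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        # 4. Produce a new population by crossover and mutation.
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                # bit-flip mutation
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


# Toy fitness: count of 1 bits in the chromosome.
best = run_ga(fitness=sum)
```

The returned chromosome is the best one seen across all generations, so its fitness never falls below that of the best individual in the initial population.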
Parallelism can be employed to reduce the processing time. There are three models of Parallel Genetic Algorithms (PGA), as exhibited in Figure 2: (a) master/slave PGA, which deals with a single population and parallel fitness calculation; (b) multi-deme PGA, which deals with multiple populations and parallel genetic operations followed by migration among them; (c) cellular PGA, which deals with a single population running on a closely linked, massively parallel processing system. These models can be hybridized to produce Hierarchical PGA (HPGA) models [20,21].

THE PROPOSED APPROACH
Information retrieval systems process a large amount of text in the document indexing and user query stages. Parallelism is one way to reduce the average query time. The proposed procedure uses a parallel genetic algorithm with K-means to retrieve the documents most relevant to a user query, following the steps enumerated below. Figure 4 presents the proposed (K-means)-HPGA approach:

Web Document Data Extraction
Web page extraction is the interaction with the web page source (HTML) to scrape the information and to identify structured data. It is a post-processing stage composed of two steps:

Tree-based extraction
Web pages are semi-structured, and this is the most important property for representing the HTML tags and text as a labeled tree, called the DOM (Document Object Model) [22]. Elements in the tree are addressed by their tags via the XPath language.
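To illustrate tree-based extraction, the sketch below parses a small page into a tree and addresses elements with XPath-style expressions. It assumes the page is well-formed XML, which lets Python's standard `xml.etree.ElementTree` stand in for a full HTML parser; real web pages are often not well formed and would need a tolerant parser. The sample markup and the class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page (hypothetical content).
page = """<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="content">
    <h1>Cluster-based retrieval</h1>
    <p>Documents are grouped by similarity.</p>
    <p>Queries are matched against clusters.</p>
  </div>
</body></html>"""

# Build the labeled tree from the page source.
root = ET.fromstring(page)

# Address elements by tag and attribute using the XPath subset
# that ElementTree supports.
title = root.find(".//div[@class='content']/h1").text
paragraphs = [p.text for p in root.findall(".//div[@class='content']/p")]
```

The navigation block is skipped because the path selects only the `div` whose `class` attribute is `content`.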

Text Tokenizer
The purpose of the tokenizer is to break the text into tokens, eliminate stop words, and stem the remaining tokens. The stop word list that we used contains 1300 words, which include articles (a, an, the), prepositions (in, into, on, at), conjunctions (and, or, but, and so on), pronouns (she, he, I, me), and other words irrelevant to the query process. Porter stemming is used in our approach to enhance accuracy by dropping the morphological variants of words. Thus, tokens that share a stem and differ only in suffixes such as -ED, -ING, -ION, and -IONS are treated as having similar meanings.
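A rough sketch of this tokenizing pipeline is given below. The stop list is a tiny illustrative subset rather than the 1300-word list used in the paper, and `crude_stem` is a deliberately simplified suffix stripper that only stands in for the Porter stemmer; both are assumptions for the example.

```python
import re

# Illustrative subset of a stop list (the paper's list has 1300 words).
STOP_WORDS = {"a", "an", "the", "in", "into", "on", "at", "and", "or",
              "but", "she", "he", "i", "me", "is", "are"}

# Suffixes named in the text, longest first. This is NOT the Porter algorithm.
SUFFIXES = ("ions", "ion", "ing", "ed")


def crude_stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split into alphabetic tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = tokenize("The connected documents are connecting the connections")
```

The three variants "connected", "connecting", and "connections" all reduce to the same token, which is the effect the stemming step is meant to achieve.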

Document and Query Representation
In this approach, the Vector Space Model (VSM) is used. A feature vector is generated from the content of each document and from the given query, based on the occurrence of words in the document, by using the TF-IDF function: the frequency of occurrence of the term in the document (TF) combined with the inverse frequency of the term in the document dataset (IDF), as mentioned in equation (3).

K-means -Hierarchical Parallel Genetic Algorithms Approach
The idea behind using a parallel algorithm is to split the task into a set of subtasks in a divide-and-conquer manner. In our approach we use a multi-deme (multiple-population) parallel genetic algorithm with K-means clustering. The steps below explain the operation of the algorithm in detail:

Generate Population
Create the subpopulations from the web document dataset via the K-means algorithm. K-means splits the documents to be indexed into k clusters; the final centroid of each cluster is then evaluated against the query, and only the clusters that are near the query are selected.
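The cluster selection step can be sketched as ranking the final centroids by their similarity to the query and keeping the closest ones. The text does not say which measure or cutoff is used here, so the cosine measure, the `top_n` parameter, and the toy vectors below are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_clusters(centroids, query, top_n):
    """Return the indices of the top_n centroids most similar to the query."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda c: cosine(centroids[c], query),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical final centroids of three clusters and a query vector.
centroids = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.1, 0.9]]
query = [1.0, 0.2, 0.0]
selected = select_clusters(centroids, query, top_n=2)
```

Only the documents in the selected clusters would then form the subpopulations passed to the genetic search; the third cluster, whose centroid points away from the query, is left out.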

Fitness Evaluation
The second level of the parallel algorithm is applied to evaluate the fitness function in each cluster (subpopulation); that is, all documents in the cluster are evaluated at the same time under the master/slave parallel concept. This evaluation starts by forwarding the user query to each cluster and then calculating the fitness function for each document in the cluster. In the present approach, the cosine similarity function is used as the fitness function [23]; it is given in equation (4).

$$\cos(d_j, q) = \frac{\sum_i w_{i,j}\, w_{i,q}}{\sqrt{\sum_i w_{i,j}^2}\,\sqrt{\sum_i w_{i,q}^2}} \qquad (4)$$

where w_{i,j} and w_{i,q} are the weights of term t_i in document d_j and in the query q, respectively.
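The master/slave evaluation can be sketched with a worker pool: the master hands each document of a cluster to a worker, and the workers compute the cosine fitness against the query. The function names, the pool size, and the toy vectors are assumptions. A thread pool is used only to keep the sketch portable; for CPU-bound scoring in CPython, a process pool would be the more realistic choice.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine_fitness(doc, query):
    """Cosine similarity between a document vector and the query vector."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = (math.sqrt(sum(d * d for d in doc))
            * math.sqrt(sum(q * q for q in query)))
    return dot / norm if norm else 0.0


def evaluate_cluster(cluster, query, workers=4):
    """Master: dispatch one fitness evaluation per document to the workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the same order as the documents.
        return list(pool.map(lambda doc: cosine_fitness(doc, query), cluster))


# Hypothetical cluster of three document vectors and a query vector.
cluster = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
query = [1.0, 0.0]
scores = evaluate_cluster(cluster, query)
```

The first document points in the same direction as the query and scores 1, the second lies at 45 degrees and scores 1/√2, and the third is orthogonal and scores 0.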

Genetic Operators
A new population is generated by applying the genetic operators (selection and crossover). To improve genetic performance, we move the 4% of chromosomes with the highest probability into the next generation without change (i.e., we apply the elitism feature). The genetic operators in (K-means)-HPGA follow these steps:
a. Calculate the probability for each chromosome.
b. Rank the probability values and take the top 4% as elites, to avoid losing the fittest chromosomes in the new population.
c. Hybrid Roulette-Tournament Selection (HRTS): the process of selecting a pair of parents from the population so that fitter offspring are emphasized in the new population. In our approach we used a hybrid method to take advantage of both selection methods (roulette wheel and tournament).
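One possible reading of steps (b) and (c) is sketched below. The way the two selection methods are combined is an assumption: here a random number r of candidates is drawn by roulette wheel and the fittest candidate wins a tournament among them, once per parent. The function names, the string placeholders for chromosomes, and the fitness values are hypothetical.

```python
import random


def elites(population, fitness, fraction=0.04):
    """Keep the top fraction of chromosomes unchanged (at least one)."""
    count = max(1, round(len(population) * fraction))
    ranked = sorted(range(len(population)),
                    key=lambda i: fitness[i], reverse=True)
    return [population[i] for i in ranked[:count]]


def roulette_pick(fitness, rng):
    """Pick one index with probability proportional to its fitness."""
    total = sum(fitness)
    if total <= 0:
        return rng.randrange(len(fitness))
    threshold = rng.uniform(0, total)
    running = 0.0
    for i, f in enumerate(fitness):
        running += f
        if running >= threshold:
            return i
    return len(fitness) - 1


def hybrid_select(population, fitness, rng):
    """Assumed hybrid: roulette draws r candidates, a tournament picks the best."""
    parents = []
    for _ in range(2):                       # one pass per parent
        r = rng.randint(1, len(population))  # random tournament size
        candidates = [roulette_pick(fitness, rng) for _ in range(r)]
        winner = max(candidates, key=lambda i: fitness[i])
        parents.append(population[winner])
    return parents


rng = random.Random(7)
population = [f"doc{i}" for i in range(50)]  # placeholder chromosomes
fitness = [i / 49 for i in range(50)]        # doc49 is the fittest
kept = elites(population, fitness)
parent1, parent2 = hybrid_select(population, fitness, rng)
```

With 50 chromosomes, the 4% elitism rate keeps the two fittest ones, `doc49` and `doc48`, unchanged for the next generation.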
The selection process is explained by the following algorithm:

HRTS Algorithm
Input: popsize, fitness.
Output: parent1, parent2.
Begin
for j = 1 : 2
    r = randi [1, popsize] // Select random number for subpopulation
    for i = 1 : r
        sumfitness = sum (fitness)

We measured the precision improvements achieved by the proposed approach over Information Retrieval by Genetic Algorithm (GA-IR) on the three datasets. Tables 4, 5, and 6 present a comparison between our approach and GA-IR. The average improvement was calculated for each dataset, and the results were 25.6666%, 27.4444%, and 45.2222% for NLP, CISI, and CACM, respectively.

Table 4. Comparison analysis of the (K-means)-HPGA approach and GA [26]

Finally, we compared the precision of the proposed approach with that of classic Information Retrieval (Classic-IR); the improvements were 34.4444% in NLP, 28.6666% in CISI, and 47% in CACM, as shown in Tables 7, 8, and 9.
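The recall, precision, and F-measure behind these comparisons are assumed here to follow the standard set-based definitions; this section of the paper does not spell them out. The sketch below computes the three values for one query, with invented document identifiers.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query: 4 documents retrieved, 3 relevant, 2 in common.
precision, recall, f_measure = evaluate(retrieved=[1, 2, 3, 4],
                                        relevant=[2, 4, 5])
```

For this query the precision is 2/4, the recall is 2/3, and the F-measure is 4/7. Averaging such per-query values over all queries of a dataset would give dataset-level figures of the kind reported in the tables.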

CONCLUSIONS
From the tests carried out for this paper, we conclude that the proposed approach improves information retrieval performance: (K-means)-HPGA achieved higher precision and better quality in document retrieval. A reduction of irrelevant results in user searches was also observed. Our results were determined by comparing the approach with Classic-IR and GA-IR on three common datasets (NLP, CISI, and CACM). The precision improvements across the three datasets ranged from 28% to 47% over Classic-IR and from 25% to 45% over GA-IR.

Table 1. The results of recall, precision, and F-measure for 100 queries on the NPL dataset (DS1) using the (K-means)-HPGA approach