Algorithm for Information Retrieval optimization

Users of Information Retrieval (IR) systems often submit search queries made of ad-hoc keywords. It is then up to the information retrieval system (IRS) to obtain a precise representation of the user's information need and of the context (preferences) of that information. To address this problem, we investigate how to adapt an IRS to individual information needs so that results are ordered by relevance. The goal of this article is to develop algorithms that optimize the ranking of documents retrieved from an IRS according to the user's search context, in particular the ranking task that led the user to engage in information-seeking behaviour. This article describes a Document Ranking Optimization (DROPT) algorithm for IR in an Internet-based or designated-database environment. As the volume of information available online and in designated databases grows continuously, ranking algorithms play a major role in the usefulness of search results. In this article, a DROPT technique for documents retrieved from a corpus is developed with respect to the document index keywords and the query vectors. It is based on the weight (wij) of keywords in the document index vector, calculated as a function of the frequency of a keyword kj across a document. The purpose of the DROPT technique is to reflect how human users judge context changes in IR result rankings according to information relevance. This article shows that the DROPT technique can, via adaptation, overcome some of the limitations of traditional (tf × idf) algorithms. An empirical evaluation of the DROPT technique with standard metrics, carried out through human user interaction, shows improved IR effectiveness over the traditional relevance feedback technique.


I. INTRODUCTION
Recent years have witnessed an ever-growing amount of online information. The development of the World Wide Web (WWW) led to an increase in the volume and diversity of accessible information, and today the Internet is a huge information repository. The question that now arises is how access to this information can be effectively supported. Users require tools that help them locate documents that satisfy their specific needs. This article describes a DROPT technique for information retrieval (IR) in an Internet or designated-database environment. Our particular goal is to develop algorithms that optimize the ranking of documents retrieved from IR systems, ordering them by relevance to individual information needs. In particular, we believe that search queries should be reformulated to satisfy individual user information needs by utilizing context for personalization through implicit relevance feedback. However, ranking documents by relevance to a user's information needs is increasingly difficult because the number and variety of documents available on the web have grown exponentially. This growth has driven the need to search for documents that match a user's specific information needs exactly.
In general, the quality of personalized search depends on the quality of the user-specific information provided, e.g. users' queries. For example, if users are unfamiliar with expressing their information needs as queries, traditional relevance feedback cannot provide suitable documents, and the user may in turn be dissatisfied with the search results. Before users are presented with the documents, search engines use a ranking mechanism to show the most relevant documents at the top and the least relevant ones at the end. To this end, a variety of ranking models have been developed to complement traditional relevance feedback [1]. One of the most popular and successful techniques used in ranking models is described in [2].
To address this problem, we propose a new DROPT technique, in which users' preferences (retrieved documents) are ranked in order of relevance to satisfy individual user information needs. We have extended a traditional (tf × idf) model to perform this ranking task because it provides a clear formalism for ranking users' preferences in order of relevance. Utilizing search context helps IR systems provide search results personalized to the individual user's current situation, such as search stage, user knowledge, and preferred search results. In this article we argue that the ranking of retrieved documents according to relevance, through human user interaction behaviour, is an important contextual factor that affects users' preference of search results. This implicitly derived evidence can be used to reformulate queries through implicit relevance feedback, thus helping individual users to complete their ranking tasks more efficiently and effectively. One step in optimizing IR (search queries) is the deployment of our proposed DROPT algorithm.
The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, we extend the traditional (tf × idf) model and describe our DROPT technique for performing the ranking tasks. The experimental methodology is described in Section 4, and Section 5 presents the results obtained for the proposed technique. Section 6 concludes.

II. RELATED WORK
This section briefly summarizes related work, in particular research that has influenced the focus of this article. Text-based IR is one of the oldest areas of research in Computer Science, and a number of approaches have been devised over the years [3,4]. Techniques based on 'bag-of-words' representations, where the frequencies of terms in documents are used to define a vector space model (VSM), remain prevalent, with variations of the Term Frequency-Inverse Document Frequency (TF-IDF) model being the most popular technique [3] and widely used in practice. The research reported in [5] proposed improvements to TF-IDF-based text search to enhance ranking performance.
A document ranking technique is an algorithm that tries to match documents in the corpus to the user's query and then ranks the retrieved documents, listing those most relevant to the user at the top. This is achieved by an IR system that matches the query keywords against the document index in order to retrieve matching documents. Popular Internet search engines are designed for retrieving online documents and are typically perceived to do an excellent job of finding relevant information. However, the studies reported in [6,7,8] highlighted that users interact with only a limited number of search results, typically those on the first page. The authors demonstrated that information seekers usually choose some relevant information within the first page of results, having viewed very few documents. Uncertain about the availability of other relevant documents, most users end their search sessions after one or two iterations. The only way to satisfy user information needs is to search on a continuous basis, which is a time-consuming task.
Relevance information is a vital factor for determining the relevance weight, and obtaining this information is crucial. We can achieve this by using user feedback on retrieved documents, which indicates which documents are relevant and which are not. In the case of an Internet search its importance becomes critical. Reference [9] proposed a ranking technique for multi-search projections on the WWW, using a results aggregation model based on query words, search results, and search history to capture the user's intention. To this end, the WWW can offer a rich context of information that can be expressed through the relevancy of document contents. Reference [10] proposed a model for online learning that is specifically suited to user feedback. The experiment conducted showed retrieval effectiveness for Internet search ranking. In the context of Internet search ranking, it is important that these techniques aim at finding the best ordering function over the returned documents. The authors argue that regression on labels may be adequate and, indeed, competitive in the case of large numbers of retrievals. To make the WWW more interesting, there is a need to develop an effective and efficient ranking algorithm that delivers more suitable results for users. User information needs modelling is utilized in an effort to define a relevance model from the users' perspective to improve retrieval effectiveness [11].
Ideally, the relevance of documents should be defined based on user preferences, so the problem of ranking retrieved documents is to sort documents based on user preferences. Relevance is a standard measure utilized in IR to evaluate the effectiveness of an IR system based on the documents retrieved. The works reported in [12,13] study the concept of relevance and relevance assessment. The effectiveness of an IR system is determined primarily by the relevance assessment of the retrieved information [14,15,16]. The concept of relevance, however, is subjective and influenced by diverse factors. For example, the queries posed to IR systems are, most of the time, not optimal in describing the required information with respect to an individual user's information needs. To this end, user perception and user knowledge level are factors that influence the relevance of a retrieved document.
Certainly, to make the Internet more interesting, there is a need to develop an appropriate and efficient ranking algorithm that delivers more suitable results for individual users. Usually, there are thousands of relevant documents for each query, yet users typically consider only the top 10 or 20 results. The need for query operations arises from the user's difficulty in formulating queries without a full understanding of the underlying collection and the IR environment [17]. Over the years, various techniques to deal with this ranking problem have been proposed. Reference [18] divides these techniques into two broad categories: global methods that use information independent of the query, and local methods that adjust a query relative to the documents that initially seem to match it.
To this end, the idea of context for personalization relates to the fact that human preferences are heterogeneous, multiple, and changing, and should be understood with the user's goals in mind. To address these discrepancies, the question that now arises is how the information seeker's interaction with the IR system, and his or her expectations and judgements about retrieved documents, can be supported effectively by our proposed DROPT technique.

III. THE PROBLEM FORMULATION
Let us define this problem in terms of document content analysis by self-learning. Assume that for a query $q$ we have a set of documents $D$ with associated relevance numerical weight values $v \in [0, 1]$, where $[0, 1]$ is the normalization interval, which induce a relevance order among the documents in $D$. Here 1 is the maximum relevance numerical weight value, corresponding to 'highly relevant', and the value 0 corresponds to 'irrelevant'. For example, the relevance context information $v_a > v_b$ implies that document $d_a$ is preferable to document $d_b$. This expresses the user's degree of interest by pairwise comparison of documents. Our goal is to rank retrieved documents according to relevance numerical weight, such that documents with relevance value 1 appear at the start of the ranked list, ahead of documents with relevance value 0. This optimization of IR is obtained by ranking the documents in descending order of the relevance numerical weight value $v$, which is obtained from the weighting function $w$. We then wish to return a relevance-weighted subset of $D$ such that, for each document in $D$, we optimize the following weighting function:

$$w = tf \times idf = tf \times \log\left(\frac{N}{n_i}\right) \qquad (1)$$

where $tf$ is the term frequency in the query-document pair, $idf = \log(N/n_i)$, $n_i$ is the number of indexed documents containing term $i$, and $N$ is the total number of documents in the corpus.
Following the work reported in [17,19], the notation in (1) suggests that diverse approaches to this weighting-function problem involve statistics to enhance retrieval effectiveness. In this paper, we therefore extend, in Sub-section 3.1, a traditional tf × idf model to provide a clear formalism for ranking users' preferences based on (1).
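As an illustration of the tf × idf weighting in (1), the following is a minimal Python sketch. It assumes whitespace tokenization, raw term counts for tf, and the natural logarithm for idf; none of these choices is specified in this paper, and the function name `tf_idf_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / n_i)."""
    n_docs = len(docs)  # N: total number of documents in the corpus
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    # n_i: number of documents that contain term i at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical toy corpus, for illustration only.
corpus = [
    "information retrieval ranking",
    "ranking of retrieved documents",
    "document ranking optimization for information retrieval",
]
w = tf_idf_weights(corpus)
```

In this toy corpus the term "ranking" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to the ranking, whereas "optimization" occurs in a single document and receives the largest idf, log(3).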

EXTENDING TRADITIONAL (tf × idf) MODEL TO DROPT TECHNIQUE
In this section we study the problem of ranking the documents retrieved from the search engine back-end prototype developed in this study (which acts as the interaction between the information user and the information source), such that the most relevant documents are retrieved at the start of the list, given a query $q$. For example, we desire to rank a set of scientific articles such that those related to the query 'information retrieval' are retrieved first. The basic assumption we make is that such a ranking can be obtained by a weighting function $w(d, q)$ which conveys how relevant document $d$ is for query $q$. The document ranking is done by taking a weighted average of all determined parameters.
The requirement of being able to deal with each (document, query) pair independently arises from practical details of the search engine back-end prototype. To search through a large collection of documents efficiently, it is preferable to assign a numerical weight to each document individually. In this respect, users are often interested only in the most relevant documents rather than the entire ranked list. For example, in most Internet searches it is likely that users will only want to look at the first 10 retrieved search results. Similarly, when retrieving documents, a user may only be interested in viewing the top n documents. The focus of this paper is to provide a limited number of ranked documents to the end user. Alternatively, the user's satisfaction with the system may depend on how many documents they need to scrutinize until they find a relevant one. Therefore, this paper is concerned primarily with the retrieval of the most relevant documents according to information relevance, rather than with all of them.
Our approach to ranking retrieved documents is centred on self-learning: the weighting function is learned with the required adaptive properties. This is in contrast to past strategies in IR, which rely on viewing the documents as information overloads to obtain a weighting function without consideration for the underlying document content analysis. The semantic similarities between terms in documents have attracted the interest of many researchers, who realize that viewing query terms alone as relevant information is limiting. Therefore, this paper takes advantage of query term occurrences and self-learning to guide us in finding a weighting function that can automatically adjust its search structure to a user's query behaviour. In this regard, a good ranking criterion remains the choice of an IR system expert.

FORMALIZATION OF MATHEMATICAL MODEL DEFINITIONS FOR DROPT ALGORITHMS
Based on (1), a DROPT measure for documents retrieved from a corpus is developed with respect to document index keywords and the query vectors. Naturally, given the notation we present for the problem, the use of statistical methods has proven both popular and efficient in responding to the problem formulation [17,20]. The measure is based on calculating the weight $w_{ij}$ of keywords in the document index vector, calculated as a function of the frequency of a keyword $k_j$ across a document $d_i$.
The DROPT technique is based on IR result rankings, where a ranking $R$ consists of an ordered set of ranks. Each rank consists of a relevance numerical weight value $v$, where $v$ represents the relevance numerical weights of the retrieved documents. Each rank is assigned an ascending rank number $n$.
Our DROPT technique is composed of six steps.
Step 1: Initialization of Parameters
(a) Let a query vector, $Q$, be defined as $Q = (q_1, q_2, \dots, q_l)$, where each $q_j$ is a term string with a weight of 1.
(b) Let the indexed document corpus be represented by the matrix $D = [d_{ij}]_{n \times l}$, where each $d_{ij}$ is an index string with weight $w_{ij}$.
(c) We compute the convolution matrix $W = DQ$ by a simple multiplication of the document vectors and the query vectors, where $q_j$ are the weights of terms in the query vectors and $n$ is the number of retrieved documents that are indexed by at least one keyword in the query vector. The matrix $W$ gives a numeric measure with no context information.
Step 2: Search String Processing The comparison of the issued query term against the document representation is called the query process. The matching process results in a list of potentially relevant context information. Individual users will scrutinize this document list in search of the information they need. The goal of context information acquisition should be to determine what a user is trying to achieve while performing his/her matching tasks. (See Section 3.3 for the matching rules.)
Step 3: Calculate Relevance Weight Retrieved documents that are more relevant are ranked ahead of other documents that are less relevant. It is important to find relevance numerical weights of the retrieved documents and provide a ranked list to the user according to their information requests.
(a) Based on (1), the relevance weight is obtained according to document content.
(b) Subsequently we calculate the average mean weight $\bar{w}$ using the weighted root mean square (RMS) to determine the overall fitness value of the retrieved documents with respect to a given query (6), where $\bar{w}$ is the average relevance mean weight of each retrieved document, $n$ is the number of keyword term occurrences in each retrieved document, $l$ is the total size of the keywords in the corpus, and $w_{ij}$ are the sum weights of terms of the document vectors.
Step 4: User Feedback about Retrieved Documents User feedback about retrieved documents is based on overall relevance weights to construct a personalized user profiling of interests. We can achieve this when a user indicates the documents that are relevant or otherwise, from the designated databases context.
(a) The overall relevance judgment is given by the $n \times l$ matrix $G$ (7), where $1 \le i \le n$, $1 \le j \le l$, $G$ is a query vector with a small-operator defined as a matrix, $w_{ij}$ are the weights of terms of the document vectors, and $q_j$ are the query vectors.
(b) Any numerical weight component of matrix $G$ greater than the average mean weight (6) is retained and added to a matrix $T$.
(c) Thus, any document whose value is higher than the overall average relevance weight is predicted to be a relevant document; any document with a lower value is predicted to be an irrelevant document (9). The average relevance mean value within the normalization interval is then computed for each document.
Step 5: Relevance Judgment The individual user is asked to judge the influence of a contextual factor (e.g. information relevance) on the ranking given a certain contextual dimension (the numerical weight is relevant or irrelevant).
(a) If the ranked document is relevant to the user's information needs, the user finishes his/her query search context, then GO to Step 4 according to the user's document preference.
(b) Otherwise, the user continues to search the document databases by reformulating the query, or stops querying the designated database, until relevant documents are ranked. GO to Step 6.
Step 6: Update Term Weight and Keywords Set The keyword term set n provided by the ranked documents and the relevance numerical weight values will be updated by user feedback.
(a) Any new query term not belonging to n will be added and a new column of relevance weight value will be computed and expanded for ranked documents routinely.
(b) If any ranked document $d_i$ is retrieved by the users, the corresponding relevance weight values with respect to the query keywords are increased by $\beta$ (11). The default value of $\beta$ is set so as to increase the corresponding relevance numerical weight values.
We coined the acronym DROPT to name our adaptive algorithm, which provides a limited number of ranked documents in response to a given query. It can also improve the ranking mechanism for the search results in an attempt to adapt to the users' retrieval environment and to the amount of relevant context information in each user's request. Finally, the DROPT measure must be self-learning, so that it can automatically adjust its search structure to a user's query behaviour.
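The six steps above can be sketched in Python as follows. This is a simplified reading under stated assumptions, not the authors' exact formulas: each document's value is taken to be the root mean square of its weights on the query terms, the relevance threshold is the mean of those values, and feedback adds a fixed β = 0.1 capped at 1.0. The function names, the cap, the value of β, and the toy index are all hypothetical.

```python
import math

def document_values(doc_weights, query):
    """Per-document RMS of the weights of the terms that occur in the query.

    Assumes the query vector is binary and contains at least one term.
    """
    positions = [j for j, q in enumerate(query) if q == 1]
    values = []
    for row in doc_weights:
        squares = [row[j] ** 2 for j in positions]
        values.append(math.sqrt(sum(squares) / len(squares)))
    return values

def rank_relevant(doc_weights, query):
    """Return (doc_index, value) pairs above the mean value, most relevant first."""
    values = document_values(doc_weights, query)
    threshold = sum(values) / len(values)  # overall average relevance weight
    kept = [(i, v) for i, v in enumerate(values) if v > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

def apply_feedback(doc_weights, doc_index, query, beta=0.1):
    """Return a copy in which the chosen document's query-term weights grow by beta."""
    updated = [list(row) for row in doc_weights]
    for j, q in enumerate(query):
        if q == 1:
            # Cap at 1.0 so weights stay inside the normalization interval (assumption).
            updated[doc_index][j] = min(1.0, updated[doc_index][j] + beta)
    return updated

# Hypothetical 3-document, 3-keyword index; the query uses the first two keywords.
D = [[0.9, 0.8, 0.1],
     [0.3, 0.4, 0.9],
     [0.7, 0.7, 0.2]]
Q = [1, 1, 0]

ranked = rank_relevant(D, Q)       # documents 0 and 2 exceed the mean value
updated = apply_feedback(D, 2, Q)  # the user selected document 2
```

Returning a copy from `apply_feedback` keeps the original index intact, which makes it easy to compare rankings before and after a round of feedback.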

MATCHING RULES MECHANISM
This Sub-section describes the principles of the matching process reported in this research. To this end, a user must specify some information, considered as context pertaining to the query. This context (preferences) provides a high-level description of the user's information need and eventually controls the search strategy used by the system. In this paper, we focus on modelling the information using rules that best match the user's interest in order to judge the relevance of competing information need models. Each such rule states, for a set of conditions, a particular YES or NO together with a weight. The rules are shown in Table 1. Each of the cells in Table 1 represents an IF <CONDITION> THEN <ACTION> statement.
Users can express conditions regarding the values of a preference. For example, the first cell in Table 1 is the statement IF <Matching = Y; Feedback = P> THEN <Judgment = HR>, where Y represents the matching condition value "YES", P represents the feedback value "Perfect", and HR represents the relevance judgment value "Highly Ranked". These judgment rules rely on obtaining information from a domain expert by scoring each of the …
The notion of user preference has been discussed in the IR literature, although its relevance has perhaps not been fully explored. Based on the investigation in [21], the concept of user preference is adopted in the present study for measuring the relevance of documents. A user preference relation has been applied in this research to provide a suitable means for "pairwise comparison of documents". Consider any two documents $d, d_1 \in D$, where $D$ denotes a finite set of documents. We assume that a user is able to decide whether one document is more or less relevant than another based on the relevance weight of the document. Our goal is to establish a basis for the representation of user judgments on the relevance of documents within the normalization interval scale $v \in [0, 1]$. The user preference relation can be defined by a binary relation $>$ on $D$ as follows: $d > d_1$ if and only if the user prefers $d$ to $d_1$. This expresses the user's degree of interest. In this study, however, a rule-based context system associates a set of inputs (conditions) with a set of rules to obtain an output (judgments). The facet level of document judgment was proposed in [22], and it includes two values: segment and document. Segment-level tasks require locating specific information within a page, while document-level tasks only require users to judge whether a page is relevant in general and do not necessarily require locating specific information.
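The rule lookup and the weight-based preference relation described above might be sketched in Python as follows. Only the one rule quoted in the text (Matching = Y, Feedback = P gives HR) is encoded; the remaining cells of Table 1 are not reproduced here, so the sketch returns `None` for them. The names `RULES`, `judge`, `prefers`, and `rank_by_preference`, and the example weights, are hypothetical.

```python
# IF <Matching; Feedback> THEN <Judgment>. Only the cell quoted in the text is
# listed; the other cells of Table 1 would have to be added from the table.
RULES = {("Y", "P"): "HR"}

def judge(matching, feedback):
    """Return the judgment for a (matching, feedback) condition, or None if no rule is known."""
    return RULES.get((matching, feedback))

def prefers(weights, a, b):
    """Preference relation: document a is preferred to b if its relevance weight is higher."""
    return weights[a] > weights[b]

def rank_by_preference(weights):
    """Order document ids so that preferred (higher-weight) documents come first."""
    return sorted(weights, key=weights.get, reverse=True)

# Hypothetical relevance weights on the [0, 1] scale.
weights = {"d1": 0.4, "d2": 0.9, "d3": 0.7}
order = rank_by_preference(weights)
```

Because the preference relation is induced by a single numeric weight per document, sorting by weight yields a ranking that is consistent with every pairwise comparison.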

IV. METHODOLOGY
This section describes the experimental methodology. We involved three system users (all PhD students) to collect data through the search engine back-end prototype hosted on WampServer. Each of the three participants was given 10 search tasks in his or her domain of knowledge. During the search sessions, the students' interactions with the search engine back-end prototype were logged via the system log-in menu using their student identification numbers. In each task, the students were asked to obtain the frequency of keyword matches across documents relevant to their information requests, in order to perform the document ranking task based on individual users' preferences, and to ignore documents found to be irrelevant. The user behavioural measures we examine are the frequencies of the issued queries. The frequency of each keyword across a document from the collected document database is stored in the WampServer localhost database. WampServer is a Windows web development environment that allows users to create web applications with Apache 2, PHP and a MySQL database; phpMyAdmin allows the databases to be managed easily. This measure was used to predict the 'relevant' documents, marked 'X', for the document ranking model. To evaluate the performance of the proposed technique, we performed a small-scale experiment with 30 different queries from the system users to validate the effectiveness of the technique. Table 2 gives the statistics of the queries considered in the experiment.

V. EXPERIMENTS
A search engine back-end prototype (wrapped around Google) was created for the three system users' domains of expertise. The results and discussion are presented in Section 5.1, ranking performance results are discussed in Section 5.2, and performance evaluations are given in Section 5.3.

RESULTS AND DISCUSSIONS
To generate the predictive user model of the document ranking context, we used the weighting method to calculate the most relevant documents, marked 'X', as crucial predictors (relevance numerical weights) and so generate data for the adaptation of retrieved documents. For the document ranking models generated according to relevance weights, the average relevance weights of individual users were ϖ = 0.663 for Domain 1, ϖ = 0.85 for Domain 2, and ϖ = 0.735 for Domain 3, and the overall average relevance weight for the three domains combined was ϖ = 0.75. Thus, for Domain 1, any document whose value was higher than 0.663 would be predicted for ranking as a 'relevant' document and marked 'X', and any document with a lower value would be predicted but ignored if later found to be 'irrelevant'. The same rule applies with a threshold of 0.850 for Domain 2 and 0.735 for Domain 3.
We generated three prediction models, one for each domain of expertise, each from different data generated from the user behaviour measures when the matching tasks were considered. For the combined model, any document whose value was higher than 0.75 would be predicted for ranking as a 'relevant' document and marked 'X', and any document with a lower value would be predicted but ignored if later found to be irrelevant. Our goal is to appropriately predict relevant documents for ranking based on user preference. Therefore, we measured precision and recall of relevant documents, marked 'X'.
Our results on the indexed keywords represent the domains of the system users (three PhD students) in an in-lab experimental setting. The results demonstrate that combining individual system users' behavioural measures can improve ranking prediction accuracy (according to relevance weights) for document ranking; however, individual users' rankings performed much better than the combined document rankings. This accomplishes the adaptation of retrieved documents for individual users, which is the focus of this paper. Retrieval effectiveness is measured using well-known metrics at known relevant documents.
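The thresholds reported above can be checked with a short Python sketch. It assumes, as one plausible reading, that the overall average is the unweighted mean of the three domain averages; the text does not state how 0.75 was obtained, but the reported figure is consistent with that reading. The function name `is_relevant` and the example document value 0.80 are hypothetical.

```python
# Average relevance weights reported for the three domains.
domain_means = {"Domain 1": 0.663, "Domain 2": 0.85, "Domain 3": 0.735}

# Assumption: the overall average is the unweighted mean of the domain averages.
overall = sum(domain_means.values()) / len(domain_means)

def is_relevant(value, threshold):
    """A document is predicted 'relevant' (marked 'X') if its value exceeds the threshold."""
    return value > threshold

# A hypothetical Domain 2 document with value 0.80:
relevant_overall = is_relevant(0.80, overall)                   # combined model
relevant_domain = is_relevant(0.80, domain_means["Domain 2"])   # individual model
```

The mean of 0.663, 0.85 and 0.735 is approximately 0.749, which rounds to the reported 0.75. The example document passes the combined threshold but not the stricter Domain 2 threshold, which illustrates how individual and combined models can disagree on the same document.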

RANKING PERFORMANCE RESULTS
To measure ranking performance, the DROPT technique for ranking the search results list was tuned by experimenting with the prototype system for relevance judgment. Each query produced a document based on the matching conditions, and the retrieval was repeated for 10 query reformulations from each system user's domain. The underlying philosophy of the relevance judgment rules for user model judgment using the DROPT technique is to rank those documents that exceed the overall weighted fitness score and that the system user judges to be relevant to his/her information needs, and to ignore those documents the system users judge to be irrelevant (less preferred).
According to Table 1, the values displayed in Figure 1 show the 30 search results of the proposed technique for documents retrieved from a localhost database search engine back-end. Documents are sorted in ascending order of Retrieval Status Value (RSV). Documents whose relevance weight falls above the linear weight given by the overall weighted fitness score (F = 0.75), as shown in Figure 1, are considered relevant documents. Hence, 19 documents are ranked and given to users to meet their information needs. Conversely, the 11 search results that fall below the linear weight are rejected (not displayed to the user), as shown in Figure 1.
This section presents results that show the performance of our technique against a traditional tf-idf method. We compared our ranking algorithm with well-known baseline algorithms such as TF-IDF, using the standard "Precision at position n" (P@n) measure. For the information needs and document collection of the experiment, relevance was assessed by experienced system users in their domains of expertise (three PhD students). They are knowledgeable in their domains and were asked to judge the relevance of the retrieved documents on a six-level scale (0 = Harmful, 1 = Bad, 2 = Fair, 3 = Good, 4 = Excellent and 5 = Perfect) with respect to a given query. For comparison of results, we used the P@n metric [23]. Precision at n measures the relevancy of the top n results of the ranking list with respect to a given query (12).
$$P@n = \frac{\text{number of relevant documents in the top } n \text{ results}}{n} \qquad (12)$$
P@n can only handle binary judgments, 'relevant' or 'irrelevant', with respect to a given query at rank n. To compute P@n, the 30 queries were judged on these six levels by the users.
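A minimal Python sketch of (12) is given below. Because P@n needs binary judgments while the users graded documents on six levels, the sketch assumes that a grade of 3 (Good) or higher counts as relevant; this cut-off is not stated in the paper. The function names and the example grades are hypothetical.

```python
def precision_at_n(grades, n, min_relevant_grade=3):
    """P@n for one query: fraction of the top-n results judged relevant.

    grades: graded judgments (0..5) in ranked order, best-ranked result first.
    min_relevant_grade: assumed cut-off that turns a grade into 'relevant'.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    relevant = sum(1 for g in grades[:n] if g >= min_relevant_grade)
    return relevant / n

def mean_precision_at_n(all_grades, n, min_relevant_grade=3):
    """Average P@n over all queries."""
    scores = [precision_at_n(g, n, min_relevant_grade) for g in all_grades]
    return sum(scores) / len(scores)

# Hypothetical judgments for two queries, in ranked order.
query_a = [5, 4, 1, 3, 0, 2, 4, 1, 0, 3]
query_b = [1, 0, 5, 2, 2]

p5_a = precision_at_n(query_a, 5)    # 3 of the top 5 are graded >= 3
p10_a = precision_at_n(query_a, 10)  # 5 of the top 10 are graded >= 3
mean_p5 = mean_precision_at_n([query_a, query_b], 5)
```

Lowering or raising `min_relevant_grade` changes which grades count as relevant, so any reported P@n value should state the cut-off that was used.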
For the evaluation of the algorithm, the prototype system was tested using the 30 queries provided by the system users. The P@n measure is used for the evaluation: it is computed for each query and then averaged over all queries. Figure 2 shows the comparison of the DROPT algorithm with other algorithms on the P@n measure. As the figure shows, our adaptive algorithm outperforms the TF-IDF model. The DROPT algorithm achieves a 25% improvement in P@n compared to TF-IDF. The empirical results have been compared with the traditional relevance feedback model, and the precision of the proposed ranking technique is higher for all the query sets. This achievement results from the combination of context-based algorithms using user preferences for query reformulations. In this regard, the number of top n results shown to users reflects the degree of relevancy of the retrieved documents with respect to a given query at rank n (as judged by the system users).
The document corpora developed for this study are available in a designated database; hence the data will be shared with others for research-related activities. The corpora were manually built with a minimal number of documents for evaluation purposes. For ease of evaluation and because of scalability issues, we use our manually built corpus to evaluate the effectiveness of our developed algorithms. In future work, we will optimize the developed algorithms to work on larger corpora.

VI. CONCLUSION
In this paper, we have proposed a new document ranking technique in IR with the intention of retrieving context information ranked according to information relevance. The proposed adaptive DROPT technique aims to improve the relevance of retrieved documents. The DROPT technique adapts itself to individual user information needs based on the environment and search context. In this respect, our algorithm can judge the relevance of the current search results according to the relevance weight of the retrieved documents. The technique was demonstrated by providing a limited number of ranked documents in response to a given user's query. The new DROPT technique combines approaches suitable for user profiling and mining of user interest for the enhancement of IR system performance, which satisfies the focus of this paper. The DROPT technique is designed to overcome some of the limitations (e.g. low precision and recall, and lack of adaptation to users) of traditional ranking algorithms that ignore the semantic analysis of the document itself. We have used content-based algorithms such as TF-IDF as the baseline in comparison with our DROPT algorithm, which uses user preferences to predict ranked documents.
We used the search engine back-end prototype as an interaction between the information user and information source for document collections and 30 queries from the domain of the three system user experts for evaluation of our DROPT technique. The evaluation was carried out in an in-lab setting, whereby the number of relevant and non-relevant documents in the localhost site was known.
Evaluation of the DROPT technique shows improved performance, measured by 'precision at position n' (P@n), over the chosen baseline algorithms. The proposed DROPT algorithm has some interesting features, such as scalability and adaptability. It is scalable in that any new algorithm can easily be added for comparison, and adaptable in that it adapts itself to user information needs in the environment.