Top-k matching queries for filter-based profile matching in knowledge bases

Abstract. Finding the best matching job offers for a candidate profile, or the best candidate profiles for a particular job offer, constitutes the most common and most relevant type of query in the Human Resources sector. Technically, this requires investigating top-k queries on top of knowledge bases and relational databases. In this paper we propose a top-k query algorithm on relational databases that produces effective and efficient results. The approach considers the partial order of matching relations between job and candidate profiles together with an efficient design of the data involved. In particular, the focus on a single relation, the matching relation, is crucial to meeting these expectations.


Introduction
The accurate matching of job applicants to position descriptions, and vice versa, is of central importance in the Human Resources (HR) domain. It is therefore important to develop knowledge bases (KB) and databases to which job descriptions and curricula vitae (CV) can be uploaded and which can be queried effectively and efficiently by both employers and job seekers. Finding the best matching job offers for a candidate profile, or the best candidate profiles for a particular job offer, constitutes the most common and most relevant type of query, which technically requires investigating top-k queries on top of knowledge bases and relational databases.
A profile describes a set of skills, either those a person possesses, detailed in the form of a CV, or those requested in a job advertisement through its job description. Profile matching is concerned with measuring how well a given profile matches a requested profile. It is not restricted to the Human Resources sector; it is relevant to a wide range of other application areas, such as the real estate domain or the matching of system configurations to requirements specifications. The research in this paper is in line with previous work [5], in which an approach to improving profile matching in the HR sector is introduced. Its starting point is exact matching [6], which has been further investigated in [7].
With respect to querying knowledge bases in the HR domain, the commonly investigated approach is to find the best k (with k ≥ 1) matches for a given profile, either a CV or a job offer [2]. This constitutes what is commonly known as top-k queries. Top-k queries have been thoroughly investigated in the field of databases, usually in the context of the relational data model [3,4,9]. The study of such queries in the context of knowledge bases has also been researched [8].
The most relevant queries in the human resources sector are matching queries driven either by a CV (or a set of CVs) or by a job offer (or a set of job offers). These queries can be characterized as top-k queries, as skyline queries in the case of partial orders on the matching measures, or as a combination of both. Top-k queries in relational databases are generally addressed in three steps: associating weights or aggregates that act as a ranking with the part of the data relevant to the user's needs, potentially joining the relations involved, and ranking (or sorting) the tuples that constitute the expected result set. Computing all these steps at once can consume many resources, depending on the design and nature of the data.
Our contribution to top-k queries in relational databases and knowledge bases takes advantage of the partial order on matching measures and of knowledge bases equipped with matching relations. The expectation is that many of the results for the relational data model can be easily adapted to this case. In particular, the focus on a single relation, i.e. the matching relation, as the driver for querying is expected to ease the extension. This requires investigating the supporting data structures. In view of the many results on efficient top-k queries in the relational data model, we expect that they can largely be carried over to knowledge bases, in which data structures supporting hierarchies can be adopted from databases. The objective is to minimize the selection of tuples and to eliminate the calculation of weights (scores) of tuples within the query itself, by weighting the partial order of concepts of a knowledge base by means of matching measures.
The paper is organized as follows. In Section 2 we cover the main aspects of our theory of profile matching introduced in previous work [5]. The internal physical representation of profile matching is introduced in Section 3. In Section 3.1 we introduce a relational database schema to implement top-k queries, and in Section 3.2 we present an algorithm implementing our approach to top-k queries.

Preliminaries
In previous work [5] we presented our representation of profile matching in the HR domain. That work showed how we represent CV and job profiles in a KB, as well as the syntax and semantics of the language used to represent the terminology of the KB. We also elaborated a matching theory to calculate the matching measures between two given profiles (a CV and a job offer) and the so-called blow-up operators of a KB. Here we briefly recall those concepts that are involved in the elaboration of queries.
Concepts $C_i$ in a TBox of a KB define a lattice $(L, \le)$ with $\sqcap$ and $\sqcup$ as operators for the meet and the join, respectively, $\sqsubseteq$ for the partial order on the elements of the KB, and closure under $\sqcap$ and $\sqcup$ for concepts $C_i$, $\top$, $\bot$. In the following, we refer to concepts $C_i$ in $L$ to denote concepts $C_i$ in a given KB. Thus, the terms TBox and lattice are used as synonyms from now on.
Concerning the formalism for representing the knowledge, a subset of the description logic SROIQ is used in [5]. As for the semantics, concepts are given a set-theoretic interpretation in which a concept is interpreted as a set of individuals and roles are interpreted as sets of pairs of individuals. The interpretation domain is arbitrary and can be infinite. An interpretation $I$ consists of a non-empty set $\Delta^I$, called the interpretation domain, and an interpretation function that associates concept names in a TBox with individuals of the universe: it associates with every atomic concept $C_i$ a set $\Delta(C_i) \subseteq \Delta^I$ and with every role $R$ a binary relation $\Delta(R) \subseteq \Delta^I \times \Delta^I$.
A filter in a lattice $(L, \le)$ is a non-empty subset $F \subseteq L$ such that for all $C, C'$ with $C \le C'$, whenever $C \in F$ holds, then also $C' \in F$ holds.
If $P \subseteq I$ is a profile, then $P$ defines in a natural way a filter $F$ of the lattice $L$ of concepts. Therefore, for determining matching relations we can concentrate on filters $F$ in a lattice.
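The filter condition can be checked mechanically on a small example. The following is a minimal sketch, assuming a four-element lattice in which C1 is the top element, C4 the bottom element, and C2 and C3 are incomparable; this order and all function names are assumptions made for illustration, not taken from the paper.

```python
from itertools import combinations

# Assumed order: C1 is the top element, C4 the bottom, C2 and C3 are incomparable.
ELEMENTS = ["C1", "C2", "C3", "C4"]
# Strict pairs (x, y) meaning x < y; reflexivity is handled in leq().
BELOW = {("C4", "C2"), ("C4", "C3"), ("C4", "C1"), ("C2", "C1"), ("C3", "C1")}


def leq(x, y):
    """Return True if x <= y in the assumed partial order."""
    return x == y or (x, y) in BELOW


def is_filter(subset):
    """A filter is non-empty and upward closed: C in F and C <= C' imply C' in F."""
    if not subset:
        return False
    return all(up in subset for c in subset for up in ELEMENTS if leq(c, up))


def upward_closure(generators):
    """Smallest filter containing the given concepts."""
    return {up for c in generators for up in ELEMENTS if leq(c, up)}


def all_filters():
    """Enumerate every filter by brute force over all non-empty subsets."""
    found = []
    for size in range(1, len(ELEMENTS) + 1):
        for combo in combinations(ELEMENTS, size):
            if is_filter(set(combo)):
                found.append(set(combo))
    return found


if __name__ == "__main__":
    for f in all_filters():
        print(sorted(f))
```

Brute-force enumeration is only reasonable for toy lattices; it is used here because it follows the definition literally.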

Filter-Based Matching
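Let $(L, \le)$ be a lattice, and let $\mathcal{F} \subseteq \mathcal{P}(L)$ denote the set of filters in this lattice. A relative weight measure on $L$ is a function $m : \mathcal{P}(L) \to [0, 1]$. A matching measure is a function $\mu : \mathcal{F} \times \mathcal{F} \to [0, 1]$ such that

$$\mu(F_1, F_2) = \frac{m(F_1 \cap F_2)}{m(F_2)} \qquad (1)$$

holds for some relative weight measure $m$ on $L$ and any $F_1, F_2 \in \mathcal{F}$. The matching measure $\mu$ defined in [6] uses cardinalities, $\mu(F_1, F_2) = \#(F_1 \cap F_2) / \#F_2$. Thus, it is defined by the relative weight measure $m$ on $L$ with $m(A) = \#A / \#L$. Let $w$ be a weight associated with every concept $C \in L$. Then a matching measure $\mu$ is defined by weights $w(C) = m(\{C\}) \in [0, 1]$ such that

$$\mu(F_1, F_2) = \frac{\sum_{C \in F_1 \cap F_2} w(C)}{\sum_{C \in F_2} w(C)} \qquad (2)$$

Example 1. A simple lattice with four elements, $L = \{C_1, C_2, C_3, C_4\}$, defines up to five filters $\mathcal{F} = \{F_1, F_2, F_3, F_4, F_5\}$, as shown in Fig. 1 (a) and (b), respectively.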

Fig. 1: A lattice, its filters and matching measures
If we assign weights to the elements of $L$, for instance $w(C_1) = \frac{1}{10}$, $w(C_2) = \frac{2}{5}$, $w(C_3) = \frac{3}{10}$ and $w(C_4) = \frac{1}{2}$, and calculate the matching measures $\mu(F_i, F_j)$ and $\mu(F_j, F_i)$ (for $1 \le i, j \le 5$) with formula (2), we obtain the result shown in Fig. 2.
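The calculation can be reproduced with a few lines of code. The sketch below takes the weighted matching measure to be the total weight of the concepts shared by both filters divided by the total weight of the second filter, and it assumes the filter compositions F1 = {C1}, F2 = {C1, C2}, F3 = {C1, C3}, F4 = {C1, C2, C3} and F5 = L. Both the formula as written here and the compositions of F1, F4 and F5 are assumptions, since the figure is not reproduced in the text.

```python
from fractions import Fraction

# Weights as given in the text.
WEIGHTS = {
    "C1": Fraction(1, 10),
    "C2": Fraction(2, 5),
    "C3": Fraction(3, 10),
    "C4": Fraction(1, 2),
}

# Assumed filter compositions (the lattice figure is not available here).
FILTERS = {
    "F1": {"C1"},
    "F2": {"C1", "C2"},
    "F3": {"C1", "C3"},
    "F4": {"C1", "C2", "C3"},
    "F5": {"C1", "C2", "C3", "C4"},
}


def weight(concepts):
    """Total weight of a set of concepts."""
    return sum((WEIGHTS[c] for c in concepts), Fraction(0))


def mu(f1, f2):
    """Assumed weighted measure: weight(F1 & F2) / weight(F2)."""
    return weight(FILTERS[f1] & FILTERS[f2]) / weight(FILTERS[f2])


def matching_matrix():
    """All pairwise measures, keyed by (first filter, second filter)."""
    names = sorted(FILTERS)
    return {(a, b): mu(a, b) for a in names for b in names}


if __name__ == "__main__":
    matrix = matching_matrix()
    for a in sorted(FILTERS):
        print(a, [str(matrix[(a, b)]) for b in sorted(FILTERS)])
```

Exact fractions are used instead of floats so that equal measures compare as equal, which matters when ties are later grouped.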

Fig. 2: Matching Measures
A glance at Fig. 2 shows that, in general, the matching measures are not symmetric. If $\mu(F_g, F_r)$ expresses how well a given filter $F_g$ matches a required filter $F_r$, then $\mu(F_r, F_g)$ measures the excess of skills in the given filter.
Example 2. Take, for instance, $F_r = F_3$ as a required profile and two given filters $F_{g1} = F_3$ and $F_{g2} = F_4$. Both are equally and highly qualified for the requirements in $F_r$, given their matching measures $\mu(F_{g1}, F_r) = \mu(F_{g2}, F_r) = 1$. However, with respect to the inverted measures $\mu(F_r, F_{g1}) = 1$ and $\mu(F_r, F_{g2}) = \frac{1}{2}$, $F_3$ matches better than $F_4$, as $C_2$ is not part of the required skill set.

Internal Structure of Profile Matching
Consider a modeled selection process with a set of profiles $\mathcal{P}$, i.e., job and applicant profiles, defined by filters in a lattice $L$. We denote by $\varphi$ the conditions to be met by profiles (either job or applicant profiles) in order to be selected. Then $\mathcal{P}_\varphi$ denotes the set of profiles in $\mathcal{P}$ satisfying $\varphi$, and $P_r \in \mathcal{P}$ is a required profile that drives the selection by stating the conditions $\varphi$.
Note that from now on, when referring to matching measures, we mean matching measures as in formula (2), which includes weights on the elements of the lattice $L$, as shown in Example 1.
Definition 1. For all $P \in \mathcal{P}_\varphi$ and $P' \in (\mathcal{P} - \mathcal{P}_\varphi)$, $P$ is selected and $P'$ is not selected if $\mu(P, P_r) > \mu(P', P_r)$ and no subset of $\mathcal{P}_\varphi$ satisfies this property.
In order to obtain the best k matching profiles (either job or applicant profiles), we first need to query for the filters representing those profiles.
Let $F_r$ be a filter representing the required profile $P_r$, for instance a requested job profile. Consider $l$ filters $F_{g_1}, \dots, F_{g_l}$ in $\mathcal{F}$ representing candidate profiles that match $P_r$ to a certain degree and satisfy $\varphi$, such that their matching measures are above a threshold $t_i \in [0, 1]$, that is, $\mu(F_{g_x}, F_r) \ge t_i$ for $x = 1, \dots, l$. Then every $F_{g_x}$ represents a finite number $j$ of profiles $P_{g_1}, \dots, P_{g_j}$, candidate profiles matching $P_r$, where $\mu(P_{g_y}, P_r) \ge t_i$ for $y = 1, \dots, j$ and $j \le k$.
Note that the relation between filters in $L$ and the number of profiles they represent is given by a function $\nu : \mathbb{N} \to \mathbb{N}$ with $\nu(x) = j$ and $\sum_{x=1}^{l} \nu(x) = k$. Then any $F_{g_{l+1}}$ is not selected, as its matching value satisfies $\mu(F_{g_{l+1}}, F_r) < t_i$. As for the second part of the definition, each filter $F \in \mathcal{F}$ is uniquely determined by its minimal elements, so that we can write $F = \{C_1, \dots, C_r\}$. Every profile represented by a filter is then also uniquely determined by the elements in $F$. Therefore, any profile $P''$ in a subset of $\mathcal{P}_\varphi$ with matching value $\mu(P'', P_r) < t_i$ does not satisfy the property.
Example 3. Assume a job offer profile $P_a$ and four candidate profiles $\{P_b, P_c, P_d, P_e\}$ that meet the requirements in $P_a$. Let us also assume that the five profiles are represented by filters from Example 1.
In order to obtain the best $l$ filters satisfying $\varphi$, we first need to know the minimum matching value that admits $l$ filters. Thus, we start by selecting any threshold $t_i$. If fewer than $l$ solutions are found, the threshold is too strict and we move to a lower one ($t_{i+1}$). If more than $l$ solutions are found, we move to a higher one ($t_{i-1}$). The search stops when the $l$ filters satisfying $\varphi$ are found. With the optimum $t_i$, we query for the related $k$ profiles where $\mu(P_g, P_r) \ge t_i$. This assumes that the matching measures between all filters in $L$, and ultimately between all profiles represented by filters, are given.
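The threshold search can be sketched as a binary search over the distinct matching values, since the number of filters with a measure at or above a threshold can only grow as the threshold is lowered. The function names and the sample values below are hypothetical, and the paper does not prescribe this particular search strategy.

```python
from bisect import bisect_left


def count_at_least(ascending_measures, t):
    """Number of measures >= t in a list sorted in ascending order."""
    return len(ascending_measures) - bisect_left(ascending_measures, t)


def find_threshold(measures, l):
    """Largest threshold admitting at least l filters, or None if fewer exist."""
    ascending = sorted(measures.values())
    # Candidate thresholds are the distinct measures, from strictest to loosest.
    candidates = sorted(set(ascending), reverse=True)
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        t = candidates[mid]
        if count_at_least(ascending, t) >= l:
            best = t          # enough filters qualify: try a stricter threshold
            hi = mid - 1
        else:
            lo = mid + 1      # too few filters qualify: relax the threshold
    return best


if __name__ == "__main__":
    # Illustrative measures of given filters against one required filter.
    sample = {"F1": 0.125, "F2": 0.625, "F3": 0.5, "F5": 1.0}
    print(find_threshold(sample, 2))
```

When several filters share the returned value, more than l filters may qualify, so the caller still has to cut the result to the number it needs.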
As shown in Example 1, matching measures between filters define a matrix (Fig. 2), and so do matching measures between profiles, although the number of filters is assumed to be smaller than the number of profiles ($l \le k$). If we take, for instance, profiles $P_a, P_c, P_d$ from Example 3, we have the minimum number of profiles producing the matching measures needed to define a matrix of profiles. Obtaining the $k$ solutions in $M$ involves referring either to one column or to one row. The process is analogous in both cases, although the perspective differs. Reading the measures by column provides the so-called fitness between profiles, $\mu(P_g, P_r)$, while reading them by row provides the inverted measure $\mu(P_r, P_g)$, denoted as overqualification. Overqualification may be taken into account as emphasized in Example 2, where profiles that match the requirements equally (or almost equally) may be subject to a second ranking with respect to the inverted measure.
If we focus on columns, then when querying for a particular $P_r$ representing the required skill set there are finitely many profiles $P_g$ matching $P_r$, but we only focus on the $k$ elements where $\mu(P_g, P_r) \ge t_i$. Ideally, all elements in the column are totally ordered by the $\le$ relation on $\mu(P_g, P_r)$. The advantage is that, when searching for any given $k$ and $t_i$, we only need to point to the right element in the column and then read the next $k - 1$ consecutive elements in descending order of matching measure.
If we search for the least overqualified profiles, there are finitely many $P_r$ in every row of the matching matrix when querying for a particular $P_g$, but we only need to point to the right $t_i$ and read the next $k - 1$ elements in ascending order of $\mu(P_r, P_g)$.
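The two ways of reading the matching matrix can be illustrated as follows. This is a sketch only: the matrix is stored as a dictionary keyed by pairs of profile names, the numeric values are illustrative, and the function names are not part of the paper.

```python
# M[(a, b)] holds mu(P_a, P_b); the values below are illustrative only.
M = {
    ("Pa", "Pa"): 1.0,   ("Pa", "Pc"): 1.0, ("Pa", "Pd"): 1.0,
    ("Pc", "Pa"): 0.625, ("Pc", "Pc"): 1.0, ("Pc", "Pd"): 0.25,
    ("Pd", "Pa"): 0.5,   ("Pd", "Pc"): 0.2, ("Pd", "Pd"): 1.0,
}


def fitness_column(matrix, profile):
    """Column of `profile`: mu(P_g, profile) for every other P_g, best first."""
    column = [(g, m) for (g, r), m in matrix.items() if r == profile and g != profile]
    return sorted(column, key=lambda item: item[1], reverse=True)


def overqualification_row(matrix, profile):
    """Row of `profile`: mu(profile, P_x) for every other P_x, ascending."""
    row = [(x, m) for (p, x), m in matrix.items() if p == profile and x != profile]
    return sorted(row, key=lambda item: item[1])


def top_k(matrix, required, k, threshold):
    """First k entries of the fitness column with a measure >= threshold."""
    qualified = [(g, m) for g, m in fitness_column(matrix, required) if m >= threshold]
    return qualified[:k]


if __name__ == "__main__":
    print(top_k(M, "Pa", 2, 0.5))
    print(overqualification_row(M, "Pc"))
```

Sorting on every call is exactly the cost that a pre-ordered column avoids; the sketch sorts only to keep the example short.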

Example 4. Profiles $P_a$, $P_c$, $P_d$ are the minimum number of profiles from Example 3 regarding fitness.

Fig. 4: Matching Measures
We explain next how we organize profiles in order to provide an efficient search of these elements when querying for the best k matching profiles. We first assume an identification label for every row and column in the matching matrix $M$, where $\rho_i$ denotes the $i$-th row and $\sigma_i$ the $i$-th column, for $i > 0$.
Definition 3. A profile record of a required profile $P_r$ in column $\sigma_i$ of $M$ is a tuple $(\mu_i, n_i^{>}, n_i^{=}, n_i^{<}, next, prev, p)$ where
- $\mu_i$ denotes the matching measure $\mu(P_g, P_r)$ of the matching profiles $P_g$,
- $n_i^{>}$ denotes the number of profiles $P_g$ in $\sigma_i$ where $\mu(P_g, P_r) > \mu_i$,
- $n_i^{=}$ denotes the number of profiles $P_g$ in $\sigma_i$ where $\mu(P_g, P_r) = \mu_i$,
- $n_i^{<}$ denotes the number of profiles $P_g$ in $\sigma_i$ where $\mu(P_g, P_r) < \mu_i$,
- next is a reference to the next matching value in $\sigma_i$ where $\mu(P_{g+1}, P_r) \ge \mu_i$,
- prev is a reference to the previous matching value in $\sigma_i$ where $\mu(P_{g-1}, P_r) \le \mu_i$, and
- $p$ is a reference to a linked list of profiles matching $P_r$.
The numbers $n_i^{>}$, $n_i^{=}$ and $n_i^{<}$ are particularly important for determining the number of profiles (either applicant or job profiles) represented by a filter without actually querying for them. When querying for the pair $(P_g, P_r)$, it is easy to check whether $n_i^{>} + n_i^{=} \ge k$; if that is not the case, the search continues with the next pair $(P_{g+1}, P_r)$. The references next and prev make it possible to reach the next greater or smaller matching value. Every $\mu_i$ additionally holds a reference $p$ to the related profiles (jobs or applicants) in column $\sigma_i$. All related profiles are organized in a linked-list-like structure, ordered by the smaller-than-or-equal relation on matching values.
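A profile record and the counter check can be sketched as a small data structure. The class and function names below are hypothetical, and a plain Python list of profile names stands in for the reference p to the linked list of profiles.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProfileRecord:
    mu: float            # matching measure shared by the profiles of this record
    n_greater: int       # profiles in the column with a measure > mu
    n_equal: int         # profiles in the column with a measure == mu
    n_smaller: int       # profiles in the column with a measure < mu
    profiles: List[str]  # stands in for the reference p to the profile list
    # Links are excluded from repr and comparison to avoid walking the chain.
    next: Optional["ProfileRecord"] = field(default=None, repr=False, compare=False)
    prev: Optional["ProfileRecord"] = field(default=None, repr=False, compare=False)


def build_records(column: Dict[str, float]) -> List[ProfileRecord]:
    """Group a column {profile: measure} into records, greatest measure first."""
    groups: Dict[float, List[str]] = {}
    for profile, measure in column.items():
        groups.setdefault(measure, []).append(profile)
    total, seen, records = len(column), 0, []
    for measure in sorted(groups, reverse=True):
        size = len(groups[measure])
        records.append(
            ProfileRecord(measure, seen, size, total - seen - size, sorted(groups[measure]))
        )
        seen += size
    # next points to the record with the next greater measure, prev to the smaller one.
    for higher, lower in zip(records, records[1:]):
        lower.next, higher.prev = higher, lower
    return records


def first_record_covering(records: List[ProfileRecord], k: int) -> Optional[ProfileRecord]:
    """First record, in descending order, where n_greater + n_equal >= k."""
    for record in records:
        if record.n_greater + record.n_equal >= k:
            return record
    return None


if __name__ == "__main__":
    recs = build_records({"Pb": 1.0, "Pc": 0.63, "Pd": 0.5, "Pe": 0.5})
    print(first_record_covering(recs, 3))
```

Because the counters are stored with each record, the check needs no access to the profile lists themselves, which is the point made in the text.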
Example 5. Consider the profiles $\{P_a, P_b, P_c, P_d, P_e\}$ as in Example 3 with matching measures $\mu(P_b, P_a) = 1$, $\mu(P_c, P_a) = 0.63$ and $\mu(P_d, P_a) = \mu(P_e, P_a) = 0.5$. Consider also Fig. 5, which represents the profile records of column $\sigma_i$ in $M$ corresponding to $P_a$.

Fig. 5: Linked list of matching measures
Organizing data in a structure like $M$ with profile records allows a fast and efficient search for the $k$ matching profiles. The main objective is to fetch the corresponding column $\sigma_i$ and, using $n_i^{>}$, $n_i^{=}$ and $n_i^{<}$, to calculate how many profile records are needed in order to obtain $k$ profiles, and then to follow the linked list of profiles until the $k$ elements are found. If we need $k = 3$ with $t_i \ge 0.5$, we could reach profile record 1 in Example 5, which accounts for 4 (2 + 2 + 0) matching profiles in total: 2 profiles have matching measures $> 0.5$, 2 profiles match $P_a$ with measure $= 0.5$, and none matches $P_a$ with a measure $< 0.5$. We then know that we have to visit 2 profile records to get the total number of profiles. Of course, one can claim that starting on row 3 of Example 3 is the best approach. Therefore, an ordering on the elements of columns in $M$ seems to be essential.
The definition of matching records on rows of the matching matrix $M$ is analogous to Definition 3.
Definition 4. A matching record of a given profile $P_g$ in row $\rho_i$ of $M$ is a tuple $(\mu_i, n_i^{>}, n_i^{=}, n_i^{<}, next, prev, p)$ where
- $\mu_i$ denotes the matching value $\mu(P_r, P_g)$ of the matching profiles $P_r$,
- $n_i^{>}$ denotes the number of profiles $P_r$ in $\rho_i$ where $\mu(P_r, P_g) > \mu_i$,
- $n_i^{=}$ denotes the number of profiles $P_r$ in $\rho_i$ where $\mu(P_r, P_g) = \mu_i$,
- $n_i^{<}$ denotes the number of profiles $P_r$ in $\rho_i$ where $\mu(P_r, P_g) < \mu_i$,
- next is a reference to a matching value in $\rho_i$ where $\mu(P_{r+1}, P_g) \ge \mu_i$,
- prev is a reference to a matching value in $\rho_i$ where $\mu(P_{r-1}, P_g) \le \mu_i$, and
- $p$ is a reference to a linked list of profiles matching $P_g$.
The consideration regarding the ordering of elements, here on the rows of $M$ as in Definition 4, also has to be taken into account in order to achieve an efficient retrieval of matching overqualified profiles for a given $P_g$.
The following section shows an implementation of the Matrix M and profile records in a relational database schema. We define the database structure supporting our definition of top-k queries in Section 3.1 while in Section 3.2 we show an algorithm that implements our definition of top-k queries in profile matching in relational databases.

Implementation of Top-K Profile Matching
Our implementation of top-k queries as described in Section 3 is designed on a relational database schema. The schema HR shown in Fig. 6 is designed to store and maintain the filters of a given lattice $L$, as well as the profiles and matching measures of an instance of $L$, for the implementation of top-k queries. Note that for the definition of the database schema we use the notation of the unnamed and logic programming perspectives as defined in [1]. Under the unnamed perspective, a tuple $\langle a_1, \dots, a_n \rangle$ is an ordered n-tuple ($n > 0$) of constants, i.e. an element of the Cartesian product $dom^n$ (where $dom$ is the underlying set of constants, the domain). Under the logic programming perspective, a relation $R$ with arity $n$ is an expression of the form $R(a_1, \dots, a_n)$ where $a_i \in dom$ for $i \in [1, n]$ are the attributes of the relation.
The database schema HR is composed of eight relation names:
- Filter describes every filter in a given lattice $L$. If we consider the lattice $L$ in Example 1, Filter contains 5 tuples, one for each of the filters $F_1, \dots, F_5$.
- Concept describes all concepts in a knowledge base. An instance of Concept based on Example 1 is $\langle C_1 \rangle, \langle C_2 \rangle, \langle C_3 \rangle, \langle C_4 \rangle$.
- A further relation records the concepts that compose each filter of the lattice. For instance, in order to specify the composition of filter $F_2$ in Example 1, the tuples $\langle F_2, C_1 \rangle$ and $\langle F_2, C_2 \rangle$ have to be present in the relation.
- Profile describes profiles: ProfileName identifies an instance of a profile $P$, and ProfileType describes it as either a given profile $P_g$ or a required profile $P_r$.
- ProfileByFilter details, for every filter $F$ in $L$, the names of the profiles in a given instance of $L$ that are represented by $F$.
- FilterWeight represents records of the weights assigned to concepts $C$ in an instance of $L$, to be used in formula (2). For instance, in order to describe the weights assigned to a given profile $P$ that is an instance of filter $F_3$ in Example 1, the tuples $\langle F_3, P, C_3, \frac{3}{10} \rangle$ and $\langle F_3, P, C_1, \frac{1}{10} \rangle$ have to be present in the relation.
- MatchingFilters represents the minimum matching measures between profiles as in matrix $M$ (Definition 2), such that for two profiles $P_g, P_r \in \mathcal{P}$ the attribute MValue represents the measure $\mu(P_g, P_r)$ and OValue represents the measure $\mu(P_r, P_g)$. The attributes BMatch, EMatch and SMatch represent the numbers $n_i^{>}$, $n_i^{=}$ and $n_i^{<}$, respectively, of the profile record as in Definition 3. The attributes BOverq, EOverq and SOverq represent the numbers $n_i^{>}$, $n_i^{=}$ and $n_i^{<}$, respectively, as described in Definition 4.
- MatchingProfiles represents the linked list of profiles per matching measure as shown in Example 5, such that for two profiles $P_g$ and $P_r$ in $\mathcal{P}$ the attribute Fitness represents the measure $\mu(P_g, P_r)$ while the attribute Overq represents the measure $\mu(P_r, P_g)$.
Note that MatchingProfiles is intended as a representation of the elements in Definitions 3 and 4, where the attributes NextF and NextO are references to other tuples in the relation (to the tuple with the next smaller Fitness and to the tuple with the next smaller Overq, respectively). The attribute PID is intended as a unique identifier of tuples within the relation, to which NextF and NextO refer. We describe the use of the attributes NextF, NextO and PID in detail in the following section.
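To make the two matching relations more tangible, the following sketch creates them in an in-memory SQLite database. The attribute names are taken from the text, but the column types, the keys and the exact attribute set of each relation are assumptions, since the schema figure is not reproduced here.

```python
import sqlite3

# Assumed column sets and types; only the attribute names come from the text.
SCHEMA = """
CREATE TABLE MatchingFilters (
    RequiredFilter TEXT NOT NULL,
    GivenFilter    TEXT NOT NULL,
    MValue REAL NOT NULL,                            -- fitness mu(P_g, P_r)
    OValue REAL NOT NULL,                            -- inverted measure mu(P_r, P_g)
    BMatch INTEGER, EMatch INTEGER, SMatch INTEGER,  -- counters of Definition 3
    BOverq INTEGER, EOverq INTEGER, SOverq INTEGER,  -- counters of Definition 4
    NextMV TEXT,                                     -- next GivenFilter by MValue
    NextOV TEXT,                                     -- next GivenFilter by OValue
    PRIMARY KEY (RequiredFilter, GivenFilter)
);
CREATE TABLE MatchingProfiles (
    PID INTEGER PRIMARY KEY,
    RequiredProfile TEXT NOT NULL,
    GivenProfile    TEXT NOT NULL,
    GivenFilter     TEXT NOT NULL,
    Fitness REAL NOT NULL,
    Overq   REAL NOT NULL,
    NextF INTEGER REFERENCES MatchingProfiles(PID),  -- next tuple by Fitness
    NextO INTEGER REFERENCES MatchingProfiles(PID)   -- next tuple by Overq
);
"""


def create_schema(conn):
    """Create the two matching relations in the given connection."""
    conn.executescript(SCHEMA)


def table_names(conn):
    """Names of all tables in the database, in alphabetical order."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [name for (name,) in rows]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    print(table_names(connection))
```

Storing NextF and NextO as self-references on PID keeps the linked list inside the relation itself, so following it requires only key lookups.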

Querying the Top-k Candidate Profiles
Filters from a lattice $L$ represent the properties of profiles via the hierarchical dependency of concepts in $L$. Therefore, for every required profile $P_r$ in $\mathcal{P}$ there is a required filter $F_r$ in $L$ representing the profile. Thus, retrieving the top-k candidate profiles for a required filter from the database schema HR is mainly performed by querying the relations MatchingFilters and MatchingProfiles.
For every RequiredFilter in MatchingFilters there are a number of GivenFilter values that satisfy the requirements with a specific matching measure. This can be seen as the minimum number of profiles producing all possible measures as in Definition 2. If we focus on column $\sigma_i$ of $M$, there is a matching measure for every GivenFilter satisfying the requirements (in every row of $M$). The attribute NextMV in MatchingFilters is a reference to another tuple in the relation, defining the sequence of GivenFilter values by the smaller-than-or-equal relation on MValue.
In turn, for every RequiredFilter in MatchingFilters there is a RequiredProfile in MatchingProfiles, where the attribute NextF is a reference to another tuple in the relation (representing the linked list of profiles as in Example 5). These references define an order on the tuples of MatchingProfiles given by the smaller-than-or-equal relation on Fitness. Thus, if RequiredFilter is provided, retrieving the top-k profiles out of the set $\mathcal{P}_\varphi$ as in Definition 1 can be done by pointing to the tuple with the greatest value of Fitness and following the NextF references until $k$ tuples are reached or $\mu(P_g, P_r) < t_i$.
The algorithm in Fig. 7 shows how to retrieve an ordered list of top-k profiles for a given filter in schema HR. Note that we use the notation of relational algebra with subscripts as in [1]; thus, $\sigma$, $\pi$ and $\bowtie$ are the selection, projection and natural join operators, respectively. We also use numeric subscripts to denote relation attributes. For instance, $\pi_1$(MatchingProfiles) is the projection of MatchingProfiles onto its first attribute.
The algorithm accepts as inputs:
- the required filter $F_r$ and
- the number $k \in \mathbb{N}$ of profiles to be retrieved.
The output is composed of:
- the $k$ given profiles $P_{g_1}, \dots, P_{g_k}$,
- the matching measures $\mu_1, \dots, \mu_k$ and
- the overqualification measures $o_1, \dots, o_k$.
The algorithm starts by calculating the number of profile instances satisfying the conditions $\varphi$ in $F_r$ and assigning it to the variable SumP. This is calculated by adding up the values of EMatch in MatchingFilters where RequiredFilter = $F_r$. EMatch, as well as BMatch and SMatch, in MatchingFilters represent numbers of given profiles $P_g$ matching a required profile $P_r$ in MatchingProfiles ($n_i^{=}$, $n_i^{>}$ and $n_i^{<}$, respectively, as in Definition 3). If SumP is less than $k$, this is reported in line 3.
We make use of a temporary relation Results, deleted in line 5 and re-created in line 6. Results is used to recursively query for the ordered list of matching measures later in the algorithm.
The variable $count \in \mathbb{N}$ in line 7 counts the number of profiles found in relation MatchingFilters.
In line 8, the variable $P_r$ holds the required profile defined by the required filter $F_r$. For this, we query ProfileByFilter.
Between lines 11 and 20, the algorithm goes through every $\mu_i$ in the column corresponding to $F_r$ in matrix $M$, which amounts to traversing the linked-list structure in MatchingFilters. Within the same range of lines, it retrieves all related profiles as shown in Example 5 until the $k$ elements are found, which amounts to traversing the linked-list structure of MatchingProfiles.
To achieve this, GF keeps track of the GivenFilter in MatchingFilters to query for in MatchingProfiles until either count > k ($k$ profiles are found) or GF = 0 (there are no more $\mu_i$ elements for $F_r$). Likewise, NextP keeps track of the PID of the next tuple to query for within MatchingProfiles until there are no more elements (NextP = 0) in the linked list.
Between lines 14 and 18, a recursive search on tuples of MatchingProfiles is performed and appended to relation Results. For this, the algorithm searches for all tuples in the linked-list of profiles in MatchingProfiles where the value of RequiredProfile is P r and the GivenFilter = GF . In line 15 the algorithm follows the references in Results by querying for the next element NextF in relation Results that is an instance of PID in MatchingProfiles. Every queried tuple in MatchingProfiles is appended to the tuples in Results.
In line 19, GF is recalculated in order to get the next GivenFilter to be searched for.
In line 17, the algorithm finishes by returning the GivenProfile, Fitness and Overq values of tuples of Results.
Note that, for simplicity, we use SUM, MAX and MIN here as the aggregate operators from SQL.
The algorithm in Fig. 7 searches for the best k profiles matching a required filter $F_r$; overqualification is not considered there. The reason is that overqualification is queried in the same way as fitness, so an algorithm analogous to the one in Fig. 7 is needed, in which the retrieval of the best Overq measures in MatchingProfiles follows the references NextOV in MatchingFilters and NextO in MatchingProfiles.

Conclusion
In this paper we presented an algorithm to address top-k queries over matching measures that produce a ranking. The approach is intended as an alternative to sorting and merging large portions of data in order to produce only very few result elements, namely the best k ranked elements. For the implementation of the algorithm we made use of a linked-list data structure on top of a relational database schema to store and maintain the ranked elements. We still have to take into account the identification of missing requirements in applicant profiles that are essential for the selection of the best candidates. This calls for an investigation of gap queries on the grounds of matching measures, which is the focus of our future research.