A domain independent semantic measure for keyword sense disambiguation

Understanding the user's intention is crucial in human-machine interaction. When dealing with text input, Word Sense Disambiguation (WSD) techniques play an important role. WSD techniques typically require well-formed sentences as context to operate, as well as predefined catalogues of word senses. However, such conditions do not always hold, for example when there is a need to disambiguate the keywords of a query, or the sets of tags describing a Web resource. In this paper, we propose a keyword disambiguation method based on the semantic relatedness between words and ontological terms. Taking advantage of the semantic information captured by word embeddings, our approach maps a set of input keywords to their meanings within a given target ontology. We focus on situations where the available linguistic information is very scarce, hampering natural language based approaches. Experimental results show the feasibility of our approach without prior training on the target domain.


INTRODUCTION
In any information system which requires user interaction, being able to understand the user's intention is a crucial requirement. In particular, being capable of disambiguating the input words is frequently the starting point of an interpretation process. Natural Language Processing (NLP) techniques that tackle disambiguation usually assume the presence of rich linguistic information. However, users are used to keyword search queries (a.k.a. Web search queries): sets of words which are a projection of the actual information need that they are expressing [4]. For example, a user could type "island Java" to look for any information related to the island of Java. Keyword search queries exhibit their own language structure [3]; thus, they require specific methods to disambiguate the words in such a scenario, where rich linguistic information might not be available.
Regarding word and meaning representation, recent advances in NLP have focused on different word embedding models [14], a set of language modeling and feature learning techniques where elements from a vocabulary are mapped into a vector space capturing their distributional semantics [9]. However, one of their main limitations is that the possible meanings of a word are combined into a single representation. Such a limitation has been tackled so far by: 1) representing the individual meanings of words as distinct vectors in the space (e.g., sense2vec), and 2) more recently, by adopting contextual word embeddings [13]. Regarding our problem, the first approach allows us to control the represented meanings, but there exist scenarios where we cannot know all the different senses at training time. Contextual word embedding models, the second approach, have been shown to capture complex characteristics of word use such as polysemy. However, these models heavily rely on the structure of a sentence, which is not available in keyword input scenarios, where the user introduces a set of words without any evident structure, or even in an arbitrary order. In particular, these models assign very different vectors to the same word when it appears in a well-formed sentence and in a keyword search, even when it has the same meaning in both. To illustrate this issue, Table 1 shows several examples for the words 'Java' and 'Apache' using BERT [8] contextual word embeddings.
In this paper, we propose a keyword disambiguation method based on the semantic relatedness between words and ontological terms. Our proposal maps a set of input words to their appropriate meanings in a given target ontology, extending our previous works on semantic relatedness [10] and disambiguation [11] to exploit the semantic information captured by word embeddings. Being completely decoupled from the target ontology makes our approach easily adaptable to any domain: it only requires a specific (unsupervised) embedding model trained for such a domain, which is easy to obtain from a domain document corpus. To evaluate the performance and flexibility of our approach, we have carried out a thorough experimentation in the context of Word Sense Disambiguation (WSD) in general domains, and of Concept Linking in a more specific domain (clinical knowledge).
The rest of the paper is structured as follows. Section 2 discusses related works. In Section 3 we describe our semantic relatedness measure approximation. In Section 4 we present the disambiguation algorithm that we use, and Section 5 focuses on our experimental results. Finally, our conclusions and future work appear in Section 6.

RELATED WORK
Semantic Relatedness and WSD. Semantic relatedness is the degree to which two objects are related by any kind of semantic relationship [5], and it lies at the core of many NLP applications (such as WSD or Concept Linking).
Regarding WSD methods, supervised and knowledge-based approaches are typically used. Supervised approaches make use of sense-annotated training data and exploit linguistic features from corpora as training information. However, one important drawback is that they strongly depend on a sense-annotated corpus, which might not be available, and they need to be updated as the ontology evolves. Moreover, a target word or any of its senses may never be observed during training, in which case the system will not be able to annotate it. On the other hand, knowledge-based systems exploit linguistic properties of lexical resources to perform WSD. They usually create a graph representation of the input text and then apply different graph analysis algorithms over it (e.g., PageRank). To the best of our knowledge, the two SOTA knowledge-based systems are UKB [1] and KEF [16]. However, they heavily depend on generic lexical relationships that are difficult to find in general ontologies.
Semantics and Embeddings. Depending on how they model meaning and where they obtain it from, embedding techniques providing meaning-aware word vectors can be classified into: 1) contextual word embeddings [13], which are unsupervised models that induce word senses from huge text corpora by analyzing their contextual semantics, and 2) knowledge-based methods, which exploit the sense inventories of lexical resources. Unfortunately, with contextual word embeddings we cannot target a particular ontology, as we have no control over the concept detail/granularity, and the learned senses might not be aligned to any particular human-readable structure. In addition, they depend heavily on sentence structure, which renders them unsuitable for keyword inputs, where word omissions and order alterations are frequent. Regarding knowledge-based embedding methods, they require knowing all the senses at training time, so they are not easily adaptable to new scenarios (e.g., addition/deletion of senses, evolving ontologies, etc.). Thus, we aim at requiring neither re-training nor newly labelled data, while being capable of disambiguating words against any sense repository.

SEMANTIC RELATEDNESS MEASURE
Extending the measure proposed in [10], we compute the relatedness between a word w and an ontological term t as a blend of two levels of the semantic description of t:

rel(w, t) = β · rel0(w, t) + (1 − β) · rel1(w, t)   (1)

with rel0(w, t) measuring the relatedness of the different synonyms of t and w, rel1(w, t) measuring the relatedness of OC(t) and w, and β being a parameter which governs how their values are blended:

rel0(w, t) = f0({relw(w, si) | si ∈ Syn(t)})   (2)

rel1(w, t) = f1({rel0(w, ti) | ti ∈ OC(t)})   (3)

where relw is the relatedness between words; Syn(t) = {s1, s2, ..} are the synonyms of t (notice that |Syn(t)| ≥ 1); OC(t) = {t1, t2, ..} are the terms of the ontological context of t (|OC(t)| ≥ 0); and f0 and f1 are the aggregation functions applied to the sets of relw and rel0 values, respectively (we advocate the use of average or maximum functions, see Section 5; in the original formulation the average was proposed, but we generalize it to explore the influence of the linkage used between the sets). Thus, the ontological term t is characterized by considering two levels of its semantic description: the term label and its synonyms (Equation 2), and its ontological context (Equation 3).
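The two aggregation levels and their blend can be sketched in a few lines of Python. This is a minimal illustration (function and parameter names are ours); the word-level measure relw is passed in as a parameter, since it is defined separately, and f0/f1 default to the average:

```python
import numpy as np

def rel0(w, syn_vecs, rel_w, f0=np.mean):
    # Equation 2: aggregate the word-level relatedness between the input
    # word vector w and each synonym vector of the ontological term t.
    return f0([rel_w(w, s) for s in syn_vecs])

def rel1(w, oc, rel_w, f1=np.mean, f0=np.mean):
    # Equation 3: aggregate rel0 over the terms of the ontological context
    # OC(t); each context term is represented by its own synonym vectors.
    return f1([rel0(w, syns, rel_w, f0) for syns in oc]) if oc else 0.0

def rel(w, syn_vecs, oc, rel_w, beta=0.5):
    # Equation 1: blend the two levels of the term's semantic description.
    return beta * rel0(w, syn_vecs, rel_w) + (1 - beta) * rel1(w, oc, rel_w)
```

Using maximum instead of average linkage is a matter of passing f0=np.max or f1=np.max, which is the generalization explored in Section 5.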
The original proposal for relw in [10] was the Normalized Web Distance NWD(x, y), a generalization of Cilibrasi and Vitányi's Normalized Google Distance NGD(x, y) [7] that can use any Web search engine as source of frequencies. NWD ranges from 0 to ∞, so to obtain a relatedness measure in the range [0, 1] which increases as the distance decreases, the following transformation was applied:

relw(x, y) = e^(−2·NWD(x, y))   (4)

To exploit word embeddings, substituting relw(x, y) by the cosine similarity, as broadly adopted, was not possible, as its values range in [−1, 1]. So, to obtain a measure in the range [0, 1], we propose to use the angular distance instead:

relw(x, y) = 1 − arccos(cossim(vx, vy)) / π   (5)

Thus, we substitute Equation 4 by Equation 5 directly in Equation 2 (we validate this substitution experimentally in Section 5.1). For those cases in which the label of the ontological term is multi-word, we compute the centroid of its word vectors, as is broadly adopted.
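As a concrete sketch of the angular transformation in Equation 5 (a minimal NumPy implementation; function names are ours), including the centroid used for multi-word labels:

```python
import numpy as np

def angular_relatedness(u, v):
    # Equation 5: map cosine similarity, which ranges in [-1, 1], to a
    # relatedness score in [0, 1] via the normalized angular distance.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point drift
    return 1.0 - np.arccos(cos) / np.pi

def label_vector(word_vecs):
    # Multi-word labels are represented by the centroid of their word vectors.
    return np.mean(word_vecs, axis=0)
```

Identical vectors yield 1, orthogonal vectors 0.5, and opposite vectors 0, matching the intended [0, 1] range.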

KEYWORD DISAMBIGUATION
Extending the distributional hypothesis [9], our hypothesis is that the most significant words in the disambiguation context are the ones most highly related to the word to disambiguate; such words conform its active context, AC. More formally, let w be an element of an input sequence of words S, K ⊆ S be the set of all possible keywords in the input, C ⊆ K the set of keywords of the disambiguation context, and k ∈ K the target keyword to disambiguate. Thus:

Definition 1. Given a context C ⊆ K and a keyword to disambiguate k ∈ K, the active context of k is the subset AC ⊆ C such that ∀wi ∈ AC, ∀wj ∈ (C − AC): relw(wi, k) > relw(wj, k).
To obtain the active context AC ⊆ C of k, we: 1) remove repeated words and stopwords from C; 2) apply the semantic relatedness measure (relw in our case) between each keyword wi ∈ C and k; and 3) construct AC with the wi ∈ C whose relatedness score is above a certain threshold.

Disambiguating the keywords. We build on the algorithm we proposed in [11]. This algorithm takes as input k, AC, and a set of possible senses for k, Sk, and performs three main steps: 1) obtaining the semantic relatedness between k and each candidate sense s ∈ Sk; 2) calculating the overlap between AC and OC(s) for each s ∈ Sk; and 3) re-ranking the senses according to their frequency of use (when such information is available). The output is a score for each sense s ∈ Sk that represents the confidence level of s being the right sense. Note that Sk is not restricted to any particular dictionary, as it could be dynamically built from, e.g., different ontological resources.

Algorithm extensions. Apart from using the updated relw within the relatedness formulae, we modify the second step to exploit word embedding-based representations instead of the overlap between the active context of the keyword being disambiguated and the ontological context of a term t. We have studied the following methods to capture the influence of the contexts:

Average: This method calculates the average vector of the different bags of words involved in the disambiguation, under the assumption that semantically coherent groups of words should stand out from the others. Thus, this method computes the average relatedness between the word vectors from AC and OC(s).

Smooth Inverse Frequency (SIF): Arora et al. [2] proposed to represent a sentence by a weighted average vector of its word vectors, from which the most frequent component, obtained using PCA/SVD, is subtracted. This component may encompass words that occur frequently in a corpus but lack semantic content, thus not contributing to the disambiguation. We propose to compute a new score for each sense s ∈ Sk by measuring the distance between the centroid of the active context and the SIF vector of OC(s).

Top-K Nearest Words: We apply the same active-context hypothesis to OC(s): the words in the OC of a sense which are the closest to k and AC should be the most significant for the disambiguation. Thus, from each OC(s), we select the top-k keywords nearest to {k} ∪ AC. Then, we compute the distance between the centroid of AC and that of the top-k keywords of OC(s) to obtain a new score.

We explored the performance of these methods to rule out inappropriate ones; we report the results obtained in the next section.
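The active-context construction and the Top-K Nearest Words scoring can be sketched compactly. This is an illustration under simplifying assumptions: all words are already mapped to vectors, the threshold value is arbitrary, the names are ours, and we include the keyword itself in the context centroid (our choice, so the score is defined even when AC is empty):

```python
import numpy as np

def rel_w(u, v):
    # Word-level relatedness: angular similarity in [0, 1] (see Section 3).
    cos = np.clip(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)),
                  -1.0, 1.0)
    return 1.0 - np.arccos(cos) / np.pi

def active_context(k_vec, context_vecs, threshold=0.6):
    # Keep only the context words whose relatedness to the target keyword
    # scores above the threshold (step 3 of the construction above).
    return [v for v in context_vecs if rel_w(v, k_vec) > threshold]

def topk_nearest_score(k_vec, ac_vecs, oc_vecs, top_k=3):
    # Top-K Nearest Words: from the ontological context of a sense, keep
    # the top_k word vectors closest to the keyword and its active context,
    # then compare centroids to score the sense.
    targets = [k_vec] + list(ac_vecs)
    nearest = sorted(oc_vecs,
                     key=lambda v: max(rel_w(v, t) for t in targets),
                     reverse=True)[:top_k]
    return rel_w(np.mean(targets, axis=0), np.mean(nearest, axis=0))
```

Computing this score for every candidate sense and keeping the highest-scoring one yields the disambiguation decision.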

EXPERIMENTAL EVALUATION
In order to validate our proposal, we have carried out three main sets of experiments. Due to space limitations, we only present here the overall conclusions drawn from the results.

Correlation with Human Judgment
We analysed the correlation of the angular distance with human judgment in a basic word-to-word comparison. For this purpose, we used different datasets available in the literature (see the summary presented by Lastra et al. [12]) and 12 different pre-trained word embedding models built with different techniques.
Results: In general, using the angular distance to calculate the relatedness between pairs of words correlates with human judgment. The correlation varies depending on the dataset and the model used, but it achieves an average Spearman correlation of 59.79% (SD 17.73), whereas the cosine distance achieves an average of 61.82% (SD 16.58). Thus, despite a small decrease in correlation with human judgment compared to the cosine distance (broadly adopted in the literature), these results support using this measure for our purposes.

Word Sense Disambiguation Evaluation
To evaluate our proposal in a general domain, we used two datasets, SemEval 2013 and SemEval 2015, with WordNet as target ontology. As these datasets contain sentences in natural language and our proposal is focused on keyword-based inputs, we built two additional setups: 1) starting from each natural language dataset, we derived a dataset by keeping the words of the types most usual in keyword expressions (nouns, adjectives, and verbs), and 2) to restrict the input even further, we derived a dataset from each of these latter ones by taking groups of three consecutive keywords. Regarding embeddings, we used the pre-trained vectors of [6], since in preliminary tests they offered the best results. Finally, we built the OC for each concept including its synonyms, hypernyms, and hyponyms; we also included their glosses (available in WordNet). The best configuration was achieved using the average aggregation function and the Top-K Nearest Words method. We also observed that giving more importance to the OC improved the results (β = 0.25). Results: We compare our best configuration with the current SOTA systems in WSD, specifically: the supervised system proposed by Vial et al. [15], UKB [1], and KEF [16]. Table 2 presents the results obtained for all the systems.
The performance of the system by Vial et al. [15] is compromised when we transform the input into a keyword expression. Although we did not outperform its results, we are close in the SemEval 2013 dataset. Note that this system requires an annotated dataset, which might not be available in the target domain. Regarding UKB, it suffers when dealing with short inputs, where it does not have enough information. In such a setup, our proposal gets better results in SemEval 2013, and close ones in SemEval 2015. Besides, note that this approach strongly depends on the availability of lexical relationships, which are difficult to find in non-lexical domain ontologies. Regarding KEF, it remains stable in every case. However, this comes at additional computational cost and, as UKB does, it heavily depends on lexical knowledge, so it is not portable to other domains with different ontologies. To sum up, our proposal achieves good performance in general domain scenarios where the linguistic information is scarce. In addition, it provides the flexibility to work independently of the resources used.

Concept Linking Evaluation
To test the domain flexibility of our proposal, we performed an evaluation on the task of Concept Linking in the health domain. For this, we used the dataset provided in Task 1 of the ShARe/CLEF eHealth 2013 Evaluation Lab. We addressed subtask b, which consists of mapping annotated disorder mentions to SNOMED CT concepts included in UMLS. We used ElasticSearch to index all the concepts, which gave us a syntactic baseline to compare against. In this setup, we used all the mentions appearing in each document as the context of the mention to be disambiguated. For each mention, we retrieved N (set to 10) candidate concepts from ElasticSearch and ran our disambiguation method. As word embeddings, we used a publicly available w2v model trained on the PMC and PubMed corpora. Results: The best method was Top-K Nearest Words, along with the maximum aggregation function. In this case, hypernyms did not contribute as much, and the best results were obtained when the OC contained synonyms and hyponyms. Table 3 reports the precision (P) and precision at 3 (P@3) results achieved. Ranking the candidate concepts semantically strongly improved the syntactic baseline results. We also noted the increasing performance in this scenario as we increased β: in this ontology, concepts are very close to each other both syntactically and semantically, so it is more important to give more weight to synonymy. Summing up, our proposal improves this concept linking task by helping to disambiguate the terms in a different domain without any specific training for it, which shows the flexibility of our approach.
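The semantic re-ranking over the syntactic candidates can be sketched as follows, assuming the N candidate concepts and their label word vectors have already been retrieved from the ElasticSearch index (identifiers and function names are illustrative):

```python
import numpy as np

def rel_w(u, v):
    # Angular relatedness in [0, 1] between two word vectors (see Section 3).
    cos = np.clip(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)),
                  -1.0, 1.0)
    return 1.0 - np.arccos(cos) / np.pi

def rerank(mention_vecs, candidates):
    # candidates: list of (concept_id, label_word_vectors) pairs, e.g. the
    # N = 10 hits returned by the syntactic ElasticSearch query.
    m = np.mean(mention_vecs, axis=0)
    scored = [(cid, rel_w(m, np.mean(vecs, axis=0)))
              for cid, vecs in candidates]
    # Highest semantic relatedness first; P and P@3 are read off the top.
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

The top-ranked concept gives the linking decision, and the top three give P@3.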

CONCLUSIONS AND FUTURE WORK
In this paper we have presented a keyword disambiguation approach based on a semantic relatedness measure which exploits the semantic information captured by word embeddings, capable of mapping words to their meanings within a given ontology. With our proposal:
• We provide a method to calculate the semantic relatedness between words and concepts of an ontology.
• We are able to disambiguate keyword-based inputs, where the linguistic information is scarce, and link them to concepts from an ontology in a flexible way. Our proposal can be adapted to any domain in a dictionary-decoupled way, lowering the potential data requirements (no annotated data is required).
• We have evaluated our proposal via thorough experimentation in general and specific domains, competing with current SOTA approaches and showing the flexibility of our approach.
As future work, we want to explore how contextual word embeddings [13] could be used in this context. Moreover, given the existing differences between general and specific domains (where concepts are both syntactically and semantically closer), we want to explore how to take into account the syntactic and semantic features of the concepts to adapt our proposal dynamically to the scenario at hand.