Phrase-Level Metaphor Identification Using Distributed Representations of Word Meaning

Metaphor is an essential element of human cognition, often used to express ideas and emotions that might be difficult to express using literal language. Processing metaphoric language is a challenging task for a wide range of applications, from text simplification to psychotherapy. Despite the variety of approaches to metaphor processing, there is still a need for better models that mimic human cognition while exploiting fewer resources. In this paper, we present an approach based on distributional semantics to identify metaphors at the phrase level. We investigate the use of different word embedding models to identify verb-noun pairs where the verb is used metaphorically. Several experiments are conducted to show the performance of the proposed approach on benchmark datasets.


Introduction
Metaphor is a stylistic device used to enrich language and to represent abstract concepts through the properties of other concepts. It can be seen as an analogy between a tenor (target concept) and a vehicle (source concept) that exploits similarities between the two. The sense of a concept such as "harmful plant" can be transferred to another concept such as "poverty" by exploiting the properties of the first concept, which is then expressed in everyday language through linguistic metaphoric expressions such as "...eradicate poverty", "...root out the causes of poverty", or "...the roots of poverty are..." (Lakoff and Johnson, 1980; Veale et al., 2016); these examples can be found in the United Nations Parallel Corpus (Ziemski et al., 2016). In this work, a word or an expression is a metaphor if it has at least one basic/literal sense (more concrete, physical) and a secondary metaphoric sense (abstract, non-physical) which resonates semantically with the basic sense (Steen et al., 2010; Hanks, 2016).
Metaphor processing is one of the most challenging problems for many natural language processing tasks such as machine translation, text summarization and text simplification. Moreover, metaphor processing could be helpful for wider applications such as political discourse analysis (Charteris-Black, 2011) and psychotherapy (Witztum et al., 1988;Gutiérrez et al., 2017).
Understanding metaphors requires deeper levels of language processing that go beyond the sentence surface level. One of the main challenges for the computational modelling of metaphors is their pervasiveness: they occur frequently in everyday language. Moreover, metaphors are often conventionalised to such an extent that they exhibit no defined lexical patterns or signals. Previous approaches rely on extensive lexical resources to identify metaphors and to capture their semantic features. Feature extraction from an annotated corpus is a challenge as well, not only due to the complexity of the task itself but also due to the lack of high-quality annotated corpora. The process of creating such a corpus depends on the task definition as well as the targeted application, and often requires significant effort and time.
In this paper, we introduce a semi-supervised approach that makes use of distributed representations of word meaning to capture metaphoricity. We focus on identifying verb-noun pairs where the verb is used metaphorically. We extract verb-noun grammar relations using the Stanford parser (Chen and Manning, 2014). We then employ pre-trained word embedding models to measure the semantic similarity between a candidate and a predefined seed set of metaphors. A similarity threshold, optimised on a sample dataset, is used to classify the given candidate. The presented approach is evaluated on various test sets using different word embedding algorithms. Additionally, a performance comparison is carried out against the results of the state-of-the-art approach on benchmark datasets.

Related Work
One of the most common tasks in the computational processing of metaphors is "metaphor identification", which is concerned with recognising (detecting) metaphoric expressions in the input text. Metaphor detection can be performed at the word level (token level) or at the phrase level by extracting grammatical relations.
In this paper, we are interested in phrase-level linguistic metaphor detection, focusing on verb-noun phrases (grammatical relations) and employing semantic representations of word meaning. Due to space limitations, we therefore discuss only the most relevant research in this section; extensive literature reviews are presented in (Zhou et al., 2007; Shutova, 2015). Some recent work on metaphor detection has looked into the utilisation of semantic representations, through word embeddings, to design supervised systems for metaphor detection (Rei et al., 2017). Our approach also utilises such representations, but in a semi-supervised manner to avoid the need for large training corpora.

Rei et al. (2017) introduced a neural network architecture to detect adjective-noun and verb-noun metaphoric constructions. Their system comprises three main components: word gating, vector representation mapping, and a weighted similarity function. The word gating models the association between the properties of the source and target domains via a non-linear transformation of the word embedding vectors of the given candidate pair; the word embeddings used in this step are obtained from a pre-trained model. Then, a vector representation mapping is carried out to prepare a "new metaphor-specific" vector space using the original word embeddings. Finally, a weighted cosine similarity function is used to automatically select the vector dimensions important for the metaphor detection task. The authors experimented with different pre-trained word representations, namely a skip-gram model and an attribute-based model. Two different datasets, referred to as the TSV dataset (Tsvetkov et al., 2013) and the MOH dataset (Mohammad et al., 2016), were used to train the system and optimise its parameters as well as to assess its performance.
Another recent approach investigated whether property-based semantic word representations can provide better concept generalisation for detecting metaphors than dense linguistic representations. The authors proposed property-based vectors obtained through cross-modal mapping between dense linguistic representations and a property-norm semantic space. They built count-based distributional vectors and employed a skip-gram model trained on Wikipedia articles as their dense linguistic representations. The property-norm semantic space is obtained from the property-norm dataset of McRae et al. (2005). The TSV dataset is used to train and test a support vector machine (SVM) classifier that classifies adjective-noun pairs using the introduced cognitively salient properties as features.
Another interesting approach employed multi-modal embeddings of visual and linguistic features to detect metaphoricity in text. The approach obtains linguistic word embeddings using a log-linear skip-gram model trained on Wikipedia text, and visual embeddings using a deep convolutional neural network trained on image data. This is done both for the individual words and for the full phrases of adjective-noun and verb-noun pairs. The cosine similarity function is then employed to measure the distance between the phrase vector and the corresponding vectors of its constituent words. Metaphor classification is based on an optimised threshold on the output of the cosine similarity function. The authors used the TSV and the MOH datasets to train and test their system as well as to optimise the classification thresholds.
Modelling metaphor in a distributional semantic space through linear transformation to improve vector representation has been investigated by Gutiérrez et al. (2016). The authors introduced a compositional distributional semantic framework to identify adjective-noun metaphoric expressions.
A variety of lexical and semantic features, including lexical abstractness and concreteness, imageability, named entities, part-of-speech tags, and word supersenses from WordNet (Fellbaum, 1998), have been employed to develop supervised systems for metaphor detection (Köper and Schulte im Walde, 2017; Tsvetkov et al., 2013; Hovy et al., 2013; Turney et al., 2011).

Shutova et al. (2010) presented one of the earliest approaches to the computational modelling of metaphor that avoids task-specific hand-crafted knowledge and large annotated resources. They introduced a semi-supervised approach to identify verb-noun metaphors using corpus-driven distributional clustering. Their strategy is based on clustering abstract nouns by their contextual features in order to capture the metaphorical senses associated with the source concept. The system exploits a small set of metaphoric expressions as a seed to detect metaphors in a semi-supervised manner. In a follow-up work, Shutova and Sun (2013) investigated the use of hierarchical graph factorization clustering to derive a network of concepts, learning metaphorical associations in an unsupervised way which were then used as features to identify metaphors. We consider the work of Shutova et al. (2010) as the baseline for our proposed approach, and we explain its reimplementation details in subsection 3.3.

Birke and Sarkar (2006) introduced TroFi, which is considered the first statistical system to identify the metaphorical senses of verbs in a semi-supervised way. The authors adapted a statistical similarity-based word sense disambiguation approach to cluster literal and non-literal senses. A predefined set of seed sentences is utilised to compute the similarity between a given sentence and the seed sentences.

Methodology
The idea behind our approach is based on finding synonyms and near-synonyms of metaphors. Our approach employs vector representations and semantic similarity to classify verb-noun pairs, extracted from a sentence using a parser, as potential metaphoric candidates. A candidate is classified as a metaphor or not by measuring its semantic similarity to a predefined small seed set of metaphors, which acts as our sample of existing known metaphors. Metaphoric classification is performed using a similarity threshold value previously calculated on a development dataset. The following subsections explain the hypothesis behind this work and our proposed approach, in addition to the reimplementation of the state-of-the-art semi-supervised system used as our baseline.

Hypothesis
Our hypothesis in this work is that a given metaphoric candidate shares characteristics and semantic features with some positive examples of metaphors. However, simply calculating the similarity between a given verb-noun candidate and a metaphoric seed is not enough, due to the effect of both the verb and the noun on the overall similarity score. For example, consider a metaphoric seed such as "break agreement" and two candidates such as "break promise" and "break glass". The semantic similarities between the word embedding vectors of the seed and the two candidates, measured by the cosine similarity function, are 0.5304 and 0.6376, respectively, using a Word2Vec (Mikolov et al., 2013) word embedding model pre-trained on the Google News dataset. This indicates that both candidates are similar to the seed, and there is not enough information to tell which one should be classified as a metaphor. Table 1 shows the similarity values of the two candidates and the most similar metaphoric seeds from the predefined seed set.

We therefore look at the individual words of the candidate, exploiting the fact that semantically similar or related words lie near each other in the embedding space while unrelated words are far apart. We expect the noun "promise" to be in the neighbourhood of "agreement" in the semantic space, while "glass" will not be. So if both candidates share similar verbs, classification can be based on the similarity of the nouns; in that case, "break promise" can be classified as a metaphor due to the vicinity of its noun to the noun of the metaphoric seed, while "break glass" will not. Since one positive (metaphoric) example is not enough for precise classification, we use a small set of verb-noun pairs, hereafter referred to as the seed set, where the verb is used metaphorically. The specification of the seed set is explained in detail in section 4.
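The role of noun similarity can be illustrated with a small numpy sketch; the vectors below are hypothetical toy embeddings invented for illustration, not values from a real pre-trained model:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings; in practice the vectors come from a
# pre-trained Word2Vec or GloVe model.
emb = {
    "agreement": np.array([0.9, 0.8, 0.1, 0.0]),  # abstract noun
    "promise":   np.array([0.8, 0.9, 0.2, 0.1]),  # abstract, near "agreement"
    "glass":     np.array([0.1, 0.0, 0.9, 0.8]),  # concrete, far from "agreement"
}

# The noun of the metaphoric seed "break agreement" is close to "promise"
# but far from "glass", which is the signal the approach relies on.
sim_promise = cosine_sim(emb["agreement"], emb["promise"])
sim_glass = cosine_sim(emb["agreement"], emb["glass"])
print(sim_promise > sim_glass)  # True with these toy vectors
```

With real embeddings the gap is smaller but the same ordering is expected, since abstract nouns cluster together in the semantic space.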

Approach
We start with the seed set of metaphoric verb-noun pairs S = {(v_s, n_s)}. Given a target verb-noun candidate (v_t, n_t) that needs to be classified, we calculate the distance between every verb v_s in S and the candidate verb v_t using the cosine distance measure:

D_ts = 1 - cos(v_t, v_s) = 1 - (v_t · v_s) / (||v_t|| ||v_s||)

Table 1: The cosine similarity between the candidates "break promise" and "break glass" and the top 10 metaphoric seeds in the seed set, using a Word2Vec word embedding model pre-trained on the Google News dataset.
This gives a list of seed verbs ranked by their distance to the candidate verb; we then select the top n nearest verbs and collect the nouns paired with them in the seed set:

Y_vt = top_n { n_s : (v_s, n_s) ∈ S }, ranked by D_ts

Finally, the average of the distances between these nouns and the candidate noun is calculated. If this average is below a threshold δ, the candidate phrase is classified as a metaphoric expression:

(1 / |Y_vt|) Σ_{n_s ∈ Y_vt} D(n_t, n_s) < δ

Table 2 shows the cosine distance between the verbs and nouns of the candidates "break promise" and "break glass" versus the verbs and nouns of the top 10 metaphoric seeds from the seed set, using a pre-trained Word2Vec word embedding model on the Google News dataset; these 10 seeds have the verbs nearest to the candidate verb.
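A minimal sketch of this procedure, using a toy dictionary lookup in place of a real pre-trained embedding model (the vectors, seed pairs, and parameter values below are illustrative only):

```python
import numpy as np

def cos_dist(u, v):
    """Cosine distance: 1 minus cosine similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def classify(candidate, seeds, emb, top_n=2, delta=0.5):
    """Classify a (verb, noun) candidate as metaphoric or not.

    seeds : list of metaphoric (verb, noun) pairs
    emb   : word -> vector lookup (toy stand-in for a pre-trained model)
    """
    v_t, n_t = candidate
    # Rank seed pairs by verb distance to the candidate verb.
    ranked = sorted(seeds, key=lambda s: cos_dist(emb[s[0]], emb[v_t]))
    # Nouns paired with the top_n nearest seed verbs.
    nouns = [n_s for _, n_s in ranked[:top_n]]
    # The average noun distance decides metaphoricity.
    avg = sum(cos_dist(emb[n_s], emb[n_t]) for n_s in nouns) / len(nouns)
    return avg < delta

# Illustrative toy vectors: abstract nouns cluster together,
# concrete nouns lie elsewhere in the space.
emb = {
    "break":     np.array([1.0, 0.1, 0.0]),
    "mend":      np.array([0.9, 0.2, 0.1]),
    "agreement": np.array([0.1, 1.0, 0.1]),
    "marriage":  np.array([0.2, 0.9, 0.1]),
    "promise":   np.array([0.1, 0.9, 0.2]),
    "glass":     np.array([0.0, 0.1, 1.0]),
}
seeds = [("break", "agreement"), ("mend", "marriage")]

print(classify(("break", "promise"), seeds, emb))  # True (metaphoric)
print(classify(("break", "glass"), seeds, emb))    # False (literal)
```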

Baseline
We consider the system introduced by Shutova et al. (2010) as our baseline system. In this subsection, we explain in detail the reimplementation of this approach and the related findings. The system consists of four main components: a seed set, a clustering component, a candidate extraction component, and a filtering component. The seed set is obtained from the British National Corpus (BNC) (Burnard, 2009) and consists of 62 metaphoric verb-noun pairs (more details are given in section 4). Spectral clustering (Meila and Shi, 2001) is used to cluster the abstract concepts (nouns) and the concrete concepts (verbs), and an association (mapping) is then drawn between the two sets of clusters using the seed set. The candidate extraction component employs the Robust Accurate Statistical Parsing (RASP) parser (Briscoe et al., 2006) to extract verb-subject and verb-direct object grammar relations. After that, the clusters linked through the seed set are used to identify potential metaphoric candidates. The filtering component is finally used to filter these candidates based on a selectional preference strength (SPS) measure (Resnik, 1993): verbs exhibiting weak selectional preferences are considered to have lower metaphorical potential. An SPS threshold was set experimentally to 1.32, and candidates whose verbs have an SPS value below this threshold are discarded.
In our reimplementation, we employed the Stanford Parser instead of the RASP parser to extract the grammar relations, and we implemented the filtering component to calculate the SPS. SPS is calculated using a simplified Resnik model which models the association of the verb (predicate) with the noun (instead of a class) using counts from the BNC corpus. The verb clusters were originally developed using VerbNet (Schuler, 2006) and the noun clusters were developed from the 2,000 most frequent nouns in the BNC corpus. Since the clusters were obtained from a relatively small dataset, we suspected that this might lead to limited coverage, which is confirmed later in the system evaluation. This is one of the limitations of this system: a candidate is either in the clusters or not, and if the candidate's noun appears in a noun cluster that is not mapped to the cluster containing the verb, the candidate is discarded.

Table 2: The cosine distance between the verbs and nouns of the candidates "break promise" and "break glass" versus the verbs and nouns of the top 10 metaphoric seeds in the seed set, using a Word2Vec word embedding model pre-trained on the Google News dataset.
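The simplified Resnik SPS used in the filtering component can be sketched as the KL divergence between the noun distribution conditioned on a verb and the overall noun distribution; the verb-object counts below are hypothetical, not BNC counts:

```python
import math
from collections import Counter

def sps(verb, pairs):
    """Selectional preference strength of `verb` (simplified Resnik model):
    KL divergence of P(noun | verb) from the prior P(noun)."""
    nouns = Counter(n for _, n in pairs)
    verb_nouns = Counter(n for v, n in pairs if v == verb)
    total, v_total = sum(nouns.values()), sum(verb_nouns.values())
    return sum((c / v_total) * math.log2((c / v_total) / (nouns[n] / total))
               for n, c in verb_nouns.items())

# Hypothetical verb-object counts: "drink" is selective (strong
# preferences), "see" combines with almost anything (weak preferences).
pairs = ([("drink", "water")] * 8 + [("drink", "tea")] * 2 +
         [("see", "water")] * 2 + [("see", "tea")] * 2 +
         [("see", "bird")] * 3 + [("see", "house")] * 3)

print(sps("drink", pairs) > sps("see", pairs))  # True
```

In the baseline, verbs whose SPS falls below the experimentally set threshold of 1.32 are discarded as having low metaphorical potential.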

System Architecture
As shown in Figure 1 below, our system consists of three main components: a parser, a seed set of metaphoric expressions, and a pre-trained word embedding model.

Parser: Since our aim is to identify metaphors at the phrase level, the Stanford parser is used to extract the grammar relations in a given sentence. We used the recurrent neural network (RNN) parser in the Stanford CoreNLP toolkit to extract dependencies, focusing on verb-subject and verb-direct object grammar relations.
Seed Set: We used the seed set of Shutova et al. (2010) as our set of existing known metaphoric expressions (positive examples). The seed set consists of 62 verb-subject and verb-direct object phrases where the verb is used metaphorically. These seeds were originally extracted from a subset of the BNC corpus containing 761 sentences. The sentences were annotated for grammatical relations to extract the specified grammar relations, which were then filtered and manually annotated for metaphoricity. Examples of metaphors in the seed set are "mend marriage", "break agreement", "cast doubt", and "stir excitement".
Word Embedding Model: This work utilises distributional vector representations of word meaning to calculate the semantic similarity between a candidate and the seed set. Word2Vec and GloVe (Pennington et al., 2014) are two widely used word embedding algorithms that construct embedding vectors based on the distributional hypothesis (Firth, 1957), but using different machine learning techniques. In this work, we investigated the effect of using different pre-trained models and similarity measures, as detailed in the next section.

Experimental Settings
In this section, we give an overview of the experimental settings of our proposed approach and the test sets that are used to assess the performance of the methodology described above.

Models and Parameters
The utilised similarity measures, word embedding models, and system parameters are defined as follows:

Similarity Measures: We examined two similarity measures:

- Cosine Distance Metric: The cosine similarity function measures the cosine of the angle between two vectors. Given two vectors u and v, the cosine distance can be defined as:

  cos_dist(u, v) = 1 - (u · v) / (||u|| ||v||)

- Word Mover's Distance (WMD) (Kusner et al., 2015): defined as the minimum cumulative distance the embedded words of one text need to "travel" to reach the embedded words of the other.
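Both measures can be sketched as follows; the vectors are toy examples, and the WMD function exploits the fact that, for two two-word phrases with uniform word weights, the optimal transport plan is simply the cheaper of the two one-to-one word matchings (a shortcut that does not generalise to longer texts):

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cos(u, v); smaller values mean semantically closer words."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def wmd_pair(phrase_a, phrase_b, emb):
    """Word Mover's Distance between two 2-word phrases with uniform
    word weights and Euclidean ground distance. In the 2x2 case the
    transport LP attains its optimum at one of the two permutation
    matchings, so a min over both suffices."""
    d = [[float(np.linalg.norm(emb[a] - emb[b])) for b in phrase_b]
         for a in phrase_a]
    return 0.5 * min(d[0][0] + d[1][1], d[0][1] + d[1][0])

# Hypothetical toy vectors.
emb = {
    "break":     np.array([1.0, 0.0]),
    "mend":      np.array([0.9, 0.1]),
    "agreement": np.array([0.0, 1.0]),
    "promise":   np.array([0.1, 0.9]),
}
print(cosine_distance(emb["agreement"], emb["promise"]))
print(wmd_pair(("break", "agreement"), ("mend", "promise"), emb))
```

In practice, WMD over full pre-trained vocabularies is usually computed with an optimal-transport solver rather than this two-word shortcut.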
Embeddings Models: We experimented with two different pre-trained vector representations of word embeddings:

- Word2Vec Google News: The model is trained on about 100 billion words from the Google News dataset and contains 300-dimensional vectors for 3 million words, using the approach described in (Mikolov et al., 2013). The model is based on the skip-gram neural network architecture, which employs the negative sampling training algorithm and sub-sampling of frequent words with a window size of 10.
- GloVe Common Crawl: We used a model pre-trained on the Common Crawl dataset containing 840 billion tokens of web data (a vocabulary of about 2 million words). The vectors are 300-dimensional, trained with 100 training iterations.
For simplicity, we used a single vector representation for each word, ignoring multi-word combinations such as the phrasal verbs "hold back" and "flip through"; we are planning to address this issue in the future.

System's Parameters: We performed experiments on a development set to select the values of the parameters top_n and δ mentioned in subsection 3.2. The best value for n was found to be 10 (the top 10 nearest verbs). The most suitable distance average threshold δ was found to be 0.80 for the GloVe Common-Crawl-840 model and 0.85 for the Word2Vec Google-News model. These values give a good trade-off between false positives and false negatives.
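The threshold selection on the development set can be sketched as a simple grid search maximising F-score; the distances, labels, and candidate grid below are illustrative stand-ins for the actual development data:

```python
def f_score(preds, gold):
    """F-score for binary metaphor predictions against gold labels."""
    tp = sum(p and g for p, g in zip(preds, gold))
    fp = sum(p and not g for p, g in zip(preds, gold))
    fn = sum(g and not p for p, g in zip(preds, gold))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def tune_delta(avg_dists, gold, candidates):
    """Pick the distance threshold delta maximising F-score on a dev set.
    avg_dists holds the average noun distance per dev example (lower =
    more metaphor-like); gold is True for metaphoric examples."""
    return max(candidates,
               key=lambda d: f_score([a < d for a in avg_dists], gold))

# Hypothetical development-set distances and labels.
avg_dists = [0.30, 0.45, 0.70, 0.90, 0.40, 0.85]
gold = [True, True, False, False, True, False]
delta = tune_delta(avg_dists, gold, [d / 100 for d in range(5, 100, 5)])
print(delta)  # 0.5 with this toy data
```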

Test Sets
Two different test sets are used to evaluate our approach:

VUA Test Set: We use a subset of the training verbs dataset from the VU Amsterdam Metaphor Corpus (VUA) (Steen et al., 2010) provided by the NAACL 2018 Metaphor Shared Task. The original VUA corpus is a subset of the BNC Baby corpus consisting of 117 texts covering various genres: academic, conversation, fiction, and news. Although the dataset is annotated at the token level, its availability and the fact that it is already annotated encouraged us to use it for assessing our approach. The verbs dataset consists of around 17,240 annotated verbs; we retrieved the original sentences of these verbs from the VUA corpus, which yielded around 8,000 sentences. We then parsed these sentences using the Stanford Parser and extracted around 5,000 verb-direct object relations. From these, 300 arbitrarily selected verb-noun pairs (160 positive and 145 negative examples), where the verb is used metaphorically or literally, form our test set. Table 3 shows some examples from this test set.
MOH Dataset: A manually annotated dataset of verb-subject and verb-object pairs. It is referred to as MOH as it was originally obtained from Mohammad et al. (2016), who annotated different senses of verbs in WordNet for metaphoricity. Verbs were selected if they have more than three and fewer than ten senses. The example sentences from WordNet for each verb were then extracted and annotated by 10 annotators using crowd-sourcing. In a subsequent step, the verb-subject and verb-direct object grammar relations were extracted from the original dataset. The final dataset consists of 647 pairs, of which 316 instances are metaphorical and 331 are literal.

Metaphor            Not Metaphor
reveal approach     collect passport
break corporation   use power
make money          abolish power
see language        perform shuffle
make error          decorate wall
face criticism      put stage
give access         read book
lay foundation      research joke
make time           tell story
abuse status        give key

Table 3: Examples of metaphoric and literal verb-noun pairs from the VUA test set.

Evaluation
In this section, we evaluate our approach using the different test sets, pre-trained word embedding models and similarity measures. Additionally, we compare the performance of our approach against the baseline system explained in subsection 3.3. We used four standard evaluation metrics, namely precision, recall, F-score and accuracy.
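For reference, the four metrics reduce to simple counts over the binary confusion matrix:

```python
def evaluate(preds, gold):
    """Precision, recall, F-score and accuracy for binary
    metaphor/literal predictions (True = metaphoric)."""
    tp = sum(p and g for p, g in zip(preds, gold))
    fp = sum(p and not g for p, g in zip(preds, gold))
    fn = sum(g and not p for p, g in zip(preds, gold))
    tn = sum(not p and not g for p, g in zip(preds, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f, accuracy

# Toy predictions against toy gold labels.
preds = [True, True, False, True, False]
gold = [True, False, False, True, True]
print(evaluate(preds, gold))
```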

Results
We applied our system to the test sets introduced above and compared it to the defined baseline system. Table 4 shows the results of the experiment carried out on the VUA test set, together with the results obtained from the baseline system. Table 5 shows the performance of our system on the whole MOH dataset.

Discussion and Analysis
It can be seen from the results above that our approach performs better using GloVe as the pre-trained word embedding model and cosine distance as the similarity metric. It is also noticeable that the system suffers from low recall when using the Word2Vec model with the cosine distance function. This might be due to the limited coverage of the seed set, where the top 10 most similar metaphors are not enough to detect new metaphor candidates.

We manually examined our system's output on the MOH dataset. Our system was able to correctly detect metaphoric expressions such as "absorb knowledge, attack cancer, blur distinction, buy story, capture essence, swallow word, visit illness, wear smile" as well as literal ones such as "attack village, build architect, leak container, steam ship, suck poison". Some of the false positives, where our system predicted metaphor while the gold label was literal, include "ascend path, blur vision, buy love, communicate anxiety, jam mechanism, lighten room, line book, push crowd", which could be regarded as metaphors depending on the context.

Our system was also able to spot some inconsistency in the annotations of the VUA test set. For example, the verb-noun pair "win election" is detected as a metaphor by our system, while it has three different annotations across the rest of the VUA dataset (the verb "win" is annotated once as a metaphor and twice as not a metaphor while having "election" as its direct object). Additionally, in the VUA corpus the verb "win" is annotated as a metaphor with similarly abstract objects, as in "win match" and "win bid". This illustrates one of the differences between preparing a dataset for word-level detection, as in the VUA corpus, and preparing one for phrase-level detection. Moreover, it shows that a verb-noun pair may or may not be metaphoric depending on the context. It also highlights the minor differences between the definitions of metaphor in Lakoff and Johnson (1980) and Steen et al. (2010), which in turn emphasises that the metaphorical sense does not depend solely on the properties of individual words (Gutiérrez et al., 2016).
The results also indicate that the baseline system has very low recall on the introduced test sets. The reason, as mentioned in subsection 3.3, is that it utilises clusters developed using the BNC corpus, which likely limits the coverage of the system, in addition to the limitation of the small seed set (shared with our approach). For example, out of the 300 pairs in the VUA test set, only 7 candidates were included in the final classification, as the remaining words were not present in the clusters. Similarly, out of the 647 pairs in the MOH dataset, only 4 could be recognised as candidates.
Our system's performance could be improved by increasing the size of the seed set and optimising the system's parameters accordingly (which we are planning to address in the future). To investigate this point, we ran an additional experiment using 10-fold cross-validation on the MOH dataset, in which the metaphoric pairs of each of the 10 splits in turn served as our seed set. The best results in terms of precision, recall, F-score, and accuracy are 0.5945, 0.756, 0.6657, and 0.6290, respectively. These results are obtained using the GloVe word embedding model pre-trained on the Common Crawl dataset and the cosine distance as the similarity function, with the same parameter values. In this experiment, we noticed that the values of n and the threshold δ should be adapted as the number of seeds increases.
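This cross-validation scheme can be sketched as follows, under the assumption that in each fold the metaphoric pairs of one split act as the seed set while the remaining pairs are used for evaluation (an illustrative reconstruction, not the exact experimental code):

```python
import random

def ten_fold_seed_splits(pairs, labels, k=10, rng=None):
    """Yield (seed_set, test_pairs, test_labels) for each fold: the
    metaphoric pairs of one fold act as the seed set, and the remaining
    folds are used for evaluation."""
    rng = rng or random.Random(0)
    idx = list(range(len(pairs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for fold in folds:
        fold_set = set(fold)
        seed = [pairs[i] for i in fold if labels[i]]
        rest = [i for i in idx if i not in fold_set]
        yield seed, [pairs[i] for i in rest], [labels[i] for i in rest]

# Toy dataset of 20 labelled verb-noun pairs.
pairs = [("v%d" % i, "n%d" % i) for i in range(20)]
labels = [i % 2 == 0 for i in range(20)]
folds = list(ten_fold_seed_splits(pairs, labels))
print(len(folds))        # 10 folds
print(len(folds[0][1]))  # 18 test pairs per fold
```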
We did not compare our results to Rei et al. (2017) and similar recent systems, as they are not directly comparable to ours; one of these systems uses a different test split from the MOH dataset to evaluate its performance. Moreover, these works propose fully supervised approaches which utilise negative (literal) examples as well as positive (metaphoric) examples to train their systems, whereas our approach is semi-supervised (similar to Shutova et al. (2010)) and uses only the positive (metaphoric) examples. A direct performance comparison would therefore be misleading.

Conclusion and Future Work
In this work, we presented a semi-supervised approach to detect metaphors using distributed representations of word meaning. Different word embedding models have been investigated to identify phrase-level metaphors, focusing on verb-noun expressions. The system utilises a predefined seed set of metaphoric expressions to detect unseen metaphoric expressions in a given sentence. As discussed, in contrast to other state-of-the-art approaches, our proposed approach employs fewer lexical resources and does not require annotated datasets or highly engineered features. This gives it the flexibility to be easily adapted to new languages or text types. We have performed several experiments to assess the performance of our approach on benchmark datasets. As part of our future work, we are planning to investigate the effect of increasing the number of seeds on the system's coverage and to extend this approach to detect other metaphoric syntactic constructions, taking into account multi-word expressions such as phrasal verbs.