Concept Extraction Using Pointer-Generator Networks

Concept extraction is crucial for a number of downstream applications. However, surprisingly, straightforward techniques such as single-token/nominal-chunk-to-concept alignment or dictionary lookup, as implemented, e.g., in DBpedia Spotlight, still prevail. We propose a generic, open-domain, OOV-oriented extractive model that is based on distant supervision of a pointer-generator network leveraging bidirectional LSTMs and a copy mechanism. The model has been trained on a large annotated corpus compiled specifically for this task from 250K Wikipedia pages, and tested on regular pages, where the pointers to other pages are considered ground truth concepts. The outcome of the experiments shows that our model significantly outperforms standard techniques and, when used on top of DBpedia Spotlight, further improves its performance. The experiments furthermore show that the model can be readily ported to other datasets, on which it equally achieves state-of-the-art performance.


Introduction
In knowledge discovery and representation, the notion of concept is most often used to refer to sense, i.e., 'abstract entity' or 'abstract object' in the Fregean dichotomy of sense vs. reference [9]. In Natural Language Processing (NLP), the task of Concept Extraction (CE) deals with the identification of the language side of the concept coin, i.e., Frege's reference. Halliday [15] offers a syntactic interpretation of reference. In his terminology, it is a "classifying nominal group". For instance, renewable energy or nuclear energy are classifying nominal groups: they denote a class (or type) of energy, while, e.g., cheap energy or affordable energy are not: they do not typify, but rather qualify energy (and are thus "qualifying nominal groups").
CE is crucial for a number of downstream applications, including, e.g., language understanding, ontology population, semantic search, and question answering; it is also the key to entity linking [21]. In generic open-domain subject-neutral discourse across different (potentially unrelated) subjects, indexing the longest possible nominal chunks and their head words located in sequences of tokens between specified "break words" [32] and special dictionary lookups such as DBpedia Spotlight [5] and WAT [26] are very common techniques. They generally reach outstanding precision, but low recall, due to the constant evolution of the language vocabulary. Advanced deep learning models that already dominate CE in specialized closed-domain discourse on one or a limited range of related subjects, e.g., biomedical discourse [13,31], and that are also standard in keyphrase extraction [2,24], are an alternative. However, such models need a tremendous amount of labeled data for training.
We present an operational CE model that utilizes pointer-generator networks [28] and bidirectional long short-term memory (LSTM) units [11] to retrieve concepts from general-discourse textual material. Furthermore, since for a generic, domain-independent concept extraction model we need a sufficiently large training corpus that covers a vast variety of topics, and no such annotated corpora are available, we opt for distant supervision to create a sufficiently large and diverse dataset. Distant supervision consists in the automatic labeling of potentially useful data by an easy-to-handle (not necessarily accurate) algorithm, so as to obtain an annotation that is likely to be noisy but, at the same time, contains enough information to train a robust model [25]. Two labeling schemes are considered. Experiments carried out on a dataset of 250K+ Wikipedia pages show that copies of our model trained differently and joined in an ensemble significantly outperform standard techniques and, when used on top of DBpedia Spotlight, further improve its performance by nearly 10%.

Related work
In this section, we focus on the review of generic-discourse CE; for a comprehensive review of the large body of work on specialized-discourse CE, in particular on biomedical CE, see, e.g., [14]. We also do not discuss recent advances in keyphrase extraction [2], because their applicability to generic concept extraction is limited due to the specificity of that task.
The traditional CE techniques either interpret any single- or multiple-token nominal chunk as a concept [32] or perform a dictionary lookup, as, e.g., DBpedia Spotlight [5], which matches and links identified nominal chunks to DBpedia entries (6.6M entities, 13 billion RDF triples), based on the Apache OpenNLP models for phrase chunking and named entity recognition (NER). Given the large coverage of DBpedia, the performance of DBpedia Spotlight is rather competitive. However, the presence of an entry obviously cannot always be ensured. Consider, e.g., the paper title "Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing", in which DBpedia Spotlight does not detect "Bloom embeddings" or "incremental parsing", as there are no such entries in DBpedia.
Like DBpedia Spotlight, AIDA [33] relies on an RDF repository, YAGO2. WAT and its predecessor TagMe [26] use a repository of possible spots made of wiki-anchors, titles, and redirect pages. Both TagMe and WAT rely on statistical attributes called link probability and commonness; WAT furthermore draws on a set of statistics to prune the set of mentions using an SVM classifier. Wikifier [3] focuses on relation extraction, relying on a NER that uses gazetteers extracted from Wikipedia and simple regular expressions to combine several mentions into a single one. All of them are used for state-of-the-art entity linking and (potentially nested) entity mention detection and typing [16,34]. FRED [10] also focuses on the extraction of relations between entities, with frames [8] as the underlying theoretical constructs. Unlike Wikifier and FRED, OLLIE [23] does not rely on any precompiled repository. It outperforms its strong predecessor REVERB [7] in relation extraction by expanding the set of possible relations and including contextual information from the sentence from which the relations are extracted.
A number of works focus on the recognition of named entities, which are the most prominent concept type. NERs work at the sentence level and aim at labeling all occurring instances. Among them, Lample et al. [19] provide a state-of-the-art NER model that avoids the traditionally heavy use of hand-crafted features and domain-specific knowledge. The model is based on bidirectional LSTMs and Conditional Random Fields (CRFs) that rely on two sources of information about words: character-based word representations learned from an annotated corpus and unsupervised word representations learned from unannotated corpora. Another promising approach to NER is the fine-tuning of a language representation model such as, e.g., BERT [6]. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, including NER, without substantial task-specific architecture modifications.

Description of the model
We implement a deep learning model and a large-scale annotation scheme for distant supervision in order to cope autonomously with dictionary-independent generic CE and to complement, to the extent possible, the existing lookup-based approaches so as to increase their recall. In addition, we would like our model to perform decently on pure NER tasks, with only a small gap to models specifically tuned for the NER datasets. The model follows the well-established tendency in information extraction, adopted for NER and extractive summarization, and envisages CE as an attention-based sequence-to-sequence learning problem.

Overview of the model
As the basis of our model, we use the pointer-generator network proposed in [28], which aids the creation of summaries with accurate reproduction of information. In each generation step t, the pointer allows for copying words w_i from the source sequence to the target sequence using the distribution of the attention layer a^t, while the generator samples tokens from the learned vocabulary distribution P_vocab, conditioned on a context vector h_t^* produced by the same attention layer, which is built from the hidden states h_i of an encoder and the states s_t of a decoder (in each case, a bidirectional LSTM [11]). In addition, a coverage mechanism is applied that modifies a^t using a coverage vector c^t to avoid undesirable repetitions in the output sequence. Specifically, to produce a word w, the above-mentioned distributions are combined into a single final probability distribution, weighted using the generation probability p_gen ∈ [0,1]:

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t,$$

where P_vocab(w) is the vocabulary distribution, which is zero if w is an out-of-vocabulary (OOV) word; a^t is the attention distribution; w_i are the tokens of the input sequence; and $\sum_{i: w_i = w} a_i^t$ is zero if w does not appear in the source sequence. According to [28], the individual vectors, distributions, and the probability p_gen are defined as follows:

$$e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t),$$
$$h_t^* = \sum_i a_i^t h_i, \qquad c^t = \sum_{t'=0}^{t-1} a^{t'},$$
$$P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b'),$$
$$p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr}),$$

where ^T stands for the transpose of a vector, x_t is the decoder input, and σ is the sigmoid function.
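The combination of the generator and copy distributions can be illustrated with a minimal pure-Python sketch (toy numbers; the variable names mirror the symbols above and are not taken from any actual implementation):

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Combine generator and copy distributions as in a pointer-generator
    network: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention
    over the source positions i where w_i == w."""
    words = set(p_vocab) | set(source_tokens)
    p_final = {}
    for w in words:
        gen_part = p_gen * p_vocab.get(w, 0.0)          # zero for OOV words
        copy_part = (1.0 - p_gen) * sum(
            a for tok, a in zip(source_tokens, attention) if tok == w
        )                                               # zero if w not in source
        p_final[w] = gen_part + copy_part
    return p_final

# Toy example: "Bloom" is OOV for the generator but present in the source,
# so it can still receive probability mass through the copy distribution.
p_vocab = {"the": 0.5, "president": 0.3, "*": 0.2}      # generator distribution
source = ["Bloom", "embeddings", "the"]
attention = [0.7, 0.2, 0.1]                             # attention over source
dist = final_distribution(p_gen=0.6, p_vocab=p_vocab,
                          attention=attention, source_tokens=source)
```

Note that the result is still a proper distribution: the generator part sums to p_gen and the copy part to 1 − p_gen.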
To adapt this basic model to the task of CE, we applied several modifications to it (cf. Figure 1): (i) following Gu et al. [12], we use separate distributions for copy attention and general attention, instead of a single one for both; (ii) since experiments have shown that encoders and decoders with several LSTM layers perform better than those with a single layer, we work with multi-layer LSTMs; how many layers is determined using a development dataset; (iii) we adapt the forms of the input and target sequences to the specifics of the task of CE. The input comprises tokens and their part-of-speech (PoS) tags (e.g., 'The DT President NN is VBZ elected VBD by IN a DT direct JJ vote NN'). The target sequence concatenates the concepts in the order in which they appear in the text and separates them by the token "*", especially introduced to partition the output (e.g., 'President * direct vote').
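The input/target serialization just described can be sketched with two small helpers (an illustration only; the tag set and separator follow the examples in the text, while the function names are ours):

```python
def make_input_sequence(tagged_tokens):
    """Interleave tokens with their PoS tags, as in
    'The DT President NN is VBZ ...'."""
    return " ".join(f"{tok} {tag}" for tok, tag in tagged_tokens)

def make_target_sequence(concepts):
    """Concatenate concepts in order of appearance, separated by the
    special partition token '*'."""
    return " * ".join(concepts)

tagged = [("The", "DT"), ("President", "NN"), ("is", "VBZ"),
          ("elected", "VBD"), ("by", "IN"), ("a", "DT"),
          ("direct", "JJ"), ("vote", "NN")]
src = make_input_sequence(tagged)                       # model input
tgt = make_target_sequence(["President", "direct vote"])  # training target
```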
This model is naturally applicable to the task of CE, since it facilitates the selection and transfer of subsequences of tokens (= concepts) from a given source sequence of tokens (= text input) to the target sequence (= partitioned sequence of concepts). The pointer mechanism implies the ability to cope with OOV words, which is crucial for universal CE, while the generator implies the ability to adjust the vocabulary distribution for selecting the next word (which might be the termination token "*") based on a given context vector, which allows us to implicitly take into account the domain specifics and linguistic features that facilitate the task of CE. Furthermore, the updating of the vocabulary distribution adds the possibility to weaken or strengthen the copy effect and thus to learn to distinguish concepts with outer modifiers (such as, e.g., "hot air", "[fully] crewed aircraft", "reinforced group") from multiword concepts (such as, e.g., "hot air balloon", "unmanned aerial vehicle", "reinforced concrete").

Training and applying the model
For training, token sequences are taken from the annotated sentences (see the compilation of the annotated training dataset in Section 4.2 below) with a sliding overlapping window of a fixed maximum length (see the Experiments section), which is minimally expanded where needed in order not to deal with incomplete concepts at the window borders. The trained model is applied to unseen sentences, which are also split into sequences of tokens with an overlapping window of the same size, but without any expansion. Finally, since the output format does not include offsets, the corresponding mentions in the plain text must be determined. In particular, following [16], we find all possible matches for all detected concepts and then successively select non-nested concepts from the beginning to the end of the sentence, giving priority to the longest concept in case of multiple choices.
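The match-back step can be sketched as follows (an illustrative reimplementation of the stated selection heuristic, not the authors' code): find all occurrences of each detected concept in the sentence and scan left to right, keeping non-overlapping spans and preferring the longest at each position.

```python
import re

def recover_mentions(sentence, concepts):
    """Find all occurrences of the detected concepts in the plain text and
    select non-nested, non-overlapping spans left to right, giving priority
    to the longest concept in case of multiple choices at a position."""
    spans = []
    for c in concepts:
        for m in re.finditer(re.escape(c), sentence):
            spans.append((m.start(), m.end(), c))
    # Sort by start offset; on ties, longest span first.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    selected, last_end = [], 0
    for start, end, c in spans:
        if start >= last_end:          # drop nested/overlapping candidates
            selected.append((start, end, c))
            last_end = end
    return selected

sent = "The hot air balloon rose; hot air filled the envelope."
mentions = recover_mentions(sent, ["hot air", "hot air balloon"])
```

Here the nested "hot air" inside "hot air balloon" is discarded, while the later standalone "hot air" is kept.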

Datasets
In what follows, we describe the data and the procedure for their weak annotation to create extensive training and test datasets.

Data
We take a snapshot of the WordNet synset-typed Wikipedia [27], from which we use the raw texts of the Wikipedia pages and the text snippets of the links to other pages as ground truth concepts, regardless of their type; cf. Figure 2. These links often share the headings of their anchor pages, which are in most cases real-world entities, cf., e.g., "Arthur Heurtley House", "Price Tower", etc. Sometimes, they are also lexical variations of the terms behind the link; e.g., the highlighted link in the fragment "the two small coastal battleships General-Admiral Graf Apraxin and Admiral Senyavin" leads to the page named "Coastal defence ship". A manual annotation of the multi-word expressions in 100 randomly selected sentences (each with at least one multi-word link) by a professional linguist showed that at least 63% of such phrases are indeed concepts (cf., e.g., "punctuated equilibrium", "chief of staff", "2004 presidential election"). For our work, we selected several data subsets from the collection of Wikipedia pages: 250K pages to be weakly, but densely, annotated. Out of these 250K pages, 220K are used for training and 30K for validation. In addition, we use 7K Wikipedia pages with the sparse gold standard annotation as a development set for choosing the parameters of the distant supervision and selecting the best model among several models trained with different parameters, and 7K pages with the sparse gold standard annotation as a test set.

Compilation of the training corpus
We automatically create a (noisy) training corpus using two different annotators over a large unlabeled dataset: DBpedia Spotlight, with the value of its confidence coefficient that yields the highest recall, and our own algorithm, which uses a number of rules and heuristics. Our labeling is based on a sentence-wise analysis of statistical and linguistic features of token sequences. First, named entities and multi-token concepts are identified, and then single-token concepts. The algorithm covers the following tasks:

1. Application of a statistical NER model. A significant number of concepts in Wikipedia are capitalized terms, which can be captured by statistical named entity recognizers (NER); see the Related Work section above. Therefore, SpaCy's state-of-the-art NER model [17] is applied first, with successive elimination of the used tokens from further processing. The next steps are then applied separately to the fragments of text located between the identified NEs.

2. Selection of candidate n-grams that match a set P of PoS patterns, where N stands for "noun", i.e., NN|NNS|NNP|NNPS, J stands for "adjective", i.e., JJ|JJR|JJS, V for "verb", limited to VBD|VBG|VN, CD for "cardinal number", DT for "determiner", and "of" is matched as the exact token. Each pattern matches an n-gram with two open-class lexical items and at most two auxiliary tokens between them.

3. Assessment of the distinctiveness of each selected n-gram. The distinctiveness of the selected n-grams is assessed using word co-occurrences from the Google Books Ngram Corpus [20]. Assume a given n-gram T_1 A_1 A_2 T_2 ∈ c_k, where T_1 and T_2 are open-class lexical items, A_1 and A_2 are optional auxiliary tokens, and c_k is the set of all n-grams of a particular pattern p_k ∈ P. We use T_1 A_1 A_2 T_2 as a point of a function that passes through the normalized document frequencies of the set of similar n-grams T_1 A_1 A_2 T_j, j ∈ {i | T_1 A_1 A_2 T_i ∈ c_k}, arrayed in ascending order, in order to find the tangential angle α_1 ∈ [0°, 90°) at this point.
Similarly, α_2 ∈ [0°, 90°) is the tangential angle at the same point of the analogous function built over the n-grams T_i A_1 A_2 T_2, i.e., with the first open-class item varied. We leverage these angles to check how prominent an n-gram is, i.e., to what extent it differs from its neighbors in overall usage. If an n-gram is located among equally prominent n-grams, with a tangential angle close to 0°, we do not consider it a potential part of a concept, since it does not show the notable distinctiveness inherent in concepts, especially in common idiosyncratic concepts. The thresholds α_min1 and α_min2 (α_min1 ≥ α_min2) for the minimally allowed tangential angles, such that max(α_1, α_2) ≥ α_min1 and min(α_1, α_2) ≥ α_min2, are predetermined in development experiments. We calculate the tangential angles through a central difference approximation on a coarse-grained grid,

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h},$$

where h was chosen large enough (h = 50 in general; at the borders, the maximum possible h is used) to smooth the curve and thereby eliminate the numerous abrupt changes in document frequency with relatively low amplitude. The approximation is thus intentionally carried out less accurately, so that the resulting values form, in practice, a curve with longer monotonous sections; cf. Figure 3 ("Relation between document frequency and coarse-grained tangential angle approximation") for an example of assessing the prominence of the n-gram "prestressed concrete", i.e., in the above notation, T_1 equals "prestressed ADJ", A_1 and A_2 are omitted, and T_2 equals "concrete NOUN". Table 1 illustrates how the approximations of the tangential angles differentiate classifying nominal groups from qualifying nominal groups. Most of the candidates with a large tangential angle have a separate article in Wikipedia (i.e., they are likely to be concepts), while candidates with a small tangential angle or without an entry in Google Books (OOV) generally do not have a Wikipedia article. This shows that the chosen criterion for differentiating concepts is suitable for weak annotation within distant supervision.
Grid search was applied to find the best combination of the parameters α_min1 and α_min2 from the three possible tangential angles corresponding to different levels of distinctiveness of a concept: 85°, 60°, and 0°. As a result, α_min1 = 60° and α_min2 = 0° gave the best scores on the development set and were used for the annotation of the training set.

4. Combination of intersecting highly distinctive parts into concepts. We combine those distinctive n-grams that share common tokens and iteratively drop the last token of each group if it is not a noun, in order to end up with complete NP candidate concepts (e.g., "value of the played card" is a potential concept corresponding to the patterns {N of DT V; V N}). Some single-word concepts may already appear at this point.

5. Recovery of missed single-word concepts. To enrich the set of candidate concepts, we consider all unused nouns and numbers in a text as single-word concept candidates.

The obtained training corpus contains a moderate amount of noise: the proposed annotation algorithm outperforms some baselines and might be used for CE by itself (cf. setup (A) in Tables 2 and 3, with the results of the evaluation in the following section).
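The coarse-grained central-difference computation of Step 3 can be sketched as follows (an illustrative reconstruction under our reading of the procedure; the exact normalization and grid handling of the original may differ):

```python
import math

def tangential_angle(freqs, i, h=50):
    """Approximate the tangential angle (in degrees) at index i of a curve
    of normalized document frequencies arrayed in ascending order, using a
    central difference on a coarse grid of half-width h (shrunk at the
    borders, where the maximum possible h is used)."""
    n = len(freqs)
    h = min(h, i, n - 1 - i)           # maximum possible h at the borders
    if h == 0:
        return 0.0
    slope = (freqs[i + h] - freqs[i - h]) / (2 * h)
    return math.degrees(math.atan(slope))

# Toy curve: a long flat region followed by a sharp rise. An n-gram sitting
# on the rise is "prominent" (large angle); one in the flat region is not.
curve = [0.001] * 200 + [0.001 + 0.5 * k for k in range(1, 101)]
flat_angle = tangential_angle(curve, 100)     # inside the flat region
steep_angle = tangential_angle(curve, 290)    # inside the rising region
```

The coarse grid (h = 50) is what smooths away low-amplitude frequency jitter, so that only genuinely prominent n-grams receive a large angle.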

Setup of the experiments
For our experiments, we use the implementation of See et al. [28]'s pointer-generator model in the OpenNMT toolkit [18], which allows for the adaptation of the model to the task of CE along the lines described in Section 3.1 above. Instead of the attention mechanism used in [28], we use the default OpenNMT attention [22], since it proved to perform better. The model has 512-dimensional hidden states and 256-dimensional word embeddings shared between encoder and decoder. We use a vocabulary of 50K words, as we rely mostly on the copy mechanism, which uses a dynamic vocabulary made up of the words of the current source sequence. We train using Stochastic Gradient Descent on a single GeForce GTX 1080 Ti GPU with a batch size of 64. We trained the CE-adapted pointer-generator networks with two and three bi-LSTM layers for 20K and 100K training steps on the two training datasets (obtained using Google Books and DBpedia Spotlight, respectively; see above). Validation and saving of checkpoint models were performed at each one-tenth of the number of training steps. In order to compare our extended pointer-generator model with state-of-the-art techniques, several efficient entity extraction algorithms were chosen as baselines: OLLIE [23], AIDA [33], AutoPhrase+ [29], DBpedia Spotlight [5], WAT [26], and several state-of-the-art NER models, namely SpaCy NER [17], FLAIR NER [1], and two deep learning-based NER models [6,19]. AutoPhrase+ was used in combination with the StanfordCoreNLP PoS-tagger (as it was reported to show better performance with PoS tags) and was trained separately on its default DBLP dataset and on the above-mentioned raw Wikipedia texts of which our training dataset is composed. Its output was slightly modified by removing auxiliary tokens from the beginning and end of each phrase to make it more competitive with the rest of the algorithms. OLLIE's and SpaCy's outputs were modified in the same way, which improved their performance.
DBpedia Spotlight was applied with two different values of confidence coefficient: 0.5 (default value) and 0.1, which increases the recall.
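For reference, a call to DBpedia Spotlight with an explicit confidence value can be sketched as follows (we only construct the request here; the public endpoint URL and the `text`/`confidence` parameter names reflect the publicly documented annotate API, and the availability of the service is an assumption):

```python
from urllib.parse import urlencode

def build_spotlight_request(text, confidence=0.5,
                            endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    """Build the URL of a DBpedia Spotlight /annotate request; lowering
    `confidence` (e.g., to 0.1) increases recall at the cost of precision."""
    return f"{endpoint}?{urlencode({'text': text, 'confidence': confidence})}"

url = build_spotlight_request(
    "Natural language understanding with Bloom embeddings", confidence=0.1)
```

A JSON response is obtained by issuing a GET request for this URL with an `Accept: application/json` header.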
The performance is measured in terms of precision, recall, and F1-score, aiming, first of all, at high recall. Since positive ground truth examples are sparse and there are no negative examples, we treat only those detected concepts that partially overlap with ground truth concepts as false positives. Concepts that have the same spans as ground truth concepts are counted as true positives, and missed ground truth concepts as false negatives. This perfectly meets our goal of detecting exact matches. It also allows us to penalize brute-force high-recall algorithms that produce a large number of nested concepts, which are of limited use in real-world applications. Table 2 shows the performance reached on the domain-specific datasets ("Architecture" and "Terrorist groups"), and Table 3 on the open-domain set. The sign "*" stands for modifications that cut some first and last words of the detected concepts in order to present them as "canonic" noun phrases, and "**" stands for removing nested concepts when this procedure gave better scores. Setup (A) denotes our annotation algorithm (see Step 3 of the compilation of the training corpus), where the values in parentheses correspond to α_min1 and α_min2, which gave the best scores on the development set. Table 3 displays the scores for two different experiment runs. In the first, only concepts with an assigned WordNet type label in our typed Wikipedia dataset (in their majority, named entities; cf. [27] for details of the typification) were considered as ground truth; in the second run, all ground truth concepts were taken into account.

11 FRED [10] was not used as a baseline, as it is not scalable enough for the task: its REST service strongly limits the number of possible requests per day, and it fails on processing long sentences (approximately more than 40 tokens).
12 https://github.com/glample/tagger
13 https://github.com/kyzhouhzau/BERT-NER
PG(2L,18K) and PG(3L,80K) stand for pointer-generator networks with the parameters shown in parentheses, chosen using the development set (2 layers and 18K out of 20K training steps, and 3 layers and 80K out of 100K training steps, respectively).
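The evaluation scheme described above can be sketched as follows (our reimplementation of the stated counting rules, assuming character-offset spans): exact-span matches are true positives, partial overlaps count as false positives, and missed ground truth concepts as false negatives.

```python
def evaluate(predicted, gold):
    """Compute precision, recall, and F1 under the stated counting scheme:
    spans identical to a gold span are true positives; predicted spans that
    only partially overlap a gold span are false positives; gold spans with
    no exact match are false negatives. Spans are (start, end) pairs."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    overlaps = lambda a, b: a[0] < b[1] and b[0] < a[1]
    # Predicted spans with no overlap at all are ignored, since there are
    # no annotated negative examples to judge them against.
    fp = sum(1 for p in pred - gold if any(overlaps(p, g) for g in gold))
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One exact match, one nested (partially overlapping) span, one missed
# gold concept: nested spans are penalized as false positives.
p, r, f1 = evaluate(predicted=[(0, 15), (4, 15)], gold=[(0, 15), (20, 30)])
```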
To compare the performance of our model with state-of-the-art NER, we applied it to two common public datasets for NER (CoNLL-2003 and GENIA). Table 4 shows the results on the CoNLL-2003 dataset for two variants of our model (Setups B and C) trained on our large training set, without any further NER adaptation, as well as for their updated versions (Setups I, J, and K), which were fine-tuned with the training set of the shared task (CoNLL_T), contrasted with the results of the two genuine state-of-the-art NE recognizers [19] and [6] and DBpedia Spotlight. It should be noted that NER is a concept extraction subtask that aims at detecting less generic concepts. Consider the following statistics for clarity: of about 69K nouns in the CoNLL-2003 training set, only 31K are part of NEs (e.g., S&P, BAYERISCHE VEREINSBANK, London Newsroom, Lloyds Shipping Intelligence Service), while the remaining 38K nouns (as in "air force", "deposit rates", "blue collar workers") are not; as for GENIA, of about 132K nouns, only 93K form NEs (e.g., "tumor necrosis factor", "terminal differentiation", "isolated polyclonal B lymphocytes"), while the remaining 39K do not (as in "colonies", "interpretation", "notion", "circular dichroism", "differential accumulation"). Table 5 shows the results of our models fine-tuned with GENIA along with the results of concept identification by the recently published model [30], which provides the most promising scores on different GENIA tasks.
14 https://github.com/ufal/acl2019_nested_ner

Tables 2 and 3 show that a combination of the different variants of the proposed pointer-generator model, which do not rely on external dictionaries once trained (cf. Setup D), outperforms nearly all other models in terms of recall and F1-score, including the dictionary lookup-based DBpedia Spotlight, which is hard to beat, as it was applied to "known" data. However, a combination of the pointer-generator model with DBpedia Spotlight is even better: it outperforms DBpedia Spotlight by nearly 10%. In other words, a deep model combined with a DBpedia lookup is the best solution for generic CE. This applies to both runs displayed in Table 3, although it is to be noted that all tested models show a lower performance in the discovery of non-named entities. In particular, the NER models expectedly suffer a dramatic drop in recall. As far as precision is concerned, DBpedia Spotlight on its own is considerably better than any other proposal on the two small domain-specific test sets, while AIDA is best on the open-domain test set. This is to be expected for dictionary lookup-based strategies. Also, as expected, DBpedia Spotlight applied with a confidence coefficient of 0.1 showed significantly better recall than with the default value of 0.5, although its F1-score was lower. The experiment on the CoNLL-2003 dataset shows that the proposed model for generic CE performs well even without any special adjustment (F1 = 0.80-0.82). It can further be fine-tuned to a specific dataset, resulting in scores comparable to the state of the art, even though it was not designed specifically for the NER task (F1 = 0.93-0.94), while its overall CE performance is better than that of the targeted NER models (compare, e.g., (B)+(C) with Lample et al. (2016)'s NER in Tables 2 and 3).

Conclusions
We presented an adaptation of the pointer-generator network model [28] to generic open-domain concept extraction. Due to its capacity to cope with OOV concept labels, it outperforms dictionary lookup-based CE such as DBpedia Spotlight or AIDA in terms of recall and F 1 -score. It also shows an advantage over deep models that focus on NER only since it also covers non-named concept categories. However, a combination of the pointer-generator model with DBpedia Spotlight seems to be the best solution since it takes advantage of both the neural model and the dictionary lookup. In order to facilitate a solid evaluation of the proposed model and compare it to a series of baselines, we utilized Wikipedia pages with text snippet links as a sparsely concept-annotated dataset.
To ensure that our model is capable of extracting all generic concepts, instead of detecting only the texts of the page links, we ignored this sparse annotation during training. Instead, we compiled a large, densely concept-annotated dataset to leverage within the distant supervision, using the algorithm described above.
To the best of our knowledge, no such dataset has been available so far. In the future, we plan to address the problem of multilingual concept extraction, using pre-trained multilingual embeddings and compiling another large dataset that contains a higher percentage of non-named-entity concepts. The code for running our pretrained models is available in the following GitHub repository: https://github.com/TalnUPF/ConceptExtraction/.