Published January 1, 2018 | Version v1
Other Open

Concept Extraction with Convolutional Neural Networks


For knowledge management purposes, it would be useful to classify and tag documents automatically based on their content. Concept extraction is one way of achieving this, using statistical or semantic methods. Although index-based keyphrase extraction can extract relevant concepts for documents, the inverse document index grows exponentially with the number of words that candidate concepts can have. To address this issue, the present work trains convolutional neural networks (CNNs) to decide whether an N-gram (i.e., a consecutive sequence of N characters or words) is a concept or not, from a training set with labeled examples. The classification training signal is derived from the Wikipedia corpus, for which it is known whether an N-gram represents a concept. The CNN input feature is the vector representation of each word, derived from a word embedding model; the output is the probability that an N-gram represents a concept. Multiple configurations of vertical and horizontal filters were analyzed and tuned through a hyper-parameterization process. The results demonstrated a precision of between 60 and 80 percent on average. This precision decreased drastically as N increased. However, combined with a TF-IDF based relevance ranking, the top five N-gram concepts for Wikipedia articles showed a high precision of 94%, similar to part-of-speech (POS) tagging for concept recognition combined with TF-IDF. The CNN seems to prefer sequences of N-grams as identified concepts, and can thus identify sequences of words normally ignored by other methods. Furthermore, in contrast to POS filtering, the CNN method does not rely on predefined rules, and could thus provide language-independent concept extraction.
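The pipeline described in the abstract (generate candidate N-grams, keep those a classifier considers concepts, rank the survivors by TF-IDF, return the top five) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `word_ngrams` and `top_concepts`, the 0.5 threshold, the maximum N of 3, and the smoothed IDF formula are all hypothetical choices, and `concept_prob` is a placeholder callback standing in for the trained CNN.

```python
import math
from collections import Counter
from typing import Callable, Iterable


def word_ngrams(tokens: list[str], max_n: int) -> Iterable[tuple[str, ...]]:
    """Yield every consecutive word sequence of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def top_concepts(
    doc: list[str],
    corpus: list[list[str]],
    concept_prob: Callable[[tuple[str, ...]], float],
    max_n: int = 3,          # assumed upper bound on N
    threshold: float = 0.5,  # assumed decision threshold
    k: int = 5,              # "top five" as in the abstract
) -> list[str]:
    """Rank the N-grams of `doc` that pass the concept filter by TF-IDF."""
    # Term frequency of each candidate N-gram in the target document.
    tf = Counter(word_ngrams(doc, max_n))
    # N-gram sets per corpus document, used for document frequency.
    corpus_sets = [set(word_ngrams(d, max_n)) for d in corpus]

    scored = []
    for gram, count in tf.items():
        # Placeholder for the CNN: probability that the N-gram is a concept.
        if concept_prob(gram) < threshold:
            continue
        df = sum(gram in s for s in corpus_sets)
        # Smoothed IDF (an assumption; the abstract does not give the formula).
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1
        scored.append((count * idf, gram))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [" ".join(gram) for _, gram in scored[:k]]
```

In practice `concept_prob` would wrap a model that maps the word embeddings of the N-gram to a probability; any callable with the same signature can be substituted to experiment with the ranking step in isolation.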


+ Publication ID: hslu_56935 + Contribution type: Scientific media + Language: English + Last updated: 2019-01-22 18:08:41


