Lightweight Lexical and Semantic Evidence for Detecting Classes Among Wikipedia Articles

A supervised method relies on simple, lightweight features in order to distinguish Wikipedia articles that are classes (Shield volcano) from other articles (Kilauea). The features are lexical or semantic in nature. Experimental results in multiple languages over multiple evaluation sets demonstrate the superiority of the proposed method over previous work.


INTRODUCTION
Motivation: Enjoying continuous growth through collaborative contributions by human editors in hundreds of languages, Wikipedia is a key resource in efforts to organize the world's knowledge into large, open-domain knowledge repositories. A variety of knowledge repositories [1,17,26], including Freebase [3], Wikidata [43], Knowledge Graph [39] and Concept Graph [46], derive at least their initial, core knowledge from semi-structured textual content available in Wikipedia articles. Wikipedia and knowledge repositories derived from it are useful in a variety of tasks pertaining to knowledge acquisition from text [17,25,45-47], text analysis [24,34,35] and information retrieval [4,8,18,22,38,41], including commercial Web search, where they may help transform search results from sets of hyperlinks to relevant documents into sets of concepts directly relevant to users' queries [39].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. WSDM '19,
Most Wikipedia articles correspond to concepts that are instances ("Kilauea") as opposed to classes ("Volcano"), in part due to the encyclopedic nature of Wikipedia. But many language editions of Wikipedia contain hundreds of thousands of articles each. This makes the subset of Wikipedia articles that are likely classes significant in size, even when compared to resources such as WordNet [12], which focus specifically on representing not instances but classes.
Neither Wikipedia nor other larger knowledge repositories derived from it distinguish articles that are classes. Contributions: The method proposed in this paper relies on simple, lightweight lexical features collected from the text of Wikipedia articles, as well as semantic features from outside of Wikipedia, as evidence towards detecting a subset of articles that are classes. The features are applicable to English and other languages. They are inexpensive to collect. The features do not require linguistic preprocessing tools such as part of speech taggers, named entity recognizers, syntactic or semantic parsers. They are either lexical, expected to apply horizontally, widely across articles independently of their domains; or semantic (knowledge-based), expected to apply vertically, narrowly only within limited domains. Over various combinations of existing evaluation sets being used as training vs. test data, the method acquires classes at better levels of trade-off between precision and recall than those achieved by a recently-introduced method.

DETECTION OF CLASSES

Task
Classes: Classes are placeholders for sets of instances that share common properties. A class such as "Shield volcano" is a placeholder for a set of instances such as "Kilauea" and "Hofsjökull". In contrast, "Kilauea" is an instance and not a class, since it cannot act as a placeholder for any other set of its own instances. Classes ("Shield volcano") may be specializations of other classes ("Volcano"), if the instances of the former share more properties, in addition to the properties shared by the instances of the latter. Through specialization, classes effectively organize the set of all possible instances into a hypothetical conceptual hierarchy, whose leaf concepts at the bottom are instances, whereas intermediate concepts would be iteratively more general classes. Task: As a consequence of its encyclopedic nature, the very large majority of articles in Wikipedia correspond to concepts that are instances ("Kilauea", "Hofsjökull") as opposed to classes ("Shield volcano"). In random samples, as many as 97 out of 100 Wikipedia articles may be instances [27]. But even if instances dominate, that still leaves what we estimate to be low hundreds of thousands of articles in Wikipedia that might be classes. That is a much larger and finer-grained potential set of classes than, e.g., the small number of types available in the type hierarchy in Freebase [3]. Unfortunately, Wikipedia does not distinguish articles that are classes from those that are not. Neither do large knowledge graphs such as [39,46], which otherwise rely heavily on creating and maintaining internal concepts for most if not all Wikipedia articles. The goal of the method being proposed here is the selection of as many Wikipedia articles that are classes as accurately as possible, out of all Wikipedia articles.

Applications
Enriching Knowledge Repositories: The detection of Wikipedia articles that are classes can be applied immediately and transitively to concepts in large knowledge repositories [3,39,46] that were created from and correspond to those Wikipedia articles. Expansion of Lexical Dictionaries: Due to the high cost of manual maintenance and expansion, valid open-domain concepts may be missing from expert-created lexical resources like WordNet. Wikipedia articles extracted as classes represent an inexpensive source of high-quality candidate concepts (e.g., "Wooden roller coaster", "Polar filament") for manual insertion into future, expanded versions of such lexical resources. Topic Decomposition: By decomposing potentially compositional Wikipedia articles (e.g., "Photochemical logic gate") into one or more constituent articles ("Photochemistry", "Logic gate"), the meaning of a compositional article can be approximately defined in aggregate from the meaning of the constituent articles [6]. Existing methods [28] for decomposing Wikipedia articles lack "additional signals to better distinguishing between fully compositional and noncompositional" articles (cf. [28]). Wikipedia articles that are classes contribute towards such a signal. Articles that are classes are more likely to be compositional, whereas articles that are not classes ("Joseph Lovering House") are more likely to be non-compositional. Wikipedia Hierarchies: Edges in hierarchies constructed over Wikipedia articles [13,16] are hypernymy relations from more specific articles to more general categories or articles. However, recent hierarchies do not contain edges from "Mahmoud Hashemi Shahroudi" and "Jaldapara National Park", on one hand, to "Chief Justice of Iran" and "Reserve forest", on the other hand. In fact, the articles "Chief Justice of Iran" and "Reserve forest" are not among the intermediate nodes in the hierarchies. 
But they are extracted as classes by the method being proposed here (e.g., based on the presence of the plural-form category "Category:Chief Justices of Iran" for the article "Chief Justice of Iran"), thus pointing to potential gaps in hierarchies extracted from Wikipedia.

METHOD

Lexical Features Within Wikipedia
Clues in Wikipedia Articles: By design, the proposed method for detecting Wikipedia articles that are classes relies on simple, shallow analysis of occurrences of the article title ("Shield volcano") within the article text. The analysis consists simply in searching, among such occurrences, for three types of clues: 1) contexts surrounding the occurrences in article text, which match one of a few, simple contextual patterns; 2) morphological variation among different occurrences; and 3) presence of lowercase occurrences. The clues apply to English as well as to other, though not all, languages.
Lexical Clue 1: Pre-defined contextual patterns: The context around an occurrence of the article title in text is sometimes strongly suggestive of the article title being a class. Specifically, an occurrence ("[..] shield volcano [..]") preceded by an indefinite article ("[..] a shield volcano [..]") may indicate that the occurrence is a countable noun, especially if the sequence is part of what might be a definition of the concept described in the article ("[..] A shield volcano is a type of [..]"). Such definitions can be expected to appear in at least some Wikipedia articles, either towards the beginning of the article, to quickly define the concept before diving into its details; or later in the article. Concretely, if the case-insensitive occurrence O of the article title in a sentence from the article matches one of the following language-specific patterns, then the occurrence is taken as evidence that the article ("Shield volcano") might be a class: [...]
Lexical Clue 2: Morphological variation: [...] or both. Such morphological variation is taken as evidence that the article ("Volcan bouclier") might be a class. Referring to a concept in plural implies that there can be more than one instance of the concept, which suggests that the concept is a class. Conversely, the absence of plural forms ("Kilaueas") suggests that the concept ("Kilauea") is not a class.
Lexical Clue 3: Capitalization: Lowercase occurrences of the article title within sentences ("[..] es un volcán en escudo que [..]") suggest that the article ("Volcán en escudo") might be a class. In contrast, mixed-case occurrences ("[..] Con el tiempo, Kilauea se edificó [..]") suggest that the article ("Kilauea") might not be a class.
Features from Wikipedia Articles: From the three types of clues, several counts are computed as features for each Wikipedia article, over the occurrences of the article title within the text of the article: a) C_1 (contextual pattern match) is the count of case-insensitive occurrences that match a pre-defined contextual pattern; b) C_2 (identity), C_3 (plural) are the counts of case-insensitive occurrences in identical vs. plural form; c) C_4 (mixed case), C_5 (lowercase) are the counts of case-sensitive occurrences in mixed case vs. lowercase; d) C_6 (mixed-case plural), C_7 (lowercase plural) are the counts of case-sensitive occurrences of plural forms in mixed case vs. lowercase; and e) C_8 (plural category) is the count of case-insensitive parent Wikipedia categories [32] of the article ("Shield volcano") that are plural forms ("Category:Shield volcanoes") of the article title. A few binary features are also derived from existing count features that are expected to provide relatively stronger (i.e., more reliable) clues that the article is a class: f) is-positive(C_1 (contextual pattern match)) and is-positive(C_7 (lowercase plural)) are set to 1 or 0, depending on whether the respective counts are positive or 0. Since articles have at most one similarly-titled parent category in plural form in Wikipedia, C_8 (plural category) already acts as a binary feature. The features are inspired by rules and heuristics proposed in [26,27].
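The count features above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the single English pattern template (an indefinite article followed by the title and "is"), and the choice to compute C_4 and C_5 over identical-form occurrences only are assumptions made for illustration. The plural forms are passed in by the caller, and C_8 is omitted because it depends on category data rather than on article text.

```python
import re


def _occurrences(form, text):
    """Case-insensitive whole-word matches of `form`, returned as the matched strings."""
    pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]


def lexical_features(title, text, plural_forms, context_templates):
    """Compute illustrative versions of C_1..C_7 and the two is-positive features.

    `context_templates` are hypothetical regex templates in which the
    placeholder {O} stands for the article title.
    """
    identical = _occurrences(title, text)
    plural = [o for form in plural_forms for o in _occurrences(form, text)]

    c1 = 0
    for template in context_templates:
        pattern = template.format(O=re.escape(title))
        c1 += len(re.findall(pattern, text, re.IGNORECASE))

    def is_lower(s):
        return s == s.lower()

    feats = {
        "C1": c1,                                        # contextual pattern match
        "C2": len(identical),                            # identity
        "C3": len(plural),                               # plural
        "C4": sum(not is_lower(o) for o in identical),   # mixed case (assumed scope)
        "C5": sum(is_lower(o) for o in identical),       # lowercase (assumed scope)
        "C6": sum(not is_lower(o) for o in plural),      # mixed-case plural
        "C7": sum(is_lower(o) for o in plural),          # lowercase plural
    }
    feats["is_positive_C1"] = int(feats["C1"] > 0)
    feats["is_positive_C7"] = int(feats["C7"] > 0)
    return feats


# Hypothetical English template: "a/an <title> is".
EXAMPLE_TEMPLATES = [r"\ban?\s+{O}\s+is\b"]
```

The sketch counts sentence-initial capitalized occurrences as mixed case, which a more careful version might treat separately.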
In order to compute the features described above from Wikipedia articles in a given target language, language-specific prerequisites consist in the creation of contextual patterns, possibly using existing patterns in other languages as inspiration; and the identification of frequent, though not necessarily complete, rules for plural noun formation in the target language. The pre-requisites are relatively inexpensive and easy to satisfy for many target languages other than English. Once the pre-requisites have been satisfied, the actual automatic computation of the set of simple, numerical features from each Wikipedia article in the target language is lightweight, inexpensive and fast. It does not require any linguistic processing tools to be available in the target language, whether for part of speech tagging, parsing etc.

Semantic Features Outside Wikipedia
Lexical vs. Semantic Features: Intuitions presented and features collected so far are lexical. They apply horizontally, across all Wikipedia articles. They are independent from the underlying domains (e.g., geology vs. entertainment vs. finance) or categories (e.g., volcanoes vs. actors vs. index funds) under which the various topics might belong. In contrast, semantic features do not generalize across domains or categories. Instead, they are expected to apply only to possibly-narrow, vertical slices through the space of all topics. With an eye towards existing large knowledge repositories and the kind of knowledge they contain, there are at least two types of semantic (knowledge-based) clues to consider. Semantic Clue 1: Hypernyms: If hypernym topics were available for a Wikipedia article, they would not necessarily be expected to help much in deciding whether the topic is an instance or class. Indeed, classes and instances may easily share hypernyms, such as "Kilauea" and "Shield Volcano" sharing the hypernym "Volcano", making the presence of a particular hypernym unreliable evidence, at best, on whether the topic is or is not a class. However, if an article ("Kilauea") were known to be an InstanceOf of a hypernym ("Volcano"), then the presence of the hypernym could be useful in deciding whether the article is an instance or a class. Such hypernyms, if available for a Wikipedia article, are taken as evidence towards determining whether the article might be a class. Semantic Clue 2: Properties: Similarly to hypernyms, the presence of certain properties known to apply to a Wikipedia article could be relevant, even if only in a narrow domain rather than across domains. Topics are likely to be instances and not classes, if they are known to have properties such as being located in a particular location such as "Hawaii"; or to have a certain date of birth such as "1936"; or be associated with a certain record label such as "Armada Music". 
Properties, if available for a Wikipedia article, are taken as evidence towards determining whether the article might be a class.
Source of Semantic Features: Considering Wikipedia and other knowledge repositories derived from Wikipedia, a prominent candidate source of hypernyms and properties of Wikipedia articles is Wikidata, for several reasons. First, Wikidata marks any Wikipedia articles that are equivalent to Wikidata topics explicitly, making it trivial to find Wikidata topics equivalent to Wikipedia articles and vice versa. Second, just like Wikipedia, Wikidata is an actively developed, growing resource that benefits from continuous human editing and curation. Third, Wikidata not only contains hypernyms explicitly, but distinguishes between InstanceOf and SubclassOf relations. Since its early stages, InstanceOf relations have been well represented among other types of relations in Wikidata [43]. In comparison, in Wikipedia, parent categories [...]
[...] Figure 1, given a Wikidata topic (e.g., Q188698 corresponding to "Kilauea"), the set of its InstanceOf hypernyms initially consists of Wikidata topics ("Volcano") that are right arguments of InstanceOf relations, if any, whose left argument is the given topic. The set is then expanded, by transitively collecting other Wikidata topics ("Volcanic landform", "Landform") that are right arguments of SubclassOf relations in Wikidata, whose left arguments are hypernyms ("Volcano") collected so far for the given topic. Thus, the path that connects a given Wikidata topic up to one of its InstanceOf ancestor topics consists of a first edge that must be an InstanceOf relation; and optionally additional edges that must be SubclassOf relations. The set of properties of a Wikidata topic is the set of predicates of relations, if any, whose left argument is the given topic in Wikidata. The predicates InstanceOf and SubclassOf are excluded.
It is only the predicates (record label), and not also the right arguments ("Armada Music") of the relations, that are collected as properties. As long as a property applies and there is some value for it as the right argument, what the actual value is is deemed irrelevant.
The properties and InstanceOf hypernyms of a given Wikidata topic (Q188698) are transferred to the Wikipedia article ("Kilauea") marked as equivalent to the Wikidata topic in Wikidata.
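The collection of InstanceOf hypernyms and properties described above can be sketched as follows. This is a minimal illustration over toy in-memory data, not code from the paper: the function names, the dictionary and triple representations, and the predicate label "located in" are assumptions, and topic labels stand in for Wikidata identifiers.

```python
def instance_of_hypernyms(topic, instance_of, subclass_of):
    """Collect hypernyms reached by one InstanceOf edge followed by zero or
    more SubclassOf edges.

    `instance_of` and `subclass_of` map a topic to the set of its direct
    right arguments for the respective relation.
    """
    hypernyms = set(instance_of.get(topic, ()))
    frontier = list(hypernyms)
    while frontier:
        current = frontier.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in hypernyms:  # also guards against cycles
                hypernyms.add(parent)
                frontier.append(parent)
    return hypernyms


def properties(topic, relations):
    """Collect predicates (not values) of relations whose left argument is `topic`."""
    excluded = {"InstanceOf", "SubclassOf"}
    return {pred for left, pred, _value in relations
            if left == topic and pred not in excluded}


# Toy data mirroring the "Kilauea" example; labels stand in for Wikidata ids.
INSTANCE_OF = {"Kilauea": {"Volcano"}}
SUBCLASS_OF = {"Volcano": {"Volcanic landform"}, "Volcanic landform": {"Landform"}}
RELATIONS = [
    ("Kilauea", "InstanceOf", "Volcano"),
    ("Kilauea", "located in", "Hawaii"),  # hypothetical predicate label
]
```

Because a topic with only SubclassOf edges has no initial InstanceOf hypernyms, the function returns an empty set for it, which matches the requirement that the first edge be an InstanceOf relation.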
The set of all properties or InstanceOf hypernyms, collected for one or more Wikipedia articles, is converted into a set of Wikidata-based binary features computed for each Wikipedia article. Each of the binary features indicates whether the respective property or InstanceOf hypernym was or was not collected for the respective Wikipedia article. The set of binary features represents semantic features, which can be added to the lexical features.
[...] [20]. Other loss functions or non-linear algorithms might be used. The features collected for various sets of Wikipedia articles are used as training or test data, as described later. An entry from the test data is classified as being a class vs. not a class, if the score computed by the classifier is positive vs. non-positive.
Data Sources: The experiments operate over the Wikipedia snapshot used in [27]. Disambiguation and redirect pages are discarded. Semantic features are extracted for each Wikipedia article from this snapshot, based on data from a snapshot of Wikidata from June 2018. For Wikipedia articles with no equivalent Wikidata topics in the Wikidata snapshot, their sets of semantic features are assumed to be empty.
Evaluation Sets: Three evaluation sets introduced in [27] serve as the source data for training and testing the proposed method. Each evaluation set consists of pairs of a Wikipedia article in English and a gold label indicating whether the article is a class or not.
The first evaluation set, S_W, is derived from Instance relations available in WordNet [12]. The set collects the left (more specific) arguments ("Mauna Loa") in such relations as gold non-classes (i.e., gold instances), and the right (more general) arguments ("Volcano") as gold classes; and maps those arguments to equivalent Wikipedia articles, if any, based on a pre-existing, manually-created set of such mappings [27]. The second and third evaluation sets are random samples of Wikipedia articles annotated manually. The random samples are documents (articles) from Wikipedia drawn either uniformly (S_D) or after query-based automatic filtering meant to reduce the effect of instances being much more numerous than classes in Wikipedia (S_Q). The three sets contain 5,735 (S_W), 2,000 (S_D) and 1,000 (S_Q) entries, divided into 547 and 5,188 (S_W), 73 and 1,927 (S_D) and 362 and 628 (S_Q) gold classes and gold non-classes respectively (cf. [27] for more details on the evaluation sets).
Training and Test Sets: The evaluation sets are employed as training data or test data, in various possible combinations. For example, one possible combination is to employ S_W as training data and S_Q as test data. Individual entries in the data serve as positive examples, if their gold label is class; or negative examples, if their gold label is instance. In any combination, entries that may be shared among the training set and the test set are removed from the training set but retained in the test set. There are only a few such shared entries, namely 3, 7 and 8 entries, for the combinations of S_W and S_D, S_W and S_Q, and S_D and S_Q respectively. Thus, regardless of which evaluation sets are selected as training vs. test sets, no entries appear in both the training and test sets.
At the same time, an evaluation set selected as a test set is always used in its entirety without changes, ensuring that any results computed over the evaluation set are directly comparable to results reported in previous work over the same evaluation set. Extraction Parameters: As its occurrences are identified in the article text, the article title is first normalized, to remove portions within parentheses, thus converting "Circuit (administrative division)" into "Circuit" for that purpose. Such portions are not consistently present vs. absent in titles of Wikipedia articles; they are similarly removed in previous methods operating over Wikipedia data [13,27,32]. Depending on the feature being computed, the occurrences may be case-insensitive or case-sensitive. Features that involve plural forms are computed based on a few approximate, not necessarily complete, simple rules for plural formation in the respective target languages. For example, plural forms are often formed by adding the suffix "-s" in French, English and Spanish, or by adding the suffix "-es" in the latter two languages. In English, the rules are complemented by lemmatization data from WordNet [12], thus accommodating irregular plural forms like "corpora" for "corpus" in the article title "Text corpus".
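The title normalization and approximate plural rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, only the last word of the title is pluralized, and the `irregular` dictionary stands in for the WordNet lemmatization data mentioned above.

```python
import re


def normalize_title(title):
    """Remove parenthesized portions, e.g. disambiguation suffixes."""
    return re.sub(r"\s*\([^)]*\)", "", title).strip()


def plural_candidates(title, irregular=None):
    """Generate approximate plural forms by pluralizing the last word.

    Adds the suffixes "-s" and "-es"; `irregular` optionally maps a
    lowercase singular word to its irregular plural.
    """
    irregular = irregular or {}
    head, _sep, last = title.rpartition(" ")
    prefix = head + " " if head else ""
    candidates = {prefix + last + "s", prefix + last + "es"}
    if last.lower() in irregular:
        candidates.add(prefix + irregular[last.lower()])
    return candidates
```

For example, `plural_candidates("Shield volcano")` yields both "Shield volcanos" and "Shield volcanoes"; over-generation is harmless here because candidates only matter if they actually occur in the article text. Titles whose head noun is not the last word ("Chief Justice of Iran") are not handled by this simplification.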
Only for semantic features, separately for each combination of a training set and a test set, features activated for fewer than five of the gold classes from the training set are discarded.
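The filtering step above, together with the conversion of collected properties and InstanceOf hypernyms into binary features, can be sketched as follows. The function names, the feature-string format and the list-based vector layout are assumptions for illustration; only the threshold of five gold classes comes from the text.

```python
from collections import Counter


def select_semantic_features(train, min_class_support=5):
    """Keep semantic features activated for at least `min_class_support`
    gold classes in the training set.

    `train` is a list of (set_of_semantic_features, is_class) pairs.
    """
    support = Counter()
    for feats, is_class in train:
        if is_class:
            support.update(feats)
    return sorted(f for f, n in support.items() if n >= min_class_support)


def binarize(feats, vocabulary):
    """Turn a set of semantic features into a 0/1 vector over `vocabulary`."""
    return [int(f in feats) for f in vocabulary]
```

Read literally, the rule counts support only among gold classes, so a feature seen exclusively on instances is discarded regardless of how often it occurs; the sketch follows that literal reading.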

Results with Lexical Features
Extraction over English Articles: Table 1 summarizes the performance of the proposed method, when trained and tested over features collected over Wikipedia articles in English. In the upper vs. middle vs. lower portions of the table, the test sets are the S_W, S_D and S_Q evaluation sets. The training sets are the other two evaluation sets or their union. Precision scores are above 0.9 across the various test sets. Recall scores over S_D are the lowest at around 0.6, followed by S_Q, which are lower than over S_W. The F1-scores exhibit the same trend, exceeding 0.7 for S_D and reaching almost 0.9 for S_W. For each test set, combining both of the other evaluation sets into a single training set brings only a small improvement in F1-scores, relative to using only one of the other evaluation sets.
Extraction over Articles in Other Languages: Table 2 gives detailed scores when the proposed method is tested on target languages other than English, namely French (in the upper portion of the table) or Spanish (in the lower portion). Test data uses features collected over Wikipedia articles in the target language rather than in English. In contrast, for each target language, training data is collected either from English articles; or from articles in the same target language, namely French or Spanish. The former choice corresponds to what one may refer to effectively as cross-language training and testing, whereas the latter involves same-language training and testing. [...] be interpreted as an indication that the method does not perform as well in other languages as it does in English. However, closer inspection reveals that the lower same-language scores are likely due simply to less training data being available in that setting.
Indeed, when collecting training data from French rather than English articles, the number of English articles in the training sets for which features can be collected via the target language is lower than the total number of entries in the respective training sets by 6%, for S_W; by almost 3 times, for S_Q; and by more than 4 times, for S_D. In other words, when collecting training data in the target language rather than in English, a large fraction of the evaluation sets used as training sets is lost, when training on S_D; but relatively little is lost, when training on S_W. Not surprisingly, in Table 2, switching from cross-language training and testing to same-language training and testing reduces recall the most precisely when training on S_D (and testing on S_W or S_Q). For example, in the case of testing in French, changing training data from English to French, while consistently testing in French, changes recall from 0.744 to 0.558, when training on S_D and testing on S_W.
Comparison to Baseline Methods: The proposed, supervised-learning method (denoted L_rn) is compared in Table 3 against several baselines.
The first baseline (denoted B_wd) extracts a set of Wikipedia articles as classes, based solely on data available in Wikidata. In Wikidata, more specific topics are connected to more general topics via InstanceOf or SubclassOf relations. The baseline decides whether a Wikipedia article is a class or not, based on whether the Wikipedia article is equivalent (in Wikidata) to a Wikidata topic that is the right (more general) argument of some InstanceOf or SubclassOf relation in Wikidata. For example, one of the InstanceOf relations in Wikidata connects the topic Q308801 ("Achtung Baby") to Q482994 ("Album"). The baseline collects the Wikipedia article marked in Wikidata as equivalent to the right argument Q482994 of the relation, namely the article "Album", as being a class. A possible variant would be to additionally require the collected Wikipedia article to appear as the right argument in a minimum number of distinct InstanceOf relations, with higher thresholds expected to give higher precision at the expense of lower recall.
The second baseline method (denoted B_rb), introduced in [27], is a rule-based method that, based on occurrences of the title of a Wikipedia article in the article text, decides whether the article is a class or not. Note that, by comparing against this baseline, the proposed method is also transitively compared to a series of other baselines against which the baseline itself was compared in previous work; cf. [27] for descriptions of those other baselines (e.g., [26,50]) and their scores on the same evaluation sets.
[...] and proposed method (L_rn). The extraction is based on evidence collected either from Wikipedia articles in English only (En), or from Wikipedia articles in multiple languages (En∪Fr∪Es). The proposed method is trained on S_W ∪ S_Q.
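The Wikidata-only baseline and its thresholded variant can be sketched as below. The function name, the triple representation and the `min_instances` parameter are assumptions for illustration; the example identifiers are the ones quoted in the text.

```python
from collections import defaultdict


def wikidata_baseline(relations, topic_to_article, min_instances=0):
    """Return Wikipedia articles whose Wikidata topic is the right argument
    of some InstanceOf or SubclassOf relation.

    With `min_instances` > 0, the topic must additionally be the right
    argument of at least that many distinct InstanceOf relations.
    """
    general = set()
    instance_children = defaultdict(set)
    for left, pred, right in relations:
        if pred in ("InstanceOf", "SubclassOf"):
            general.add(right)
            if pred == "InstanceOf":
                instance_children[right].add(left)
    return {
        topic_to_article[t]
        for t in general
        if t in topic_to_article and len(instance_children[t]) >= min_instances
    }
```

Topics with no equivalent Wikipedia article in `topic_to_article` are skipped, since the baseline is evaluated over articles rather than over Wikidata topics.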
In the upper portion of Table 3, the methods rely on evidence collected from English articles. In the case of the proposed method, this means training and testing over features collected from English articles. Between the two baselines, B_wd gives lower scores than B_rb. The gap in recall scores is narrower over the S_W evaluation set but wider over S_Q and especially S_D. It suggests that B_wd can more easily extract relatively more general classes ("University") but has limited utility in extracting relatively more specific classes ("Neutron research facility"). Between the baselines and the proposed method, although the proposed method gives generally lower precision scores than the B_rb baseline, it compensates with disproportionately higher recall scores. It produces consistently higher F1-scores across the evaluation sets.
In the lower portion of Table 3, the methods take advantage of evidence collected simultaneously from English, French and Spanish (En∪Fr∪Es), in order to extract (or predict, in the case of the proposed method) articles in English that are classes. For this purpose, articles extracted as classes in each of the other target languages are first mapped to their equivalent articles in English, if any (or discarded, otherwise); then merged (via union) together with the articles extracted in English. For example, if a method extracted the articles "Bridge" in English, "Volcan bouclier" in French and {"Escudo de red", "Furgoneta"} in Spanish as classes, then the articles it would extract in English based on all languages simultaneously would be {"Bridge", "Shield volcano", "Van"}. For the B_wd baseline, extraction based on multiple languages is in fact no different than extraction from English alone. Indeed, Wikidata marks a Wikipedia article in English as equivalent to a topic in Wikidata, regardless of whether or how many Wikipedia articles in other languages are also equivalent to the same Wikidata topic. For the B_rb baseline as well as the proposed method, extraction based on multiple languages slightly increases the scores in Table 3. The proposed method gives higher F1-scores than the baselines B_rb and B_wd.
Table 5: Complete sets of English articles listed in Wikipedia as child articles of various parent categories, which are extracted as classes only by the baseline method B_rb or only by the proposed method L_rn trained on S_W ∪ S_Q.
Absolute Recall: Going beyond the evaluation sets used in experiments so far, Table 4 evaluates the impact of the proposed method (and the baseline methods) going back to the goal stated early on, namely identifying as many Wikipedia articles that are classes as accurately as possible. The table ignores accuracy and focuses instead on absolute recall.
It shows the fraction of all Wikipedia articles in English that are identified as classes. The proposed method has a significant advantage over the baselines in extracting more articles as classes, both when using evidence only in English and when using evidence in multiple languages. The table confirms that the proposed method has better recall than the baselines B_wd and B_rb, while other experimental results presented so far separately show that the B_rb baseline has no significant precision advantage over the proposed method.
Per-Category Comparative Recall: In another practical comparison beyond the existing evaluation sets, Table 5 shows the complete sets of articles extracted as classes only by the B_rb baseline or only by the proposed method, out of all child articles listed directly under a few categories in Wikipedia. For example, out of the 6 child articles of the parent category "Category:Astronomical objects", only the proposed method extracts the article "Deep-sky object" as a class. Note that the selected parent categories are not guaranteed to have only classes as child articles. For example, one of the child articles of the category "Category:Towers" is "Tower of Elahbel", which is not a class. In fact, another category "Category:21st-century actresses" is selected precisely because most, if not all, of its child articles are expected to not be classes, e.g., "Jelena Jovanova". Extracting any of them as classes would likely be incorrect. In a positive sign for both methods, neither of them extracts any additional (likely incorrect) child articles of the category "Category:21st-century actresses" as classes. For other categories in Table 5, the proposed method often manages to extract classes that the B_rb baseline cannot. The classes extracted only by the proposed method look encouragingly accurate.
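The multi-language setting used for the En∪Fr∪Es rows of the comparisons above merges per-language extractions into a single set of English articles. A minimal sketch of that merging step follows; the function name, the language codes and the toy cross-language mapping are assumptions, chosen to mirror the "Bridge" / "Volcan bouclier" / "Furgoneta" example given earlier.

```python
def merge_across_languages(extracted, to_english):
    """Union English extractions with the English equivalents of articles
    extracted in other languages; articles without an equivalent are dropped.

    `extracted` maps a language code to a set of article titles;
    `to_english` maps (language, title) to the equivalent English title.
    """
    merged = set(extracted.get("en", ()))
    for lang, articles in extracted.items():
        if lang == "en":
            continue
        for article in articles:
            equivalent = to_english.get((lang, article))
            if equivalent is not None:
                merged.add(equivalent)
    return merged


# Toy mapping for illustration; "Escudo de red" is assumed to have no
# English equivalent, so it is discarded.
EXTRACTED = {
    "en": {"Bridge"},
    "fr": {"Volcan bouclier"},
    "es": {"Escudo de red", "Furgoneta"},
}
TO_ENGLISH = {
    ("fr", "Volcan bouclier"): "Shield volcano",
    ("es", "Furgoneta"): "Van",
}
```

Because the merge is a union, adding languages can only add extracted articles, which is consistent with recall increasing slightly in the multi-language rows.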
Relative Feature Contribution: Separate ablation experiments temporarily disable subsets of features pertaining to a) contextual patterns (C_1, is-positive(C_1)); b) morphological variation (C_3, C_6, C_7, C_8, is-positive(C_7)); and c) capitalization (C_5, C_7). When compared to enabling all features, ablation reduces F1-scores in English by 15.5% (a), 21.4% (b) or 21.9% (c) respectively on average, when using either S_Q or S_W as training set, and S_D as test set. The results show that the different types of clues and associated features all contribute towards overall performance.
Discussion: As it computes features based on occurrences of the article title in the text of the article, the proposed method assumes that such occurrences exist and are indeed mentions of the concept being described in the article. The method is likely to perform worse when either of these assumptions does not hold. First, for shorter articles, too many of the features collected from the articles may be zero for them to be useful in determining whether the articles are classes or not. If the articles are equivalent to other, better fleshed-out articles in other languages, they may still be extracted as classes based on evidence in those other languages. Although the English article "Vereda" is not extracted as a class based on evidence in English, it is still extracted as a class based on evidence in Spanish, since its equivalent article "Vereda" is extracted as a class in Spanish. Conversely, although "Abbesse" is not extracted as a class in French based on evidence in that language, it is still extracted as a class based on evidence in English, via its equivalent English article "Abbess". But for articles ("Trade item") with no equivalent articles in other languages, the lack of enough evidence towards the computed features remains an issue.
Second, occurrences of the article title in article text are sometimes not really mentions of that concept but rather of some other concepts. The occurrences may be incorrectly interpreted as presence of any of the three types of lexical clues, namely contextual patterns, in "[..] A Funky Situation is the 21st studio album [..]" (for "Funky Situation"); morphological variation, in "[..] The UFOs can survive for far longer [..]" (for "UFO (TV series)"); or capitalization, in "[..] from scratch in a simple editor that is part of Scratch [..]" (for "Scratch (programming language)").
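The cross-language fallback mentioned in the first point of the discussion can be sketched as follows. This is an illustrative sketch under assumptions: the function name and the two data structures (per-language sets of extracted classes, and a mapping between equivalent articles) are hypothetical and not taken from the paper; the titles are the examples given in the text.

```python
def is_class_with_fallback(article, classes_by_lang, equivalents):
    """Return True if the article, or any equivalent article in another
    language, is extracted as a class from evidence in its own language.

    article:         (language code, title) pair
    classes_by_lang: language code -> set of titles extracted as classes
    equivalents:     (language, title) -> list of equivalent (language, title) pairs
    """
    lang, title = article
    if title in classes_by_lang.get(lang, set()):
        return True
    return any(
        other_title in classes_by_lang.get(other_lang, set())
        for other_lang, other_title in equivalents.get(article, [])
    )


# Assumed per-language outputs, mirroring the examples in the discussion.
classes_by_lang = {"en": {"Abbess"}, "es": {"Vereda"}, "fr": set()}
equivalents = {
    ("en", "Vereda"): [("es", "Vereda")],
    ("fr", "Abbesse"): [("en", "Abbess")],
}

print(is_class_with_fallback(("en", "Vereda"), classes_by_lang, equivalents))
print(is_class_with_fallback(("fr", "Abbesse"), classes_by_lang, equivalents))
print(is_class_with_fallback(("en", "Trade item"), classes_by_lang, equivalents))
```

An article such as "Trade item", which has no entry in the equivalence mapping, can only be extracted from evidence in its own language, which is the remaining issue noted above.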

Results with Semantic Features
Table 7: Examples of Wikipedia articles that are either no longer or continue to be decomposed by the method from [28], based on the absence vs. presence of the decomposed articles among the classes extracted by the proposed method when trained on S_W ∪ S_Q
Impact of Features from Wikidata: [..] clue that the topic to which it applies is an instance. But the Wikidata topic Q318541 has this property (with the value "Daniel Kirkwood"), although its equivalent Wikipedia article "Kirkwood gap" is clearly about a class and not an instance. In a somewhat similar phenomenon, this time affecting hypernyms instead of properties, both topics Q832799 ("Yellow Submarine") and Q922853 ("Campaign song") have the InstanceOf hypernym Q7366 ("Song") in Wikidata, although the Wikipedia article equivalent to Q922853 is clearly about a class of songs and not about a particular instance. Lexical and semantic features are inexpensive to compute. The former capture phenomena such as morphology and case. The latter are available thanks to the significant amount of decentralized human labor driving the growth and maintenance of Wikidata. Although semantic features alone are not as useful, adding them to lexical features does add value, albeit small. Semantic features may still prove useful if more training data becomes available. With or without semantic features, the results given by lexical features, which are superior to those of baselines such as B_rb, are encouraging.

Impact on Downstream Applications
Topic Decomposition: Restricting the decomposition method from [28] to Wikipedia articles that are extracted as classes by our proposed method increases the precision of decomposed articles that are indeed compositional. Table 7 shows examples that the method from [28] still or no longer decomposes, based on their presence among extracted classes.
Class Categories and Article Hierarchies: Applications may benefit from Wikipedia categories [16,22] in addition to articles. Articles extracted as classes by the proposed method can be used to extract Wikipedia categories that are classes. A simple procedure is as follows. First, all categories ("Category:Albums") whose names are identical to, or plural forms of, one of their child Wikipedia articles ("Album") that is extracted as a class are themselves extracted as classes. Second, their descendant categories (recursively, children of child categories) such as "Category:Armada Music albums", whose name chunks ("albums") are identical, ignoring case, to the ancestor category name, are also extracted as classes.
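The two-step category procedure can be sketched as below. This is a simplified illustration, not the authors' implementation. It makes two loud assumptions: plural forms are detected by naively appending "s" or "es" (English only), and a "name chunk" is taken to be the trailing words of the descendant category name. All function names, and the intermediate category "Category:Albums by record label" in the toy data, are hypothetical.

```python
PREFIX = "Category:"


def bare_name(category):
    """Strip the 'Category:' prefix, if present."""
    return category[len(PREFIX):].strip() if category.startswith(PREFIX) else category


def is_same_or_plural(category_name, article_title):
    # Assumption: naive English pluralization by appending "s" or "es".
    c, a = category_name.lower(), article_title.lower()
    return c in (a, a + "s", a + "es")


def seed_class_categories(child_articles, class_articles):
    """Step 1: categories named after one of their child articles that is a class."""
    seeds = set()
    for category, articles in child_articles.items():
        name = bare_name(category)
        if any(a in class_articles and is_same_or_plural(name, a) for a in articles):
            seeds.add(category)
    return seeds


def expand_to_descendants(seeds, child_categories):
    """Step 2: add descendants whose trailing name chunk matches the seed name."""
    extracted = set(seeds)
    for seed in seeds:
        head = bare_name(seed).lower()
        stack, seen = list(child_categories.get(seed, [])), set()
        while stack:
            category = stack.pop()
            if category in seen:
                continue
            seen.add(category)
            name = bare_name(category).lower()
            # Assumption: the matching chunk is the end of the category name.
            if name == head or name.endswith(" " + head):
                extracted.add(category)
            stack.extend(child_categories.get(category, []))
    return extracted


# Toy hierarchy; the intermediate category is a hypothetical example.
child_articles = {"Category:Albums": ["Album"]}
child_categories = {
    "Category:Albums": ["Category:Albums by record label"],
    "Category:Albums by record label": ["Category:Armada Music albums"],
}
seeds = seed_class_categories(child_articles, class_articles={"Album"})
print(sorted(expand_to_descendants(seeds, child_categories)))
```

The traversal continues through descendants that do not themselves match, so that a matching category nested under a non-matching one can still be reached.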
The procedure produces almost twice as many categories as classes as there are unique hypernym categories in the Wikipedia hierarchy from articles to categories constructed in [16]. Examples of categories extracted as classes here, but not extracted as hypernyms in [16], are "Category:Greek trance musicians" and "Category:20th-century Argentine painters".

RELATED WORK
WordNet [12] distinguishes classes from other concepts, by manually connecting instance concepts (synsets) up to their more general concepts through an (Instance) relation rather than a general-purpose hypernymy relation. Implicitly, WordNet concepts linked up through hypernymy rather than Instance relations are classes, just as concepts that are hypernyms of some other concepts are also classes. Cyc [21] makes distinctions between SetOrCollections, or classes, and Individuals. But only thousands of Cyc concepts are equivalent to any Wikipedia articles [50].
Wikipedia [36] does not distinguish articles that are classes. Articles are organized into fine-grained categories, which in turn are organized into iteratively coarser-grained categories. Collecting Wikipedia categories as classes would not be an adequate solution to identifying classes in Wikipedia. Wikipedia categories often do not correspond to classes. An intermediate step in [30] aims at distinguishing such Wikipedia categories, based on the coherence of the coarse-grained types available in DBpedia [1] for their descendant Wikipedia articles. The method from [50] also identifies classes among Wikipedia categories rather than articles. Since many Wikipedia articles do not have a corresponding, similarly titled Wikipedia parent category, evidence towards which categories are classes has limited utility towards distinguishing which articles are classes. Existing work where Wikipedia serves as the reference resource in some task tends to rely on articles rather than categories [2,5,14,29,35], with few exceptions [22].
Most methods [26,50] that distinguish classes in Wikipedia require access to a part-of-speech tagger, a syntactic parser or a named entity recognizer, and apply to English data only. In contrast, and similarly to the rule-based method introduced recently in [27], our method does not need access to linguistic processing tools and applies to multiple languages. Its results are superior to those of [27] and, transitively, to those of the other baselines evaluated in [27].

CONCLUSION
Playing an increasing role in enhancing Web search results, the largest knowledge repositories available rely on data available in and associated with Wikipedia articles. The method proposed in this paper associates Wikipedia articles with a type of information that is missing from existing knowledge repositories, namely distinguishing articles that are classes. Current work investigates the role of n-grams and syntactic dependencies as low-level features collected from article text in Wikipedia, as well as the role of evidence not just around occurrences of the article title ("Shield volcano") within the article, but also around disambiguated occurrences within other Wikipedia articles ("[..] Paka is a shield volcano located in [..]") and, more generally, within other Web documents.