Using Titles vs. Full-text as Source for Automated Semantic Document Annotation

We conduct the first systematic comparison of automated semantic annotation based on either the full-text or only on the title metadata of documents. Apart from the prominent text classification baselines kNN and SVM, we also compare recent techniques of Learning to Rank and neural networks and revisit the traditional methods logistic regression, Rocchio, and Naive Bayes. Across three of our four datasets, the performance of the classifications using only titles reaches over 90% of the quality compared to the performance when using the full-text.


INTRODUCTION
A significant amount of today's largest Knowledge Graph on the web, the so-called Linked Open Data cloud 1, consists of metadata about documents such as scientific papers and news articles. Domain-specific SKOS vocabularies are used to describe the semantics of these documents. SKOS (short for: Simple Knowledge Organization System) 2 is an established W3C standard for modeling thesauri in domains such as economics, politics, social sciences, and news. Those thesauri are often of high quality since they are manually crafted as well as maintained by domain experts, and made freely available on the web 3.
The challenge is to successfully use those SKOS thesauri to semantically annotate the documents. However, the full-text PDF of the documents may not be available (linked from the documents' metadata) or may not be legally accessible due to licensing or copyright issues (even though there is a link to the PDF). Thus, it is highly desirable to conduct a semantic annotation of the documents with the SKOS thesauri by using just the already published documents' metadata like the title, year, authors, etc. In contrast to the full-text of documents, the metadata is directly available on the Linked Open Data cloud, accessible in RDF format, and can be processed with no legal barriers for semantic annotation. Conducting semantic annotations by using only the title (or further metadata of the documents) is challenging, since the title is short and thus carries little information compared to the full-text. The process of semantic annotation is a multi-label classification task where not just one label is to be chosen as annotation but a set of labels, since many concepts of the SKOS thesauri are needed to appropriately describe the semantics of the documents.
We tackle the challenge of conducting a semantic multi-label classification into SKOS thesauri by using only the title metadata of the documents. To this end, we run an extensive series of experiments to compare established and recent methods from machine learning for document classification. The goal is to decide whether it is possible to reach a comparable classification performance when using only the title of the documents. It is noteworthy that all the compared approaches operate on the underlying machine learning level, which makes a comparison with prevalent end-to-end ontology tagging systems such as SOLR ontology tagger 4 and MAUI 5 difficult. We instead show that despite not using the hierarchical properties of the thesaurus, the presented methods outperform the best-performing methods that do make use of the hierarchy, such as the ones of our own prior work [8]. Apart from the well-known multi-label classification baselines k-nearest neighbors (kNN) and support vector machines (SVM), we revisit traditional text classification methods such as Naive Bayes, Rocchio, and logistic regression (LR). We also include the prominent Learning to Rank (L2R) approach, as well as a modern variant of neural networks motivated by the success of the Deep Learning field. Please note that the present work focuses solely on using the titles of documents, since they are the richest metadata attribute and contain keywords relevant in the domain. In the future, we may also incorporate other metadata like authors' names and publication year.
The results of our experiments show that it is possible to reach a competitive performance for semantic annotation using solely the title of documents, compared to exploiting the full-text of the documents. Using a sample-averaged F1 measure as evaluation metric, we compare the automated predictions of semantic annotations from different methods with those annotations provided by domain experts. We run our experiments over four large-scale document corpora of different origin and domain with a total of over 300,000 documents. All datasets offer professional labels, i.e., manual annotations from domain experts. Two datasets are from professional scientific libraries in economics and politics, while the other two datasets are the well-known news corpora from New York Times and Reuters. In the past, algorithms of the lazy learner family such as kNN used to dominate multi-label classification tasks on such datasets with a high amount of classes [8,25]. However, we show that eager learners such as logistic regression and feed-forward neural networks outperform lazy learners. Most eager learners have the benefit of O(N_parameters) time complexity to predict a label set for an unseen document, which is important when applying an automated semantic annotation process for on-the-fly enrichment of metadata on the Linked Open Data cloud. In contrast, lazy learners as well as Learning to Rank need to store and traverse O(N_training examples · N_features) space to predict the labels for a single new document at test time. Finally, focusing on the metadata also allows direct processing of data in published RDF format (e.g., the rdfs:Literal and rdfs:label information) without accessing the full-text of the documents at all. Overall, we conclude that eager learning algorithms are well-suited for automated semantic annotation of RDF resources in Linked Data. Summarized, the contributions of this work are: (1) To the best of our knowledge, the first large-scale systematic comparison of multi-label classifiers applied to either the full-text or only the titles of documents. (2) Results that show that eager learners such as neural networks and linear models outperform lazy learners even when a high amount of possible labels is considered. (3) We offer evidence that using only the title for high-dimensional multi-label classification is a reasonable choice for semantic annotation of resources where only metadata is available, such as documents modeled in RDF on the Linked Open Data cloud.
The remainder of the paper is organized as follows: Below, we present an overview of the state of the art in multi-label classification of text and related fields. In Section 3, we describe our experimental apparatus. We depict different methods for conversion of unstructured text to feature vectors in Section 3.1. The classifiers and their respective configurations are elaborated in detail in Section 3.2. We describe the four datasets used for our experiments as well as the evaluation metrics in Section 4. The results are presented in Section 5 and discussed in Section 6, before we conclude.

RELATED WORK
Most earlier work on the multi-label classification task with many possible output labels relies on nearest neighbor searches (kNN). Using the union of labels as well as separately voting for each individual label among neighbors is a common choice in these nearest neighbor-based classifiers [8,25,28,29,33]. Concept extraction [7] refers to explicitly finding known concept-specific phrases in the documents. The extracted concepts are re-weighted by inverse document frequency, as in the well-known TF-IDF [23] retrieval model. In our prior work [8], we have conducted an exhaustive comparison of concept extraction and feature re-weighting methods using kNN as a multi-label classifier.
Recent progress in the field of topic modeling with latent Dirichlet allocation [3] suggests using labeled variants [1,20,24] for multi-label classification. While these techniques outperform SVMs, we found in pre-experiments that they do not scale well regarding the number of considered labels. In the closely related field of (label) recommendation, Tuarob et al. [30] likewise applied topic models to obtain a ranking of the labels.
In the biomedical domain, the most popular approach is Learning to Rank [11,19]. The algorithm learns a ranking of the MeSH terms. In multi-label classification, however, a hard decision is necessary to enable fully automated classification. Thus, Learning to Rank is typically adjusted for multi-labeling by imposing a hard cut-off. There are also approaches that use Learning to Rank along with dynamic cut-off techniques [16]. The most prominent approach to adapt classifiers for multi-labeling is binary relevance [26,28]. Other options include the chaining [21] as well as stacking [9,27] of classifiers. While the former is not well-suited for high amounts of considered labels, we include a variation of the latter idea in our comparison. Bi and Kwok [2] approach the multi-label classification task from a different direction. They strive for more efficient multi-label classification and proper treatment of label correlation by transforming the label indicator matrix. Zhang and Zhou [32] have proposed to train a separate neural network for each label along with a dedicated loss function. However, this approach does not scale to high amounts of possible output labels. One year later, the same authors suggested a lazy-learning multi-label variant of kNN [33], which is considered in our comparison. Nam et al. [18] adapt fully connected feed-forward neural networks for multi-label classification by learning a threshold that determines whether a label should be assigned or not.
While the related fields of label recommendation and single-label text classification are broad, only a few works consider multi-label classification with a large amount of possible output labels. Among these, the dominant approaches are based on nearest neighbor searches, i.e., lazy learners and Learning to Rank. The considered works all use either short texts or full-text as input data but do not compare these two different input variants. Thus, we offer the first systematic comparison of text vectorization methods and lazy as well as eager learning algorithms for the multi-label classification problem with many possible labels, applied to either title data or full-text data.

SEMANTIC ANNOTATION APPARATUS
We present an end-to-end apparatus for semantic annotation of unstructured text. Figure 1 shows our generic text processing pipeline that we used for the experiments. Each path through the graph resembles a possible configuration. In the following Section 3.1, we describe the conversion from unstructured text to a vector representation. In Section 3.2, we elaborate in detail on the classification methods that we have compared.

Vectorization
Counting terms and extracting concepts. In the first step of our text processing pipeline, the raw text needs to be converted into a vector representation that can be supplied as input to the classifiers. As features, we use the counts of term occurrences in the text (TF) as well as the number of times a concept provided by a domain-specific thesaurus can be extracted from the text (CF). A concept is a set of concept-specific phrases. In case of the SKOS format, each concept has one preferred phrase (skos:prefLabel) and optionally a set of alternative phrases (skos:altLabel). We extract these concept-specific phrases from the text using a finite state machine. When there is more than one possible match in a sequence of words, we favor the longest phrase, assuming that longer phrases carry more specificity. The occurrences of a concept (a set of concept-specific phrases) are counted in the same way as term occurrences. The effect of concept extraction is to ensure that domain-specific synonyms encoded in the thesauri are mapped to the same concept. The concepts are also directly associated to the respective class labels. Hence, it is left to the learning algorithm to decide about the concrete label assignment, given the extracted terms or concepts.
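The longest-match extraction described above can be sketched as follows. This is a simplified illustration: the phrases and concept identifiers are invented, and a production implementation would likely use a compiled finite state machine rather than this dictionary lookup.

```python
def extract_concepts(tokens, phrase_to_concept):
    """Scan a token sequence and count concept occurrences.

    At each position, the longest matching concept-specific phrase wins,
    mirroring the longest-match rule of the finite state machine.
    """
    max_len = max((len(p) for p in phrase_to_concept), default=0)
    counts = {}
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in phrase_to_concept:
                concept = phrase_to_concept[phrase]
                counts[concept] = counts.get(concept, 0) + 1
                i += n  # consume the matched phrase
                matched = True
                break
        if not matched:
            i += 1
    return counts

# Illustrative thesaurus: preferred and alternative phrases map to one concept.
thesaurus = {
    ("monetary", "policy"): "c1",
    ("policy",): "c2",
    ("inflation",): "c3",
}
tokens = "the monetary policy affects inflation".split()
print(extract_concepts(tokens, thesaurus))
```

Note that "monetary policy" is counted as one occurrence of concept c1; the shorter phrase "policy" inside it is not counted separately, as the longest match consumes its tokens.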
Discounting frequent terms and concepts. Inverse document frequency (IDF) is a re-weighting scheme introduced in the 1970s by Salton and Buckley [23], which has proven to work well for information retrieval [15]. IDF discounts features that occur in many documents of the corpus and thus do not hold discriminative information. This applies to both term counts and counts of extracted concepts. Let D be the set of documents. Then the IDF re-weighted score for some term or concept w in a document is its raw count multiplied by idf(w) = ln((|D| + 1) / (|{d ∈ D : w ∈ d}| + 1)) + 1. To avoid division by zero, both the numerator and the denominator are incremented by one, as if there was one artificial document containing all possible terms and concepts. A document frequency of zero can occur because the set of concepts is given by the thesaurus, but the corpus itself might not cover all of these possible concepts. The logarithm as a whole is as well incremented by one, to ensure that words that appear in all documents are not completely discarded.
Okapi BM25 is an extension of IDF by Robertson et al. [22] that slightly modifies the IDF term to include the average length of a document. It offers two hyper-parameters for interpolating the difference between the current document length and the corpus-wide mean document length. The literature suggests using BM25 especially for fields with short texts, using hyper-parameters k = 1.6 and b = 0.75 [15]. Hence, variants of our text vectorization methods using BM25 instead of TF-IDF re-weighting are included in our comparison.
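A sketch of the BM25 term weight with the hyper-parameters cited above (k = 1.6, b = 0.75). The IDF component here is the smoothed variant from the previous paragraph; the exact BM25 variant used in the paper may differ in detail.

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, num_docs, doc_freq, k=1.6, b=0.75):
    """BM25 weight of one term: IDF times a saturated, length-normalized TF."""
    idf = math.log((num_docs + 1) / (doc_freq + 1)) + 1
    # b interpolates between no length normalization (b=0) and full (b=1).
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k + 1) / (tf + k * norm)
```

For a document of average length (norm = 1) and tf = 1, the weight reduces to the plain IDF value; longer documents are discounted.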
Combining terms and concepts. After re-weighting by either inverse document frequency or BM25, the resulting vectors are normalized to unit length (with respect to the L2-norm). This leads to desirable invariance to document length. Besides using only either the term frequency (TF) or the concept frequency (CF), we also concatenate the respective feature vectors (CTF).

Classification
In the second step of the pipeline, a classifier is consulted to predict the desired set of labels based on the vector representation of the input text (compare Figure 1). Given training data, the classifiers have the opportunity to learn how to associate the features with the respective class labels. Lazy learners merely copy their input at training time, shifting the main computational effort to test time (described in Section 3.2.1). On the other hand, eager learners use the training data for adapting their parameters according to the correct classification result. We describe those in detail in Section 3.2.2. Some of the learning algorithms are only designed for single-label classification (SVM, logistic regression, Naive Bayes); others only return a ranked list of possible labels (kNN, Rocchio, Learning to Rank). We describe the multi-label adaption strategies for both cases in Section 3.2.3.

Lazy Learners.
Nearest Neighbor Classifier. The most typical lazy-learning algorithm is k-nearest neighbors (kNN). All training examples are stored along with their class annotations. At test time, the k nearest neighbors with respect to some distance metric (we chose cosine) vote on class membership. For multi-label problems, variants are proposed that assign the union of label annotations in the neighborhood as well as conduct a separate vote for each label [25]. By auto-optimizing the k hyperparameter for these methods, we found k = 1 to be the optimal value in our setting (as in our prior work [8]). In this case, all multi-label variants coincide: they copy the label set from the nearest neighbor in the training set.
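The k = 1 case described above reduces to copying the label set of the most cosine-similar training document, which can be sketched as follows (vectors are sparse feature-to-weight dicts; the data is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def one_nn_labels(query, training):
    """training: list of (vector, label_set); copy labels of nearest neighbor."""
    best = max(training, key=lambda ex: cosine(query, ex[0]))
    return best[1]

train = [({"tax": 1.0, "policy": 0.5}, {"fiscal-policy"}),
         ({"neural": 1.0, "network": 1.0}, {"machine-learning"})]
print(one_nn_labels({"tax": 0.8}, train))
```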
Rocchio Classifier. The Rocchio classifier, or nearest-centroid classifier, is a light-weight modification of the nearest neighbor classifier. During training, only the centroid of each class is stored. The classification result is then determined by the nearest of these centroids at test time. In multi-label classification, however, the classifier is only capable of returning a ranked list of labels based on the distance to the respective centroids. As in the nearest neighbor classifier above, we use cosine distance as criterion.
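The training and ranking steps of the nearest-centroid classifier can be sketched as follows (dense list vectors for brevity; the data layout is an assumption for illustration):

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_rank(query, training, dim):
    """Rank labels by cosine similarity of the query to each label centroid.

    training: list of (vector, label_set) pairs; a document contributes to
    the centroid of every label it carries.
    """
    sums, counts = {}, {}
    for vec, labels in training:
        for lab in labels:
            acc = sums.setdefault(lab, [0.0] * dim)
            for i, v in enumerate(vec):
                acc[i] += v
            counts[lab] = counts.get(lab, 0) + 1
    cents = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}
    return sorted(cents, key=lambda lab: _cos(query, cents[lab]), reverse=True)
```

The returned ranked list still needs a multi-label adaption step (Section 3.2.3) to become a hard label assignment.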

Eager Learners.
Naive Bayes. The Naive Bayes classifier is one of the most traditional classifiers for text classification tasks. We consider two Naive Bayes variants, multinomial and Bernoulli. In the multinomial variant, the features of term or concept frequencies are assumed to be generated by a multinomial distribution. The Bernoulli variant only takes the occurrences of (binary) features into account, which leads to penalizing the non-occurrence of features. The Bernoulli variant is an intuitive approach for short texts such as titles, since duplicate words are rather infrequent, while the multinomial variant is more intuitive for full-texts. For both variants, we apply Lidstone smoothing with α = 10^−5. The main drawback of Naive Bayes is the assumption of statistical independence among the input features.
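A compact sketch of the multinomial variant with Lidstone smoothing (α = 10^−5 as above), reduced to two classes for readability; the data and vocabulary are illustrative, not from the paper's corpora:

```python
import math

def train_multinomial_nb(docs, labels, vocab, alpha=1e-5):
    """docs: list of token-count dicts; labels: list of 0/1 class ids.

    Returns per-class log term probabilities (Lidstone-smoothed) and log priors.
    """
    params = {}
    for c in (0, 1):
        counts = {t: alpha for t in vocab}  # Lidstone smoothing
        for doc, y in zip(docs, labels):
            if y == c:
                for t, n in doc.items():
                    counts[t] += n
        total = sum(counts.values())
        params[c] = ({t: math.log(n / total) for t, n in counts.items()},
                     math.log(labels.count(c) / len(labels)))
    return params

def predict(params, doc):
    def score(c):
        log_probs, prior = params[c]
        return prior + sum(n * log_probs[t] for t, n in doc.items())
    return max((0, 1), key=score)
```

In the full apparatus, such a single-class-decision classifier is turned into a multi-label one via binary relevance (Section 3.2.3).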
Linear Models. Generalized linear models [12] learn a weight vector w and a bias b, and compute a raw prediction p = w · x + b for a document vector x. For the loss function J, we consider two variants: the logistic loss J_logistic(y, p) = ln(1 + exp(−p · y)) as in logistic regression (LR), and the hinge loss J_hinge(y, p) = max(0, 1 − p · y) as in linear support vector machines (SVM). At test time, the binary decision is determined by the side of the hyperplane on which the document in question falls. We employ stochastic gradient descent as an optimizer for these generalized linear models, which is known to yield good generalization on large-scale datasets [4,6,34]. We apply the learning rate schedule η_t = 1 / (α · (t_0 + t)), where t_0 is chosen by a heuristic of Léon Bottou [5]. We average the weights w over time, which allows higher learning rates and leads to faster convergence [5]. In this setting, we empirically determined α = 10^−7 to be a good hyperparameter value for all datasets (in the range 10^−1, 10^−2, . . ., 10^−9). This leads to comparatively high initial learning rates and low regularization.
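The SGD training loop for one binary-relevance label can be sketched as below. The learning-rate schedule η_t = 1/(α(t_0 + t)) is our reconstruction of the schedule described above; t_0 is a fixed illustrative value rather than Bottou's actual heuristic, and the weight averaging mentioned in the text is omitted for brevity.

```python
import math
import random

def sgd_train(X, y, loss="logistic", alpha=1e-7, t0=1e6, epochs=20, seed=0):
    """Train one binary linear classifier with SGD; targets y are in {-1, +1}."""
    rng = random.Random(seed)
    dim = len(X[0])
    w, b, t = [0.0] * dim, 0.0, 0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (alpha * (t0 + t))  # decaying learning-rate schedule
            x, target = X[i], y[i]
            p = sum(wj * xj for wj, xj in zip(w, x)) + b
            margin = p * target
            if loss == "hinge":
                g = -target if margin < 1 else 0.0      # dJ_hinge/dp
            else:
                g = -target / (1.0 + math.exp(margin))  # dJ_logistic/dp
            for j in range(dim):
                w[j] -= eta * (g * x[j] + alpha * w[j])  # L2 regularization
            b -= eta * g
    return w, b
```

At test time, the binary decision is sign(w · x + b), i.e., the side of the hyperplane the document falls on.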
Learning to Rank. Learning to Rank (L2R) refers to a set of techniques that learn the ranking of a list from training data. As suggested by Huang et al. [11], we restrict the supplied list to those labels that occur in the k neighboring documents (we empirically determined k = 45). Those labels that are also assigned to the document in question should be ranked higher than the others. To learn the ranking, we use the neighborhood, overlap, and translation-probability features proposed by Huang et al. [11]. Hence, at test time, the union of labels among the k nearest neighbors is ranked via the learned parameters. However, the algorithm itself does not offer the possibility of hard decisions on label assignments. Thus, we chose to cut off the ranked list at the position of the average number of assigned labels in the training data. In our experiments, we made use of the RankLib library 6 and found LambdaMART to outperform other list-wise L2R algorithms.
Multi-Layer Perceptron. As a representative of the neural network family, we employ a fully connected feed-forward neural network with one hidden layer, a so-called multi-layer perceptron (MLP). Compared to the linear models, the MLP has an additional intermediate hidden layer h with a nonlinear activation function f. Thus, we first compute h = f(W_1 x + b_1), and then y = W_2 h + b_2. The output y is then scaled to the interval (0, 1) by the sigmoid function σ as in logistic regression and compared to the gold standard by cross-entropy. The gradient for updating the parameters is computed by the chain rule, also known as back-propagation. The optimization itself is carried out by Adam [13] with the default hyper-parameters and α = 0.01. We chose a hidden layer size of 1,000 and use rectified linear units [17] as activation function f (except for the NYT dataset, where we use tanh due to numerical difficulties). For regularization, we apply dropout [10] with a probability of 0.5. The intermediate hidden layer can be regarded as a fine-tuned task-specific word embedding, which enables the classifier as a whole to learn nonlinear relationships among the features. To convert the odds σ(y) into a binary decision, several approaches suggest using a threshold learning technique [18,27]. In our initial experiments, however, we found that the most recent threshold learning technique yields rather unsatisfactory results in terms of the F1 measure. Instead, we use a fixed threshold of 0.2.
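The forward pass and thresholding just described can be sketched as follows. The weights here are illustrative placeholders; in the paper they are learned with Adam, and the hidden layer has 1,000 units rather than the toy size used below.

```python
import math

def mlp_predict(x, W1, b1, W2, b2, threshold=0.2):
    """One-hidden-layer MLP with ReLU, per-label sigmoid, fixed threshold."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]                      # h = ReLU(W1 x + b1)
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(W2, b2)]                      # y = W2 h + b2
    probs = [1.0 / (1.0 + math.exp(-yi)) for yi in y]    # sigmoid per label
    return [i for i, p in enumerate(probs) if p >= threshold]
```

Because each output unit passes through its own sigmoid and threshold, the network can assign any subset of labels rather than a single class.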

Multi-Label Adaption.
Binary Relevance. Linear models as well as Naive Bayes are restricted to mutually exclusive class assignments by design: only one class out of all possible ones is selected. In multi-label classification, however, multiple labels need to be assigned. The most common approach to adapt such classifiers is to train one classifier per class, which distinguishes its respective class from all others, i.e., decides for binary relevance [28] (also known as one-vs-all or one-vs-rest).
The training documents are supplied to all label-specific classifiers. Depending on the presence of the label that corresponds to the respective classifier, each example is treated either as positive or as negative. At test time, the classification result is composed of the binary decisions for each label.
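Composing the final label set from the per-label decisions is then straightforward, as in this sketch (the decision stumps stand in for trained linear models and are purely illustrative):

```python
def binary_relevance_predict(x, classifiers):
    """Union of positive decisions, one binary classifier per label."""
    return {label for label, clf in classifiers.items() if clf(x)}

# Illustrative decision functions standing in for trained per-label models.
clfs = {
    "economics": lambda x: x.get("tax", 0) > 0,
    "health":    lambda x: x.get("vaccine", 0) > 0,
}
print(binary_relevance_predict({"tax": 2, "vaccine": 1}, clfs))
```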
Classifier Stacking. Multi-value classification stacking [9] refers to a technique where the final classification result is composed by two classifiers. The so-called base classifier returns a ranked list of label predictions with confidence scores. Then, for each class, a meta-classifier takes these confidence scores along with the position in the ranked list as input and outputs a binary decision for the specific class. This technique enables transforming any classifier that returns confidence scores into a multi-label classifier. As meta-classifiers, we use decision trees with Gini impurity as splitting criterion. To limit complexity, we generate training data only for those meta-classifiers whose class is among the top 30 of the base classifier's ranking [9]. We use this decision tree module (abbreviated with the suffix *DT) as an alternative to hard cut-offs in Learning to Rank and to the fixed thresholds in multi-layer perceptrons (see Section 3.2.2). For comparison with the original work of Heß et al. [9], we also consider Rocchio as a base classifier. We furthermore experiment with applying the decision tree module on top of binary-relevance logistic regression.
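The stacking scheme can be sketched as below. Simple rule functions stand in for the trained decision trees, and the `top_k` parameter mirrors the top-30 restriction; both are illustrative assumptions.

```python
def stacked_predict(ranked, meta, top_k=30):
    """Stacked multi-label prediction.

    ranked: list of (label, confidence) from the base classifier, sorted by
    confidence descending. meta maps a label to a function (score, rank) ->
    bool, standing in for the per-class decision tree meta-classifier.
    """
    result = set()
    for rank, (label, score) in enumerate(ranked[:top_k]):
        decide = meta.get(label)
        if decide is not None and decide(score, rank):
            result.add(label)
    return result

ranked = [("a", 0.9), ("b", 0.4), ("c", 0.1)]
meta = {lab: (lambda s, r: s > 0.3) for lab in "abc"}
print(stacked_predict(ranked, meta))
```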

EXPERIMENTAL SETUP
We describe the datasets used for our experiments in Section 4.1, before we outline the experimental procedure in Section 4.2. We then depict the conducted preprocessing and introduce our evaluation metric of a sample-based F1 measure in Section 4.3. We choose a sample-based evaluation measure since it assesses the classification quality of each document separately. This reflects the workflow of manual document classification as it is done by domain experts in scientific digital libraries as well as by journalists.

Datasets
We have conducted our experiments on four datasets of English documents: two datasets are obtained from scientific digital libraries in the domains of economics and political sciences, along with two news datasets from Reuters and New York Times. Table 1 summarizes the basic statistics of the datasets. For each document in the datasets, there are manually created gold-standard annotations provided by respective domain experts, who work as professional subject indexers in the corresponding organizations. In addition, each dataset provides a domain-specific thesaurus that serves as controlled vocabulary of the gold standard. Its concepts are used as target labels in our multi-label document classification task. The thesaurus also offers sets of concept-specific phrases (i.e., skos:prefLabel and skos:altLabel in case of the SKOS format) that are used for concept extraction from the documents' full-text and titles [7]. The economics dataset consists of 62,924 documents and is provided by ZBW - Leibniz Information Centre for Economics. The annotations are taken from the Standard Thesaurus Wirtschaft (STW) version 9 7, which is a controlled domain-specific thesaurus for economics and business studies maintained by ZBW. The thesaurus contains 6,217 concepts with 12,707 concept-specific phrases. From these concepts, 4,682 are used in the corpus and thus considered in the multi-label classification task. Each document is annotated by domain experts with on average 5.26 labels (SD: 1.84). The political sciences dataset has 28,324 documents. Similar to the economics dataset, we made a legal agreement for the political sciences dataset with the German Information Network for International Relations and Area Studies 8, which is providing the documents. The labels are taken from the thesaurus for International Relations and Area Studies 9, which contains 9,255 concepts (and an equivalent number of concept-specific phrases, i.e., there are no alternative phrases). From these concepts, 7,234 are used in the corpus. Each document in the dataset has on average 8.07 labels (SD: 3.03). The Reuters RCV1-v2 dataset contains 805,414 articles. We chose articles where both the titles and the full-text of the documents are available. From this set of documents, we randomly selected 100,000 articles to match the scale of the scientific corpora.
In our experiments, we employ the thesaurus re-engineered from the Reuters dataset by Lewis et al. [14]. The thesaurus contains 117 concepts and a total of 173 concept-specific phrases. From these concepts, 101 are used in the corpus. Each document was annotated with on average 3.21 labels (SD: 1.41). The New York Times Annotated Corpus dataset (NYT) contains 1,846,656 articles. Each article has two sets of annotations, consisting of annotations created by a professional indexing service and annotations which were added by the authors using a semi-automatic system. We used the annotations provided by the indexing service because it is reasonable to expect that they are more consistent and of higher quality (cf. [9]). As for the Reuters dataset, we chose a random subset of 100,000 documents containing both full-text and titles. The number of concepts in the NYT dataset is 25,226. From these concepts, 6,809 are used in our random sample. Each document is annotated with on average 2.53 labels (SD: 1.78). Like the political sciences dataset, each concept consists of only a single specific phrase.

Procedure
Vectorization methods. We compare the different vectorizations of the input text as shown in Figure 1 and described in Section 3.1. One vectorization is based on term frequencies (TF-IDF) and the other is based on concept frequencies (CF-IDF). We experiment with the re-weighting method BM25 using term frequencies and BM25C using concept frequencies. The concatenation of both terms and concepts is denoted by CTF-IDF and BM25CT, respectively. As classifier, we employ kNN with cosine distance. The performance of kNN relies on the assumption that documents are well represented by the features and that similar documents have similar labels. Therefore, its classification performance is a good indicator for the quality of the features.
Classification methods. After determining the best-performing vectorization method, we compare the lazy learning as well as eager learning classifiers of Sections 3.2.1 and 3.2.2, combined with the multi-label adaption methods of Section 3.2.3, where appropriate. We leverage the linear models (SVMs and logistic regression) to perform multi-label classification with binary relevance, i.e., training one classifier per label. To adapt the Learning to Rank approach and the multi-layer perceptron to multi-labeling, we consider using thresholds as well as stacking with decision trees. We also experiment with stacking the decision tree module on top of binary-relevance logistic regression. Careful tuning of the hyperparameters is crucial to the success of machine learning algorithms, especially in those multi-label classification tasks where only few training examples are available per class. Striving to identify well-suited hyperparameters that are invariant to the concrete dataset, we keep all hyperparameters (as denoted in Section 3) fixed across all experiments and datasets.

Preprocessing and Evaluation
Preprocessing. Prior to counting terms and extracting concepts, both the input text and the concept-specific phrases of the thesauri are subject to preprocessing steps. This includes discarding all characters except for sequences of alphabetic characters with a length of at least two. Words connected with a hyphen are joined (i.e., the hyphen is removed). Detected words are lower-cased and lemmatized based on the morphological processing of WordNet [31].
Evaluation. For evaluation, we separate each dataset into 90% training documents and 10% test documents and perform a 10-fold cross-validation, such that each document occurs exactly once in the test set. For each test document, we compare the predicted labels with the label set of the gold standard and evaluate the F1 measure, the harmonic mean of precision and recall. Please note that there is a possibility that all documents annotated with a specific label fall into only one test set. Although no training data is available for these labels, we do not exclude them from our evaluation metric. Finally, we report the mean sample-based F-score over the ten folds of the cross-validation.
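The sample-based F1 computation for one fold can be sketched as follows: the F1 between predicted and gold label sets is computed per document and then averaged over documents.

```python
def sample_f1(predictions, gold):
    """Sample-averaged F1 over aligned lists of predicted and gold label sets."""
    total = 0.0
    for pred, true in zip(predictions, gold):
        tp = len(pred & true)
        if not pred and not true:
            total += 1.0  # empty prediction matching empty gold counts as perfect
            continue
        p = tp / len(pred) if pred else 0.0
        r = tp / len(true) if true else 0.0
        total += 2 * p * r / (p + r) if p + r else 0.0
    return total / len(predictions)

print(sample_f1([{"a", "b"}], [{"a", "c"}]))  # one doc: P = R = F1 = 0.5
```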

RESULTS
In this section, we describe the results of our experiments. Due to the high amount of possible pipeline configurations, we applied a step-by-step approach. For both the text vectorization step and the classification step, we search for a locally optimal solution to find the best overall classification strategy.
Results for Vectorization Methods. Table 2 shows the results for the text vectorization experiment. The term-based vectorization method TF-IDF performs consistently better than the purely concept-based vectorization method CF-IDF on both the titles and the full-text. The difference ranges from 0.003 F-score on Economics to 0.307 on Reuters. When combining the term vector with the concept vector, the performance is at least as good as that of the other text vectorization methods and in many cases better. This is more noticeable on titles than on full-texts. BM25 re-weighting does not improve the results compared to TF-IDF, neither in case of the titles nor the full-text. Rather, we observe a decrease in performance by up to 0.13. This experiment using a nearest neighbor classifier indicates that CTF-IDF is the best-suited vectorization method. Henceforth, we use CTF-IDF for comparing the performance of the classifiers.
Results for Classifiers.The results of comparing the different classifiers are documented in Table 3.As shown in the table, Bernoulli Bayes has a slight advantage over multinomial Bayes for titles.On the other hand, the multinomial variant has a slight disadvantage on full-texts.However, both methods consistently fall far behind kNN on full-texts.In the case of working with titles, the Bayes classifiers are able to keep up with kNN on two datasets.RocchioDT's scores are depending on the datasets and range from the lowest (Reuters) to a score only slightly different from kNN (NYT, political sciences).The generalized linear models SVM and logistic regression are close to each other.The difference is no more than 0.04 for any dataset.Considering Learning to Rank, we observe that the technique yields consistently lower scores than the multi-layer perceptron.Overall, the eager learners SVM, LR, L2R and MLP outperform both Naive Bayes and the lazy learners Rocchio, and kNN.Among all classifiers, MLP dominates on all datasets apart from NYT on titles, where LRDT achieves a .021higher score.While the stacked decision tree module increases the F-scores of logistic regression on all datasets with fewer than 100 documents per label (all but Reuters), the impact of the stacking method is inconsistent for the Learning to Rank and MLP approaches.It is noteworthy that there are cases where a classifier performs better on the title data than the same classifier applied on the full-text data.These are Bernoulli Bayes on the Reuters dataset and RocchioDT on the economics dataset.As a general rule, however, full-texts generate higher scores than the titles.Comparing different classifiers across titles and full-text, we can make the observation that some classifiers trained on titles outperform others that were trained on the full-text.Apart from the NYT corpus, the eager learners LR, LRDT and MLP on titles are superior to kNN on full-texts.Finally, we compare the F-scores of the 
best-performing multi-layer perceptron on titles with its scores obtained on full-text. On the NYT dataset, 58% of the F-score is retained when using only titles. On the political sciences and economics datasets, the retained F-score is 83% and 91%, respectively. On the Reuters dataset, the MLP using solely titles retains 95% of the F-score obtained with full-text information available.
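The binary-relevance setup behind eager learners such as LR, and the decision-tree stacking used by the -DT variants, can be sketched roughly as follows. The data here is synthetic and the exact stacking architecture may differ from the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                 # document feature vectors
Y = (rng.random((200, 5)) < 0.2).astype(int)   # sparse multi-label targets

# Binary relevance: one logistic-regression classifier per label.
base = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
scores = base.predict_proba(X)                 # per-label probabilities

# Stacked decision trees re-decide the label assignments from the base
# model's scores, countering overly conservative per-label predictions.
stack = MultiOutputClassifier(DecisionTreeClassifier(max_depth=3)).fit(scores, Y)
Y_pred = stack.predict(scores)
print(Y_pred.shape)  # (200, 5)
```

In practice, the stacking stage would be fit on held-out predictions rather than on the training scores themselves, to avoid overfitting the second stage.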

DISCUSSION
The results show that multi-label classification of text documents can reasonably be conducted using only the titles of the documents. Over all datasets, the multi-layer perceptron on titles retains 82% of the F-score obtained on full-text. This gives an empirical justification for the value of automated semantic document annotation using metadata. From the first experiment, we find that combining words with extracted concepts as features is preferable to using either alone. Concepts hold valuable domain-specific semantic information. The term frequency, on the other hand, holds implicit information that is likewise important for correct classification. Eager learners are, by design, capable of learning which terms or concepts need to be associated with the respective class. The results show that lazy learners also benefit from this joint representation. The second experiment shows that eager learners such as logistic regression and MLP consistently outperform lazy learners for multi-label classification. This result extends recent advancements in multi-labeling [18, 26] towards document classification scenarios with many possible output labels and only few examples per class. Inspecting the results for titles and full-text, the best-performing classifiers still perform better on the full-text. This is not surprising, since the full-text holds considerably more information (including the title). However, for all datasets apart from the NYT dataset, the difference in F-score of the best-performing MLP is small. The difficulty of classifying the documents in the NYT dataset can be explained by the characteristic that its titles consist of only 4 words on average. There may be a lower bound on the title length required to perform the classification task, since a short title limits the amount of available information and thus prohibits discrimination. From the other datasets, we can state that an average of 7 words per title leads to at least 80% retained F-score. Thus, it would require further investigation to understand the specific influence of the title length on the classification performance. The complexity of a multi-labeling problem depends on the number of available documents per label, independent of whether the full-text or the titles are used. Especially binary-relevance classifiers suffer from conservative label assignments (high precision, low recall) when many negative examples and only few positive examples are presented during training. While the results of the stacked decision tree module are inconsistent for MLP and L2R, it does alleviate the conservative-assignment problem of binary relevance when only few documents per label are available.
In our experiments over four large-scale real-world corpora covering a broad range of domains (economics, political sciences, and news), we did not limit the complexity by excluding rare labels, and we kept all independent variables as well as hyperparameters fixed. In our prior work [8], we used the thesaurus hierarchy to model label dependencies, which improves the classifications obtained by kNN. Despite no longer making use of the hierarchy, we are able to achieve even higher absolute F-scores by using eager learning techniques and supplying term features in addition to extracted concepts. We can therefore drop the constraint of a hierarchical organization among the labels. Due to this minimal amount of requirements and the invariant configuration of the text processing pipeline, we can expect our findings to generalize to a wide range of other corpora.
To validate the practical impact of the experimental results, we conducted a qualitative assessment in an expert workshop with three subject indexing specialists at ZBW, the national library for economics in Germany. The experts state that titles can be sufficient for the classification of scientific documents. They further noted that titles contain less information than what an intellectual indexer has available when manually conducting the classification task for the documents. They also pointed out that researchers carefully choose their titles for findability. The experts argued that reasonably good automatic indexing based on titles is valuable since, in contrast to processing the full-text, it does not raise the legal problems discussed in the introduction. We conclude that using the documents' titles for automated semantic annotation is not only technically possible with high quality but also valuable from a practical point of view.

CONCLUSION
We have shown that it is reasonable to conduct semantic annotation of documents by analyzing just the titles. Our experiments show that by using titles, over 90% of the classification performance obtained when using the full-text of the documents can be reached. This opens many new possibilities for using document classification even when only little input data is available, such as titles obtained from the documents' metadata on the Linked Open Data cloud.
To encourage further research in the field and to invite other researchers to compare and develop further methods, the full source code of our generic text processing pipeline is available on GitHub 10 . We invite practitioners and developers to use and extend the framework.

Figure 1: Illustration of the configurable text-processing pipeline used for our experiments. The pipeline starts with the vectorization of the input text, followed by feature re-weighting, classification, and evaluation. The emphasized edges and nodes show the most successful strategy applied to title data.
The TF-IDF weight of a term w in a document d ∈ D is defined as: TF-IDF(w, d) = TF(w, d) · IDF(w, D), where IDF(w, D) = 1 + log((|D| + 1) / (DF(w, D) + 1)) is the smoothed inverse document frequency of w. Linear models such as the SVM use the training examples to learn a decision boundary. This decision boundary is a separating hyperplane specified by a linear combination of the input features, w · x − b = 0. The parameters w and b are optimized to minimize the regularized training error: (1/n) ∑_{i=1}^{n} J(y_i, y(x_i)) + αR(w), where y(x) = w · x − b is the model's output and αR(w) is a regularization term on the model's weights, such as the L2-norm.
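The regularized objective described above can be minimized with stochastic gradient descent. A minimal sketch using scikit-learn's SGD-based linear SVM on synthetic data (hinge loss as J, L2-norm as R(w)):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification problem standing in for one label.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Linear SVM via SGD: hinge loss J plus an alpha-weighted L2 penalty R(w),
# matching the objective (1/n) sum J(y_i, y(x_i)) + alpha * R(w).
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, random_state=0)
clf.fit(X, y)

# The learned hyperplane is given by the weight vector and intercept.
print(clf.coef_.shape, clf.intercept_.shape)
```

In the multi-label setting of the paper, one such binary model would be trained per label under the binary-relevance scheme.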

Table 1: Statistics for the datasets: |D| documents, |C| concepts in the thesaurus, |L| labels assigned in the dataset, d/l mean documents per label, l/d mean labels per document along with the median l/d 50 , V vocabulary size, w/d mean terms per document, and c/d mean concepts per document.

Table 2: Sample-averaged F-scores of the text vectorization methods using kNN as the common classifier.

The F-score is the harmonic mean of precision, i.e., true positives w.r.t. false positives, and recall, i.e., true positives w.r.t. false negatives. When no label is predicted, the precision is set to zero. The F-scores are then averaged over the test documents. We chose this sample-based F1 measure over class-averaged or global variants because it is closest to the assumed application, where each individual document needs to be annotated as well as possible.
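The sample-averaged F1 measure described above corresponds to scikit-learn's `average="samples"` option: an F1 score is computed per test document over its label set and then averaged across documents. A small sketch:

```python
import numpy as np
from sklearn.metrics import f1_score

# Two test documents with gold and predicted label sets
# as binary indicator matrices (documents x labels).
Y_true = np.array([[1, 1, 0],
                   [0, 1, 1]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

# Per-document F1, averaged over documents; a document with no
# predicted labels contributes precision 0 (zero_division=0).
score = f1_score(Y_true, Y_pred, average="samples", zero_division=0)
print(round(score, 3))  # (2/3 + 1) / 2 = 0.833
```

Here the first document has precision 1 and recall 1/2 (F1 = 2/3), the second a perfect F1 of 1, giving a sample average of 5/6.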

Table 3: Sample-averaged F-scores for the classification methods using the best vectorization method, CTF-IDF.