Homonym Detection in Curated Bibliographies: Learning from dblp's Experience (full version)

Identifying (and fixing) homonymous and synonymous author profiles is one of the major tasks of curating personalized bibliographic metadata repositories like the dblp computer science bibliography. In this paper, we present and evaluate a machine learning approach to identify homonymous author bibliographies using a simple multilayer perceptron setup. We train our model on a novel gold-standard data set derived from the past years of active, manual curation at the dblp computer science bibliography.


Introduction
The unambiguous attribution of scholarly publications to their authors ranks among the most critical challenges for digital libraries. Internal user surveys and usage statistics repeatedly show that targeted author queries are predominant in the navigation patterns of those searching for scholarly material. Additionally, scientific organizations and policy makers often rely on author-based statistics as a basis for critical action. Universities and research agencies, for example, use publication and citation statistics for their hiring and funding decisions. In such cases, a correct attribution is essential.
Modern digital libraries are therefore compelled to provide accurate and reliable author disambiguation of their records. One such database is the dblp computer science bibliography, which collects, curates, and provides open bibliographic metadata of scholarly publications in computer science and related fields [11]. The database was established in 1993 by Michael Ley at the University of Trier. Since 2011, dblp has been a joint service of the University of Trier and Schloss Dagstuhl LZI. As of January 2018, the collection contains metadata for more than 4 million publications, which are listed on more than 2 million author bibliographies. Every year, about 400,000 new publications are added to the database.
As can be easily seen from those numbers, the enormous growth in scholarly output in recent years has made purely manual curation of author bibliographies impracticable. Therefore, algorithmic methods for supporting author disambiguation tasks are necessary. The two most notorious problem categories are: (1) cases when different persons share the same name (known as the homonym problem), and (2) cases when the name of a particular author is given in several different ways (known as the synonym problem). Furthermore, there are even mixed cases when a person is subject to both the homonym and the synonym problem at the same time. Due to these problems, incorrect assignments of publications to authors might lead to defective bibliographies. Hence, we need proper capabilities of detecting such defects.

Our contribution
In this paper, we present and evaluate a machine learning approach to detect homonymous author bibliographies in large bibliographic databases. To this end, we train a standard multilayer perceptron (see, e.g., [19, Ch. 2]) to classify an author profile into either of the two classes "homonym" or "non-homonym". While the setup of our artificial neural network is pretty standard, we make use of two original components to build our classifier:
• We use historic log data from the past years of active, manual curation at dblp to build a "golden" training and testing data set of more than 24,000 labeled author profiles for the homonym detection task.
• We define a vectorization scheme that maps inhomogeneously sized and structured author profiles onto numerical vectors of fixed dimension. The design of these numerical features is based on the practical experience and domain knowledge obtained by the dblp team and uses only a minimal amount of core bibliographic metadata. We also study the impact of the individual feature groups on our classifier's effectiveness.
Please note that since our approach has been designed as an effort to improve the dblp computer science bibliography, it (in accordance with dblp's metadata curation philosophy, see Sec. 2.1) intends to keep a human curator in the loop and just uncovers defective profiles, instead of trying to algorithmically resolve the defect. Fully automatic approaches are only briefly discussed in Sec. 4.

Related work
Author name disambiguation in digital libraries has been the subject of intensive research for decades. For an overview on different algorithmic approaches see the survey by Ferreira et al. [4]. The vast majority of these approaches tackle author name disambiguation as a batch task by re-clustering all the existing publications at once. However, in the practice of a curated database like dblp, disambiguation is performed rather incrementally as new metadata is added, and by preserving the curation effort that has been made to the bibliographies in earlier iterations. Only recently, a number of approaches have been published that consider these practice-driven constraints [2,3,21,25,27].
With the recent advances made in the field of artificial intelligence, a number of (deep) artificial neural network methods have also been applied to author name disambiguation problems [26,17]. However, those previous approaches focus on learning the semantic similarity of individual publications. It is still unclear how these approaches can be used to assess the homonymity of a whole author's bibliography, as is required in our scenario.
There exist many data sets derived from dblp that are used to train or evaluate author name disambiguation methods [21,7,8,9,16]. For a survey and discussion of the individual advantages and disadvantages of these recent data sets see Müller et al. [18]. All of those data sets are based on a single snapshot of the dblp database, and they concentrate on a narrow (and sometimes biased) selection of publications from dblp. To the best of our knowledge, there is no data set that considers the evolution of the curated bibliographies in dblp besides the recently published historical corrections test collection of Reitz [22], which is the foundation of our contribution (see Sec. 2.2).

Learning homonymous author bibliographies

Metadata curation at dblp
One of dblp's characteristic features is the assignment of a publication to its individual author (even in the presence of incomplete information and homonymous or synonymous names) and the curation of bibliographies for all authors in computer science. In order to guarantee a high level of data quality, this assignment is a semi-automated process that keeps the human data curator in the loop and in charge of all decisions. In detail, for each incoming publication, the mentioned author names are automatically matched against the existing author profiles in dblp using several specialized string similarity functions [12]. Then, a simple social network analysis (mainly based on the co-author linkage) is performed to rank the potential candidate profiles. If a matching author profile is found, the authorship record is assigned, but only after the ranked candidate lists have been manually checked by the human data curator. In addition, missing, incomplete, or erroneous information in either the incoming publication metadata or the matched author profiles is updated, and some further normalization is applied.
In cases that remain unclear even after a curator checked all candidates, a manual in-depth check is performed, often involving external sources. However, the amount of new publications processed each day makes exhaustive detailed checking impossible, which inevitably leads to some incorrect assignments. Thus, while the initial checking of assignments ensures an elevated level of data quality, a significant number of defective author profiles still find their way into the database, especially in the case of homonymous and synonymous names.
To further improve the quality of the database, another automated process checks all existing author profiles in dblp on a daily basis. This process is designed to uncover defects that become evident as a result of newly added data or corrected entries. By analyzing an author profile and its linked coauthor profiles for suspicious patterns, this process can detect probably synonymous profiles [24]. For the detection of probably homonymous profiles, no automated process existed prior to the results presented here, and the dblp team has largely relied on hints from the community to become aware of such situations [23]. A simple clustering approach has been used to visualize the (in-)coherence of an author profile's coauthor community, yet without providing conclusive information (see Fig. 1a).
If a suspicious case of a synonymous or homonymous profile is validated by manual inspection, then the case is corrected by either merging or splitting the author profiles, or by reassigning a selection of publications from one profile to another. In 2017 alone, author profiles were merged in 9,731 cases, a total of 3,254 author profiles were split, and in 6,213 cases partial profiles were redistributed. This curation history of dblp forms a valuable "golden" training and testing data set for curating author profiles [22].

A gold-data set for homonym detection
We use the historic dblp curation data from the embedded test collection as described by Reitz [22, Sec. 3.2] to build a "golden" data set for homonym detection. This collection compares dblp snapshots from different timestamps t1 < t2 and classifies the manual corrections made to the author bibliographies between t1 and t2. For this paper, we use the historic data from the dblp log files for the observation interval [t1, t2] with t1 = "2014-01-01" and t2 = "2018-01-01". The test collection is available online [22] under the Open Data Commons Attribution License (ODC-By).
Within this test collection, we selected all source profiles from the defect cases of type "Split" as our training and testing instances of label class "homonym". That is, these are profiles at timestamp t1 where a human curator at some point later between t1 and t2 decided to split the profile (i.e., the profile has actually been homonymous at timestamp t1).
Additionally, from all other profiles in the dblp data set at timestamp t1, we selected as instances of label class "non-homonym" those profiles which either (a) contained non-trivial person information like a homepage URL or affiliation information, or (b) had at least one author name in dblp ending with a "magic" 4-digit number (i.e., the profile had been manually disambiguated [11] prior to t1). This selection makes sense since those profiles had all been checked by a human curator at some point prior to t1 and were not split in the period between t1 and t2. While this is not necessarily a proof of non-homonymity, such profiles are generally more reliable than an average, random profile from dblp.
In order to further rule out trivial cases for both labels, we dropped all profiles that at timestamp t1 listed either fewer than two publications or fewer than two coauthors. We ended up with a "golden" data set of 2,802 profiles labeled as "homonym" and 21,576 profiles labeled as "non-homonym" (i.e., a total of 24,378 profiles) from the dblp data set at timestamp t1. Please be aware that the labels in this data set come with a one-sided error: The cases labeled "homonym" are reliable since we have proof of such a correction from the historic dblp test collection. On the other hand, the cases labeled "non-homonym" have been constructed heuristically and may not always be correct.
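The selection rules above can be summarized in a few lines of code. The following is only a minimal sketch under our own assumptions: the `Profile` record, its field names, and the regular expression are hypothetical stand-ins for the information available in the test collection, not the actual data model of dblp.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Profile:
    """Hypothetical view of an author profile at timestamp t1."""
    names: List[str]
    num_publications: int
    num_coauthors: int
    has_person_info: bool  # e.g., homepage URL or affiliation is present
    split_later: bool      # a "Split" correction was logged between t1 and t2


# Assumed pattern for a name ending in a 4-digit disambiguation number,
# e.g., "Wei Wang 0001".
DISAMBIGUATED = re.compile(r" \d{4}$")


def gold_label(profile: Profile) -> Optional[str]:
    """Return "homonym", "non-homonym", or None if the profile is not used."""
    # Rule out trivial cases for both labels.
    if profile.num_publications < 2 or profile.num_coauthors < 2:
        return None
    # Reliable label: a curator split this profile later on.
    if profile.split_later:
        return "homonym"
    # Heuristic label: the profile shows signs of earlier manual curation.
    curated = profile.has_person_info or any(
        DISAMBIGUATED.search(name) for name in profile.names
    )
    return "non-homonym" if curated else None
```

Profiles for which the function returns `None` are neither trivially excluded by mistake nor silently labeled; they simply do not enter the gold-data set.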

Vectorization of author bibliographies
In order to train an artificial neural network using our labeled profiles, we need to represent the non-uniformly sized author profiles at timestamp t1 as numerical vectors of fixed dimension. To this end, our vectorization makes use of two precomputed auxiliary structures:
• local coauthor clusterings: For each profile, we use a very simple connected component approach to cluster its set of coauthors: First, consider the local (undirected) subgraph of the dblp coauthor network containing only the current person and all direct coauthors as nodes. We call this the local coauthor network. Then, remove the current person and all incident edges from the local coauthor network. The remaining connected components form the coauthor clusters of the current person. See Fig. 1a for a small example.
• title word embeddings: We train a vector representation of all title words in the dblp corpus using the word2vec algorithm [15]. In particular, we use the DeepLearning4J [5] implementation of word2vec, using the skip-gram model and 150 embedding dimensions. To allow for reproducibility, an overview of the further model hyperparameters is given in Fig. 1b. In the vectorization below, we use this word embedding model as basis to compute paragraph vectors (also known as doc2vec [10]) of whole publication titles, or even collections of titles.
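The local coauthor clustering described above amounts to a connected component search on the local coauthor network after the current person has been removed. A minimal sketch, assuming that the local network is given as a list of undirected edges (the function name and the input format are our own illustrative choices):

```python
from collections import defaultdict


def coauthor_clusters(person, local_edges):
    """Cluster the coauthors of `person` by connected components.

    `local_edges` is an iterable of (a, b) pairs from the local coauthor
    network, including the edges between `person` and each coauthor.
    """
    adjacency = defaultdict(set)
    nodes = set()
    for a, b in local_edges:
        nodes.update((a, b))
        if person in (a, b):
            continue  # removing the person also removes all incident edges
        adjacency[a].add(b)
        adjacency[b].add(a)
    nodes.discard(person)

    clusters, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(component)
    return clusters
```

For a person P with coauthors A, B, C, D, and E, where only A and B as well as C and D have published together, this yields the three clusters {A, B}, {C, D}, and {E}.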
The design of the feature components of our vectorization is based on the experience and domain knowledge obtained by the dblp team during the years of actively curating the dblp bibliographies. That is, we identified different features that are implicitly and explicitly taken into consideration whenever a human curator at dblp is assessing the validity of a profile. In particular, we make use of the following feature groups in our vectors. A detailed listing of all features is given in Fig. 2. In Section 3, we will study the impact of each feature group on the classifier's performance.
• group B: Basic, easy-to-compute facts of the author's profile, i.e., the number of publications, coauthors, and coauthor relations on that profile.
• group C: Features of the local coauthor clustering, like the number of clusters and features of their size distribution. The aim of this feature set is to uncover the incoherence of local coauthor communities, which experience shows to be symptomatic of homonymous profiles.
• group T: Geometric features (in terms of cosine distance) of the embedded paragraph vectors for all publication titles listed on that profile. This feature set aims to uncover inhomogeneous topics of the listed publications, which might be a sign of a homonymous profile.
• group V: Geometric features (in terms of cosine distance) of the embedded paragraph vectors for all venues (i.e. journals and conference series) listed on that profile, where each venue is represented by the complete collection of all titles published in that venue. This feature set also aims to uncover inhomogeneous topics by using the aggregated topical features of its venue as a proxy for the actual publication.
• group Y: Features of the publication years listed on that profile. The aim of this feature group is to uncover profiles that mix up researchers with different years of activity.
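To illustrate how such feature groups turn a profile of arbitrary size into a vector of fixed dimension, consider the following sketch. The concrete features chosen here (mean and maximum pairwise cosine distance, share of the largest cluster, year span and spread) are assumptions made for illustration only; the actual feature list is the one given in Fig. 2. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); defined as 1.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_features(vectors):
    """Illustrative group T/V features: mean and maximum pairwise distance."""
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    if not distances:
        return [0.0, 0.0]
    return [mean(distances), max(distances)]


def cluster_features(cluster_sizes):
    """Illustrative group C features: cluster count, share of largest cluster."""
    if not cluster_sizes:
        return [0.0, 0.0]
    return [float(len(cluster_sizes)), max(cluster_sizes) / sum(cluster_sizes)]


def year_features(years):
    """Illustrative group Y features: span and spread of publication years."""
    return [float(max(years) - min(years)), pstdev(years)]


def vectorize(num_pubs, num_coauthors, cluster_sizes, title_vectors, years):
    """Concatenate the feature groups into one fixed-length vector."""
    return (
        [float(num_pubs), float(num_coauthors)]  # group B (partial)
        + cluster_features(cluster_sizes)        # group C
        + distance_features(title_vectors)       # group T
        + year_features(years)                   # group Y
    )
```

Regardless of how many publications or coauthors a profile lists, `vectorize` always returns eight numbers in this sketch, which is the property the neural network input requires.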

Classifier setup
As classifier, we define a standard multilayer perceptron [19, Ch. 2] with three hidden layers. In particular, for each experiment, our classifier has a variable number of input nodes (depending on the concrete selection of feature groups we use in each experiment, see Sec. 3) and one output node for each of the label classes "non-homonym" and "homonym". The activation functions used in the hidden layers are rectified linear units (ReLU), while the output layer uses the softmax activation function in order to allow for an interpretation of the output values as a probability distribution. We use binary cross-entropy as loss function and stochastic gradient descent as optimization algorithm. L2 regularization is used to fight overfitting. Further hyperparameters of our classifier are listed in Fig. 3.
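For readers less familiar with this architecture, the forward pass of such a network can be sketched in plain Python. This is not our DeepLearning4J implementation: the hidden layer width of 32 nodes, the weight initialization, and all function names are assumptions for illustration, and training (stochastic gradient descent with L2 regularization) is omitted.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, bias):
    """One fully connected layer: weights has one row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def build_mlp(n_inputs, hidden_sizes=(32, 32, 32), n_classes=2, seed=0):
    """Randomly initialized layers; the hidden sizes are an assumption."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden_sizes, n_classes]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict_proba(layers, features):
    """ReLU in the hidden layers, softmax in the output layer."""
    activations = features
    for weights, bias in layers[:-1]:
        activations = relu(dense(activations, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(activations, weights, bias))


def cross_entropy(probabilities, true_class):
    """Loss for a single example with the given true class index."""
    return -math.log(max(probabilities[true_class], 1e-12))
```

The two softmax outputs sum to one, so the second entry can be read as the estimated probability of the label "homonym" if the classes are ordered as ("non-homonym", "homonym").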

Implementation
We implemented and trained our classifier using the open-source Java library DeepLearning4J [5]. While Python-based implementations like TensorFlow or Keras seem to be more commonplace in academic research contexts, the production environment of dblp and dblp's custom code is mainly based on Java. Hence, an enterprise-level Java library was the best fit for our live production environment. All experiments have been conducted on a standard Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz desktop PC, using Java 8 and allocating 16 GB of RAM to the JVM. Before running our experiments, we randomly split our gold-data profiles into fixed sets of 80% training and 20% testing profiles. Since neural networks work best when data is normalized, we rescaled all profile features to have an empirical mean of 0.0 and a standard deviation of 1.0 on the training data. For each set of vectorization feature groups we studied, 25 models have been trained independently (i.e., using a different random seed each) on the training profiles and evaluated on the testing profiles.
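The data preparation steps described above can be sketched as follows. Note that the scaling parameters are estimated on the training profiles only and are then reused unchanged for the testing profiles. The function names and the seed are hypothetical; this is an illustration in Python, not our Java code.

```python
import random
from statistics import mean, pstdev


def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle once with a fixed seed and cut into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train_rows):
    """Per-feature mean and standard deviation of the training data."""
    columns = list(zip(*train_rows))
    means = [mean(column) for column in columns]
    # Guard against constant features to avoid division by zero.
    stds = [pstdev(column) or 1.0 for column in columns]
    return means, stds


def transform(rows, means, stds):
    """Rescale each feature to zero mean and unit standard deviation."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]
```

A typical use is `train, test = split_train_test(vectors)`, followed by `means, stds = fit_scaler(train)` and a call to `transform` for both sets with the same `means` and `stds`.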

Quality measures
In information retrieval contexts, the quality of algorithms is often evaluated by measures like precision, recall, and F1-score. However, in the case of unbalanced label classes, these three measures are known to give misleading scores [20]. Homonym detection is such an unbalanced case. In fact, in our gold data set we find our labels to be unbalanced with a population prevalence of the homonym defect of 11.4%. And there is no reason to believe that this ratio is even representative of a bibliography database as a whole, where experience suggests that the true ratio might be much closer to 1.0% or even 0.1%. Hence, when evaluating homonym classifiers, we propose to rather use other measures like Matthews correlation coefficient (MCC) [14] or the area under the receiver operating characteristic (AUROC) [13] instead, which are known to yield reliable scores for diagnostic tests even if class labels are severely unbalanced [20]. However, in Fig. 4, we still give precision, recall, and F1-score in order to allow for our results to be compared with other studies.
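For reference, all four threshold-based measures can be computed from the entries of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function names and the example counts are our own and are not taken from Fig. 4.

```python
import math


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 if any marginal sum is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up example: 100 profiles, 10 of them truly homonymous.
print(precision_recall_f1(tp=6, fp=2, fn=4))
print(mcc(tp=6, fp=2, fn=4, tn=88))
```

Unlike precision, recall, and F1-score, the MCC also takes the true negatives into account, which is why it remains informative when one class is much rarer than the other.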

Results
The results of our experiments are summarized in Fig. 4. As can be seen from the MCC scores (and probably not surprisingly), our classifier is most effective if all studied feature groups are taken into consideration (i.e., feature set "BCTVY"). Note that for this set of features, precision is much higher than recall. However, this is actually tolerable in our real-world application scenario of unbalanced label classes: We need to severely limit the number of false-positively diagnosed cases (i.e., we need a high precision) for our classifier's output to be practically helpful for a human curator, while at the same time in a big bibliographic database, the ability to manually curate defective profiles is more likely limited by the team size than by the number of diagnosed cases (i.e., recall does not necessarily need to be very high).
One interesting observation that can be made in Fig. 4 is that the geometric features of the publication titles alone do not seem to be all too helpful (see feature set "BT" in Fig. 4), while the geometric features of the aggregated titles of the venues seem to be the single most helpful feature group (see feature set "BV" in Fig. 4). We conjecture that this is due to mere title strings of individual publications not being expressive and characterful enough in our setting to uncover semantic similarities. One way to improve feature group T would be to additionally use keywords, abstracts, or even full texts to represent a single publication, provided that such information is available in the database. However, it should be noted that even in its limited form, feature group T is still able to slightly improve the classifier if combined with feature group V (see feature set "BTV" in Fig. 4).
In addition to our experiments, we implemented a first prototype of a continuous homonym detector to be used by the dblp team in order to curate the author profiles of the live dblp database. To this end, all dblp author profiles are vectorized and assessed by our classifier on a regular basis. This prototype does not just make use of the binary classification as in our analysis of Fig. 4, but rather ranks suspicious profiles according to the probability of label "homonym" as inferred by our classifier (i.e., the softmax score of prediction label "homonym" in the output layer). The resulting top entries of the ranking are presented to the dblp curators as a web front end in order to easily access, assess, and (if necessary) resolve the suspicious profiles. A screenshot of the web front end is given in Fig. 5. As a small sample from practice, we computed the top 100 ranked profiles from the dblp XML dump of April 1, 2018 [1], and we checked those profiles manually. We found that in that practically relevant top list, 74 profiles were correctly uncovered as homonymous profiles, while 12 profiles were false positives, and for 14 profiles the true characteristic could not be determined even after manually researching the case.
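The ranking step and the arithmetic behind the manual check can be sketched as follows. The function names are hypothetical, and the bounds simply treat the 14 undetermined profiles once as all incorrect and once as all correct; they are derived from the counts reported above and are not an additional measurement.

```python
def rank_suspicious(scores, top_k=100):
    """Sort profile ids by descending "homonym" probability, keep the top k.

    `scores` maps a profile id to the softmax score of label "homonym".
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def precision_bounds(confirmed, false_positives, undetermined):
    """Lower bound, upper bound, and precision among resolved cases only."""
    total = confirmed + false_positives + undetermined
    lower = confirmed / total
    upper = (confirmed + undetermined) / total
    resolved = confirmed / (confirmed + false_positives)
    return lower, upper, resolved


# Counts from the manual check of the top 100 ranked profiles.
lower, upper, resolved = precision_bounds(74, 12, 14)
print(f"{lower:.2f} {upper:.2f} {resolved:.3f}")
```

For the reported sample, the precision of the top list therefore lies between 0.74 and 0.88, and amounts to about 0.86 if only the 86 profiles with a determinable outcome are counted.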

Discussion
In this paper we presented and evaluated a classifier to detect homonymous author profiles which is motivated by the day-to-day curation work of the dblp computer science bibliography. Our classifier was made possible by deriving a gold-data set of profiles from the past years of active manual curation. In order to apply this approach to any other curated bibliography database, a similar extensive history of curation log data is required. Hence, if such a curation log does not yet exist at your digital library, we strongly encourage you to start collecting such information now in order to enable you to make use of this valuable data set in the future.
However, it should be noted that our vectorization of profiles is based on observations made for the field of computer science, and your mileage may vary if you want to apply it to fields of different characteristics with respect to coauthor communities, choice of publication venues, or frequency of publishing. Since the actual selection of vector features is modular (as demonstrated by our experiments), it should be possible to derive and tune a fitting set of features for another field of study.
Furthermore, our approach is geared towards a scenario where a human curator is taking care of the actual task of fixing the homonymous profile, as is the philosophy employed at the dblp computer science bibliography. A desired extension of our work is probably a fully automatic approach which also fixes (or at least suggests a solution for) the homonymous profile. By using the pairwise semantic similarity of publications (e.g. [26,17]), a clustering of the defective profile might yield such a solution, which is a topic of future research.