Linking Named Entities across Languages using Multilingual Word Embeddings

Digital libraries are online collections of digital objects that can include text, images, audio, or videos in several languages. It has long been observed that named entities (NEs) are key to accessing digital library portals, as they are contained in most user queries. However, NEs can have different spellings in each language, which degrades the performance of user queries when retrieving documents across languages. Cross-lingual named entity linking (XEL) connects NEs from documents in a source language to external knowledge bases in another (target) language. The XEL task is especially challenging due to the diversity of NEs across languages and contexts. This paper describes an XEL system applied to and evaluated with several language pairs including English and various low-resourced languages of different linguistic families, such as Croatian, Finnish, Estonian, and Slovenian. We tested this approach by analyzing documents and NEs in low-resourced languages and linking them to the English version of Wikipedia. We present the resulting study of this analysis and the challenges involved in the case of degraded documents from digital libraries. Future work will extensively analyze the impact of our approach on the XEL task with OCRed documents.


INTRODUCTION
Digital libraries are composed of a large number of digital contents (e.g. journals, books, magazines, videos, and so on) in several languages about diverse subjects (e.g. history, languages, politics, sciences, philosophy, and so on). Named entities have been demonstrated to be essential to digital library access as they are included in a majority of the search queries submitted to digital library portals [3]. However, the spelling of an entity is language-dependent, which impacts the performance of search engines when trying to retrieve all relevant documents with respect to a query. For instance, the entity "United States" is written differently in different languages: "Estados Unidos" (in Spanish) and "États-Unis" (in French).

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. JCDL '20, August 1-5, 2020, Virtual Event, China. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7585-6/20/08. $15.00. https://doi.org/10.1145/3383583.3398597
Moreover, data from different sources can contain ambiguous, complementary, and/or duplicate information about named entities. Named entities are often not distinctive, since a single name may correspond to multiple entities; a disambiguation process is thus essential to identify the correct named entities to be indexed in digital libraries. In this setting, a monolingual disambiguation analysis cannot disambiguate entities expressed in several languages against a common knowledge base.
Named Entity Linking (NEL) aims to recognize mentions in a document and link them to their corresponding entries in a Knowledge Base (KB), such as Wikipedia and Freebase. Most data sets for NEL are available only in English. Among them, the AIDA data set [8] is the main data set used to train state-of-the-art NEL systems. Unfortunately, there are few data sets for low-resourced languages, with the notable exception of the WikiANN corpora [14].
Additionally, Cross-Lingual Named Entity Linking (XEL) considers documents that are written in a source language that is different from the target language of the KB [20]. In addition to the challenges of NEL such as multiple surface forms of a named entity [18], XEL disambiguates mentions in several languages by analyzing the spellings and contexts related to each language.
Digital libraries often contain the digitised version of old documents that are degraded due to storage conditions, handling by users, and the inherent vice of the material (e.g. paper naturally deteriorates over time). These problems cause numerous errors at the character and word levels in the OCR of these documents [11]. Linhares Pontes et al. [11] analyzed the impact of OCR quality on the NEL task and achieved satisfying results for NEL. They provided recommendations on the OCR quality that is required for a given level of expected NEL performance. However, their approach is monolingual, restricting the analysis and linking of entities to knowledge bases that are in the same language.
The XEL task is especially challenging due to the diversity of NEs across languages and contexts. This paper proposes a cross-lingual NEL extension that can be easily adapted to any source language. Our approach uses multilingual word embeddings and a fine-tuning method to represent words and entities in multiple languages in the same dimensional space, and then to disambiguate mentions across languages. We describe an XEL system applied to and evaluated with several language pairs including English and various low-resourced languages of different linguistic families, such as Croatian, Finnish, Estonian, and Slovenian. We tested this approach by analyzing documents and NEs in low-resourced languages and linking them to the English version of Wikipedia. We present the resulting study of this analysis and the challenges involved in the case of degraded documents from digital libraries.
The remainder of the paper is organized as follows: Section 2 gives a brief overview of the most recent available NEL and XEL approaches in the state of the art. Section 3 details our approach to extend a monolingual NEL system to the XEL task by using multilingual word embeddings. Then, the experimental setup and evaluation are described in Sections 4 and 5, respectively. Finally, conclusions and future work are set out in Section 6.


RELATED WORK
Given a document d_j, its set of recognized mentions M_j = {m^j_1, ..., m^j_{n_j}}, and a KB containing a set of entities E = {e_1, ..., e_s}, Named Entity Linking (NEL) aims to map each mention m^j_i to its corresponding entity e_k in the KB [18]. NEL approaches can be divided into two classes: disambiguation approaches (which take M_j as an input) and end-to-end approaches (which do not take M_j as an input, but compute it). While end-to-end approaches extract candidate entities from documents and then disambiguate them to the correct entries in a given KB [9], disambiguation approaches only disambiguate entities already recognized in documents [6,10,15].
Among the disambiguation-only approaches, Ganea and Hofmann [6] built a deep learning model for joint document-level entity disambiguation. They embed entities and words in a common vector space and use a neural attention mechanism to select words that are informative for the disambiguation decision. Then, their model collectively disambiguates the mentions in a document. We describe this approach in further detail in Section 3.1. Motivated by Ganea and Hofmann's approach, Le and Titov [10] analyzed relations between mentions as latent variables in their neural NEL model. They rely on representation learning and learn embeddings of mentions, contexts, and relations to reduce the amount of human expertise required to construct the system and make the analysis more portable across domains.
In the class of end-to-end approaches, Raiman and Raiman [15] developed a system for integrating symbolic knowledge into the reasoning process of a neural network through a type system. They constrained the behavior of the network to respect the desired symbolic structure, and designed the type system automatically, without human effort. Their model first uses heuristic search or stochastic optimization over the discrete variables that define a type system, informed by an oracle and a learnability heuristic. Based on a joint analysis of the named entity recognition and linking tasks, Kolitsas et al. [9] proposed an end-to-end NEL system that jointly discovers and links entities in a document. They generate all possible spans (mentions) that have at least one possible entity candidate. Then, each mention-candidate pair receives a context-aware compatibility score based on word and entity embeddings [6] coupled with neural attention and a global voting mechanism.

3. Languages that do not have large monolingual or parallel corpora and/or manually created linguistic resources sufficient to build strong statistical NLP applications.
Extending this monolingual analysis, Cross-Lingual Named Entity Linking (XEL) analyzes documents and named entities that are in a different language than that used for the content of the knowledge base. In this context, McNamee et al. [12] proposed an XEL approach and examined the importance of transliteration, the utility of cross-language information retrieval, and the potential benefit of multilingual named entity recognition on the XEL task.
Rijhwani et al. [17] proposed a zero-shot transfer learning method for XEL. Their approach uses phonological representations and a pivot-based method, which leverages information from a high-resource "pivot" language to train character-level neural entity linking models that are transferred to the source low-resource language in a zero-shot manner.
Zhou et al. [20] extensively evaluated the effect of resource restrictions on existing XEL methods in low-resource settings. They investigated a hybrid candidate generation method, combining existing lookup-based and neural candidate generation methods and proposed a set of entity disambiguation features that are entirely language-agnostic. Finally, they designed a non-linear feature combination method, which makes it possible to combine features in a more flexible way.
Unlike previous XEL approaches, we do not use any additional resources, pivot-based methods, or neural candidate generation to link entities across languages. Our cross-lingual extension requires only pre-trained multilingual word embeddings to disambiguate mentions in foreign languages against the English KB.

CROSS-LINGUAL ENTITY LINKING
This section describes our contribution to adapt Ganea and Hofmann's approach to the XEL task. We first give a short description of Ganea and Hofmann's approach (Section 3.1), and then detail how we extended this approach to the XEL task by using multilingual word embeddings (Section 3.2).

Ganea and Hofmann's approach
Entity Disambiguation (ED) approaches assume that the named entities in the documents have already been identified. These approaches then aim to analyse the context of these entities to disambiguate them against a KB. In this context, Ganea and Hofmann [6] (GH) proposed a deep learning model for joint document-level entity disambiguation.
They project entities and words into a common vector space, which avoids hand-engineered features, multiple disambiguation steps, and the need for additional ad-hoc heuristics when solving the ED task. Entities for each mention are locally scored based on cosine similarity with the respective document embedding. Combined with these embeddings, they proposed an attention mechanism over local context windows to select words that are informative for the disambiguation decision. The final local scores are based on the combination of the resulting context-based entity scores and a mention-entity prior. This mention-entity prior, p(e|m), is a conditional distribution of the co-occurrence of the mention m with the entity e. GH collected mention-entity co-occurrence counts from Wikipedia to calculate this distribution.
Finally, mentions in a document are resolved jointly by using a conditional random field in conjunction with an inference scheme.
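The local scoring idea above can be illustrated with a minimal NumPy sketch. This is not GH's actual implementation (their model learns attention and combination weights, and adds joint inference with a conditional random field); it only mimics the shape of the computation: attend over context words using candidate entity embeddings, score each candidate against the attended context, and combine with the log prior log p(e|m). All names and dimensions are hypothetical.

```python
import numpy as np

def local_score(entity_vecs, context_vecs, prior, beta=1.0):
    """Toy version of a GH-style local score for one mention.

    entity_vecs: (n_cand, d) candidate entity embeddings
    context_vecs: (n_words, d) embeddings of the mention's context window
    prior: (n_cand,) mention-entity prior p(e|m) from Wikipedia counts
    """
    # Attention: a context word is informative if some candidate is close to it.
    att_logits = entity_vecs @ context_vecs.T          # (n_cand, n_words)
    word_scores = att_logits.max(axis=0)               # best support per word
    att = np.exp(word_scores - word_scores.max())
    att /= att.sum()                                   # softmax over context words
    context = att @ context_vecs                       # attended context vector (d,)
    # Context-entity compatibility combined with the log prior log p(e|m).
    return entity_vecs @ context + beta * np.log(prior + 1e-12)

rng = np.random.default_rng(0)
ents = rng.normal(size=(3, 5))                         # 3 candidates, 5-dim toy space
ctx = rng.normal(size=(10, 5))                         # 10 context words
scores = local_score(ents, ctx, np.array([0.7, 0.2, 0.1]))
best = int(np.argmax(scores))                          # locally best candidate
```

In the full model these local scores are then resolved jointly over all mentions in the document, rather than mention by mention.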

Cross-lingual extension
In order to extend GH's system to a cross-lingual setting, we made a number of modifications to their approach. Instead of using the Word2Vec embeddings, we used the pre-trained multilingual MUSE embeddings [4]. These embeddings are available in 30 languages (including Croatian, Estonian, Finnish, and Slovenian, to mention a few) and are aligned in a single vector space. Therefore, words like "house" and "talo" ("house" in Finnish) have similar word representations. One of the main goals of using these embeddings is to generate multilingual entity embeddings that can provide entity representations for mentions in several languages.
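As an illustration of what "aligned in a single vector space" buys us, the sketch below parses vectors in the fastText/MUSE text format and finds a cross-lingual nearest neighbour by cosine similarity. The vector values here are invented toy numbers standing in for real MUSE files (e.g. the released wiki.multi.en.vec and wiki.multi.fi.vec); only the file format and the nearest-neighbour logic are faithful.

```python
import io
import numpy as np

def load_vec(stream, top_n=None):
    """Parse word vectors in the fastText/MUSE text format:
    first line '<count> <dim>', then one 'word v1 ... vd' line per word."""
    n, d = map(int, stream.readline().split())
    words, vecs = [], []
    for i, line in enumerate(stream):
        if top_n is not None and i >= top_n:
            break
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vecs.append(np.asarray(parts[1:], dtype=float))
    return words, np.vstack(vecs)

# Toy aligned English and Finnish spaces (hypothetical values).
en = io.StringIO("2 3\nhouse 1.0 0.0 0.1\ncar 0.0 1.0 0.0\n")
fi = io.StringIO("1 3\ntalo 0.9 0.1 0.1\n")
en_words, en_vecs = load_vec(en)
fi_words, fi_vecs = load_vec(fi)

# Cross-lingual nearest neighbour of "talo" by cosine similarity.
q = fi_vecs[0] / np.linalg.norm(fi_vecs[0])
sims = (en_vecs / np.linalg.norm(en_vecs, axis=1, keepdims=True)) @ q
print(en_words[int(np.argmax(sims))])  # "talo" maps to "house"
```

Because source-language words land near their English counterparts, English entity embeddings trained against this space remain usable for mentions in the other languages.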
Following the idea described in [6], we collected word-entity co-occurrence counts (w, e) for each word w and entity e from Wikipedia. These counts define a practical approximation of the word-entity conditional distribution. These words are considered to be the "positive" distribution of entity-related words. Then, a sample of words is selected randomly to create a "negative" distribution of words that are unrelated to the entity e. The objective is to move positive word vectors closer to the embedding of the entity e and to move the vectors of random words further away from the embedding of the entity e. Therefore, we generate the entity embeddings using the English version of Wikipedia and train the GH system on the AIDA data set using the MUSE embeddings. In this scenario, GH's approach analyses English documents and links their mentions to an English KB.
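The positive/negative objective above can be sketched as a simple hinge-style update: the entity vector is pulled toward co-occurring ("positive") words and pushed away from randomly sampled ("negative") words. This is a minimal toy version of the idea, not the exact objective or hyperparameters of [6]; the data and margin below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def train_entity_embedding(pos_vecs, all_vecs, dim=5, steps=200, lr=0.1, margin=0.1):
    """Toy entity-embedding training: pull the entity vector z toward words
    co-occurring with the entity and push it away from random words."""
    z = rng.normal(scale=0.1, size=dim)
    for _ in range(steps):
        w = pos_vecs[rng.integers(len(pos_vecs))]      # positive word vector
        v = all_vecs[rng.integers(len(all_vecs))]      # random negative word vector
        if margin - z @ (w - v) > 0:                   # hinge is still active
            z += lr * (w - v)                          # move toward w, away from v
        z /= np.linalg.norm(z)                         # keep z on the unit sphere
    return z

words = rng.normal(size=(50, 5))                       # toy word-embedding matrix
positives = words[:5] + 2.0                            # positives share a direction
z = train_entity_embedding(positives, words)
```

After training, z should score the entity's related words higher than unrelated ones, which is exactly what the attention and similarity computations of the NEL model rely on.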
Moreover, we extend the training process for some low-resourced languages by using the previous English model and continuing the training process with data in other languages. This tuning procedure optimises our model to better analyse the documents in low-resourced languages and to link their mentions to an English KB. More precisely, we initialized the weights of the neural network model with the weights of the English model, and reduced the learning rate to tune our model for the target languages. This process enables our model to adapt the analysis of words and their context for each language (e.g. the order of words and how they are combined to express the same idea in different languages).
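The fine-tuning recipe is the standard one: copy the English weights and continue training with a reduced learning rate. The sketch below shows the pattern on a stand-in logistic model rather than the actual NEL network; the data, dimensions, and learning rates are hypothetical.

```python
import numpy as np

def sgd_epoch(w, X, y, lr):
    """One epoch of SGD on a logistic loss (stand-in for the NEL model's training)."""
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))
        w -= lr * (p - yi) * xi
    return w

rng = np.random.default_rng(1)
X_en, y_en = rng.normal(size=(200, 4)), rng.integers(0, 2, 200)  # "English" data
X_fi, y_fi = rng.normal(size=(40, 4)), rng.integers(0, 2, 40)    # target-language data

# 1) Train the English model from scratch.
w_en = np.zeros(4)
for _ in range(5):
    w_en = sgd_epoch(w_en, X_en, y_en, lr=0.1)

# 2) Fine-tune for the target language: start from the English weights
#    and continue training with a reduced learning rate.
w_fi = w_en.copy()
for _ in range(5):
    w_fi = sgd_epoch(w_fi, X_fi, y_fi, lr=0.01)
```

The lower learning rate keeps the fine-tuned model close to the English initialization while still adapting to target-language word order and context patterns.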

EXPERIMENTAL SETUP
In order to analyse the impact of using multilingual embeddings on the representation of entity embeddings, we used the entity relatedness data set of Ceccarelli et al. [1] to compare the quality of the entity embeddings produced with the Word2Vec and the multilingual embeddings. This data set contains 3,319 and 3,673 queries for the test and validation sets, respectively. Each query consists of one target entity and up to 100 candidate entities with gold standard binary labels indicating whether the two entities are related or not. The associated task is to rank related candidate entities higher than unrelated ones. Following GH's work, we used the normalised discounted cumulative gain (NDCG) and mean average precision (MAP) measures to evaluate the rankings. Candidate ranking is based on the cosine similarity of entity pairs.
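The evaluation just described can be made concrete with a small sketch: rank candidates by cosine similarity to the target entity, then score the induced ordering of gold binary labels with NDCG and average precision. The vectors and labels below are toy values; the metric definitions are the standard ones.

```python
import numpy as np

def rank_by_cosine(target, candidates):
    """Indices of candidate entities, most similar (cosine) to the target first."""
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ t))

def ndcg(labels_ranked, k):
    """NDCG@k for binary relevance labels in ranked order."""
    discounts = np.log2(np.arange(2, k + 2))
    gains = (labels_ranked[:k] / discounts).sum()
    ideal = (np.sort(labels_ranked)[::-1][:k] / discounts).sum()
    return gains / ideal if ideal > 0 else 0.0

def average_precision(labels_ranked):
    """Average precision for binary relevance labels in ranked order."""
    hits, total, ap = 0, labels_ranked.sum(), 0.0
    for i, rel in enumerate(labels_ranked, 1):
        if rel:
            hits += 1
            ap += hits / i
    return ap / total if total else 0.0

labels = np.array([0, 1, 0, 1])          # gold binary relatedness labels
order = rank_by_cosine(np.array([1.0, 0.0]),
                       np.array([[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.9, 0.0]]))
ranked = labels[order]                   # labels reordered by the ranking
```

MAP is then the mean of the per-query average precisions over all queries in the test set.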
For training, we used the AIDA data set to train the NEL system for English using the MUSE embeddings. Then, we used the WikiANN training data set to optimise the English model for each low-resourced language (cross-validation technique). Finally, we tested our model on the WikiANN test data sets.
Following previous works, we evaluated the performance of our approach by analyzing the F1-measure. Since knowledge bases contain millions of entities, only mentions that contain a valid ground-truth entry in the KB were analysed. For mentions without corresponding entries in the KB, NEL systems have to provide a NIL entry to indicate that these mentions do not have a ground-truth entity in the KB.
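A minimal sketch of this F1 computation, with NIL handling, is shown below. The mention and entity identifiers are hypothetical; the scoring convention (a mention counts as correct only when the predicted KB entry matches the gold entry, and NIL mentions are excluded from the linked sets) follows the evaluation described above.

```python
def nel_f1(predictions, gold):
    """Micro F1 for NEL: precision over predicted links, recall over gold links.
    'NIL' marks mentions without a corresponding entry in the KB."""
    pred_links = {m: e for m, e in predictions.items() if e != "NIL"}
    gold_links = {m: e for m, e in gold.items() if e != "NIL"}
    correct = sum(1 for m, e in pred_links.items() if gold_links.get(m) == e)
    precision = correct / len(pred_links) if pred_links else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    return 2 * precision * recall / (precision + recall) if correct else 0.0

# Toy example: m1 linked correctly, m2 correctly left as NIL, m3 mislinked.
preds = {"m1": "Q_United_States", "m2": "NIL", "m3": "Q_Paris"}
gold  = {"m1": "Q_United_States", "m2": "NIL", "m3": "Q_Paris_Texas"}
f1 = nel_f1(preds, gold)
```

Here the mislinked mention m3 costs both precision and recall, giving an F1 of 0.5 on this toy example.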

EXPERIMENTAL ASSESSMENT
Entity embeddings performance: Table 1 shows the entity relatedness results using Word2Vec and MUSE embeddings for the English data set [1]. Both embeddings have the same dimensionality (300 dimensions) but different vocabulary sizes: Word2Vec (3 million tokens) and MUSE (200,000 tokens). This large difference helps Word2Vec achieve the best results for all entity relatedness measures. More precisely, the Word2Vec embeddings provide a better analysis of the Wikipedia documents because they have fewer out-of-vocabulary words than the MUSE embeddings and can better represent the meaning of sentences and entities. Despite this performance drop, GH's approach using MUSE embeddings achieved better results than [19] and [13] for all metrics.

Table 1 (excerpt): entity relatedness results for the baselines.
Method                  NDCG@1  NDCG@5  NDCG@10  MAP
[19]                    0.59    0.56    0.59     0.52
Milne and Witten [13]   0.54    0.52    0.55     0.48

NEL analysis for mono- and multilingual embeddings: Advancing our analysis of GH's system, we compared the F1-measure results for this system on English corpora using the Word2Vec and MUSE embeddings (Table 2). As expected, the small vocabulary and the lower performance on the entity relatedness measures reduced the performance of the GH system on the NEL task. These factors reduced the quality of the attention and the context embeddings, and prioritised the relevance of the entity priors (log p(e|m)) to disambiguate the mentions. Despite this drop, GH's system using MUSE achieved identical or very close performance on most data sets. Table 3 presents the F1-measure results for NEL on four languages of the WikiANN corpora. The tuning process on the WikiANN data set improved the performance of GH on the WikiANN test data sets; however, these improvements were not significant. Unfortunately, the WikiANN data set is composed of short sentences with little contextual information.
This characteristic makes the context analysis of GH's system less relevant and implies that the disambiguation process mainly consists of pairwise matching between mentions and entities. Another limiting factor is the small MUSE vocabulary. Indeed, a large number of out-of-vocabulary words can degrade the analysis of documents and, consequently, reduce the performance of NEL systems.

XEL is a fundamental tool for search engines in digital libraries to retrieve documents whose contents (including named entities) are written in different languages and contexts. In this paper, we showed that Ganea and Hofmann's system using multilingual embeddings achieved satisfactory results for the English NEL task (maximal F1-measure drop of 5.6%). Additionally, the tuning procedure improved the results for XEL in the low-resourced languages.

CONCLUSION
This paper is a first step in analyzing the impact of multilingual embeddings for extending monolingual NEL to XEL. The next step is to investigate the impact of degraded documents on the XEL task.
Despite the small multilingual vocabulary of the word embeddings and the poor context quality of the training data sets for low-resourced languages, our experiments showed a worst-case drop of 5.6% in F1-measure on the English test data set (and the same performance as monolingual embeddings in the best case) and a small improvement with the tuning procedure on low-resourced languages. Therefore, we intend to build training data sets in the target languages that are composed of long sentences with rich contextual information to improve our XEL model.
Further work is in progress to develop and analyze the performance of end-to-end XEL systems on OCRed data sets. More precisely, we want to extend the analysis of multilingual embeddings with language-agnostic features and relations between entities to provide correct predictions in different languages and to overcome the problem of OCR degradation. We also intend to use larger multilingual word embeddings [2] and to test the performance of these NEL systems on real data in other languages, including other low-resourced languages.