Token-level Multilingual Epidemic Dataset for Event Extraction

Abstract. In this paper, we present a dataset and a baseline evaluation for multilingual epidemic event extraction. We experiment with a multilingual news dataset that we annotate at the token level, a common tagging scheme utilized in event extraction systems. We approach the task of extracting epidemic events by first detecting the relevant documents in a large collection of news reports. Then, event extraction (disease names and locations) is performed on the detected relevant documents. Preliminary experiments with the entire dataset and with ground-truth relevant documents showed promising results, while also establishing a stronger baseline for epidemiological event extraction.


Introduction
While disease surveillance has long been a critical component of epidemiology, conventional surveillance methods are limited in both promptness and coverage, while at the same time requiring labor-intensive human input. Recently, approaches have been advanced that complement traditional surveillance with data-driven methods relying on internet-based data sources such as online news articles [1,3]. With the progress in natural language processing (NLP), processing and analyzing news data for epidemic surveillance has become feasible. Although this line of research is promising, the scarcity of annotated multilingual corpora for data-driven epidemic surveillance is a major hindrance.
Online news data contains critical information about emerging health threats, such as what happened, where and when it happened, and to whom it happened [11]. When processed into a structured and more meaningful form, this information can foster early detection of disease outbreaks, a critical aspect of epidemic surveillance. News reports on epidemics often originate from different parts of the world, and events are likely to be reported in languages other than English. Hence, efficient multilingual approaches are necessary for effective epidemic surveillance [2].

This work has been supported by the European Union's Horizon 2020 research and innovation program under grants 770299 (NewsEye) and 825153 (Embeddia). It has also been supported by the French Embassy in Kenya and the French Foreign Ministry.
Several works have tackled the detection of events related to epidemic diseases. For example, Data Analysis for Information Extraction in any Language (DAnIEL) was proposed as a multilingual dataset and news surveillance system that leverages repetition and saliency (salient zones in the structure of a news article), properties common in news writing [9]. Models based on neural network architectures that take advantage of word embedding representations have been used to monitor social media content for health events [8]. Other methods were based on long short-term memory networks (LSTMs) [12] and approached epidemic detection as a document classification task (in this case, over tweets) to extract influenza-related information.
In this study, we formulate the problem of extracting disease names and locations from text as a sequence labeling task. We use the DAnIEL multilingual dataset (Chinese, English, French, Greek, Polish, and Russian), comprising news articles from the medical domain in languages with diverse morphological structures. We establish a baseline performance using a specialized baseline system and experiment with the most recent neural sequence labeling architectures.

Dataset
Due to the lack of dedicated datasets for epidemic event extraction from multilingual news articles, we adapt a freely available epidemiological dataset, called DAnIEL [9]. The dataset consists of news articles in six different languages, namely French, Polish, English, Chinese, Greek, and Russian. In this dataset, an epidemiological event is represented by a disease name and the location of the reported event.
However, the DAnIEL dataset is annotated at the document level, which differentiates it from the typical datasets (with token- or word-level annotations) utilized in event extraction research (e.g., ACE 2005, TAC KBP 2014-2015). A document either reports an event of interest (a disease-place pair appears in a relevant document) or not (an irrelevant document).
An example from a relevant document is the following sentence: Ten tuberculosis patients in India described as having an untreatable form of the lung disease may be quarantined to thwart possible spread, a health official said [...].
In this case, the document is annotated with Tuberculosis as the disease name, and India as the location.
We begin by performing sentence segmentation to obtain the individual sentences of the text corpus. The data is then annotated using the Doccano annotation tool, a collaborative tool that provides annotation features for various tasks, among them sequence labelling. The annotation guidelines required the annotators to identify and mark the spans of the key entities in the text. The occurrence of an epidemic event is characterized by mentions of a disease name and the location of the disease outbreak, labeled DIS and LOC, respectively. Three native-speaker annotators were recruited for each language.
The annotations were then transformed into the IOB (Inside, Outside, Beginning) tagging scheme. Table 1 presents the statistics of the dataset, from which we can observe its particularities and challenges. The DAnIEL dataset is not only multilingual but also imbalanced, with the low-resource languages (Chinese, Greek, and Russian) underrepresented.
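As a minimal sketch of this conversion step (whitespace tokenization and the character offsets below are illustrative simplifications, not the released annotation pipeline; a real setup would use language-specific tokenizers, especially for Chinese), character-level entity spans can be mapped to token-level IOB tags as follows:

```python
def spans_to_iob(text, spans):
    """Convert character-level (start, end, label) spans to token-level IOB tags.

    Tokenization is naive whitespace splitting, kept simple for illustration.
    """
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)          # character offset of this token
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (ts, te) in enumerate(offsets):
            if ts < end and te > start:       # token overlaps the entity span
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tokens, tags

# Shortened version of the example sentence above, with hypothetical offsets.
sentence = "Ten tuberculosis patients in India may be quarantined"
spans = [(4, 16, "DIS"), (29, 34, "LOC")]     # "tuberculosis", "India"
tokens, tags = spans_to_iob(sentence, spans)
```

Here the DIS span yields a `B-DIS` tag on "tuberculosis" and the LOC span a `B-LOC` tag on "India", with all other tokens tagged `O`; multi-token entities would receive a `B-` tag on the first token and `I-` tags on the rest.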

Experiments and Results
We first consider the specialized event extraction system, DAnIEL [9], which we regard as a strong baseline. Then, we experiment with deep learning models based on a bidirectional LSTM (BiLSTM) [7,10] that use character and word representations. Additionally, given the multilingual character of the dataset, we use the multilingual BERT pre-trained language models [6] for token-level sequence classification and fine-tune them on our dataset; we refer to these models as BERT-multilingual-cased and BERT-multilingual-uncased. We also experiment with the XLM-RoBERTa-base model [5], which has shown significant performance gains on a wide range of cross-lingual transfer tasks. We consider this model appropriate for our task and dataset due to the multilingual nature of the data.
As shown in Table 2, BERT-multilingual-uncased recorded the highest F1, recall, and precision scores (80.99%, 79.77%, and 82.25%, respectively) on the dataset comprising both relevant and irrelevant examples. We also observe in Table 2 that all the models significantly outperform the DAnIEL baseline. When evaluating on the ground-truth relevant examples only, the task is unsurprisingly easier, particularly in terms of precision. Overall, XLM-RoBERTa-base attained the best F1 score of 89.04%; the model with the best recall was BERT-multilingual-cased (90.95%), while the BiLSTM+LSTM model had the highest precision.
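For sequence labeling, precision, recall, and F1 are conventionally computed at the entity level: a prediction counts as correct only if its boundaries and label both match a gold entity exactly. A minimal sketch of such an evaluation over IOB tag sequences (the function names are ours, not the paper's evaluation script, and in practice a library such as seqeval would be used) is:

```python
def extract_entities(tags):
    """Return the set of (start, end, label) entities in an IOB tag sequence."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):    # "O" sentinel flushes the last entity
        mismatch = tag.startswith("I-") and tag[2:] != label
        if tag.startswith("B-") or tag == "O" or mismatch:
            if start is not None:             # close the entity that just ended
                entities.add((start, i, label))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, label = i, tag[2:]         # tolerate I- without a preceding B-
    return entities

def prf1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for two IOB tag sequences."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)                     # exact boundary and label matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this scheme, a prediction that tags only part of a multi-token entity, or tags it with the wrong label, hurts both precision and recall, which is why entity-level scores are stricter than per-token accuracy.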

Conclusions
In this paper, we presented a token-level dataset and a strong baseline evaluation for multilingual epidemic event extraction. The results of the preliminary experiments suggest that approaches based on pre-trained language models perform better than the other deep learning models, and thus they can be utilized as strong baselines for epidemic event extraction. As future work, a deeper investigation of these preliminary results could reveal the underlying reasons for the differences in performance; hence, further work will focus on a more fine-grained analysis of the methods. Moreover, we propose to further examine the classification of relevant and irrelevant documents, in order to ascertain the level of error propagation from the document classification step.