Multilingual Epidemiological Text Classification: A Comparative Study

doi:10.5281/zenodo.4476039

Published January 28, 2021 | Version v1

Conference paper Open


1. Multimedia University, Kenya
2. University of La Rochelle, L3i
3. Sorbonne University, France
4. Kyoto University, Japan

In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across languages (low- or high-resourced), particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of different machine learning and deep learning text classification models using a dataset of news articles related to epidemic outbreaks in six languages, four low-resourced and two high-resourced, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of the models based on fine-tuned language models exceeds that of the chosen baseline models, which include a specialized epidemiological news surveillance system and several machine learning models, by more than 50%. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained and/or fine-tuned but also by their size. Furthermore, we discover that the beginning and the end of documents provide the most salient features for this task and, as expected, the performance of the models was proportionate to the training data size.
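The finding that the beginning and the end of documents carry the most salient features suggests one simple way to fit long articles into a length-limited classifier input: keep the head and the tail and drop the middle. The abstract does not say how the authors selected document segments, so the following is only a minimal sketch under that assumption; the function name `head_tail_truncate` and the parameters `max_len` and `head_ratio` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def head_tail_truncate(
    tokens: Sequence[str], max_len: int, head_ratio: float = 0.5
) -> List[str]:
    """Keep the first and last tokens of a document, dropping the middle.

    Illustrative helper only (an assumption, not the paper's method).
    `head_ratio` is the share of the `max_len` budget given to the head.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0.0 <= head_ratio <= 1.0:
        raise ValueError("head_ratio must be between 0 and 1")

    # Short documents already fit and are returned unchanged.
    if len(tokens) <= max_len:
        return list(tokens)

    head_len = int(max_len * head_ratio)
    tail_len = max_len - head_len

    head = list(tokens[:head_len])
    # Guard against tail_len == 0, where tokens[-0:] would return everything.
    tail = list(tokens[len(tokens) - tail_len:]) if tail_len > 0 else []
    return head + tail


if __name__ == "__main__":
    article = "outbreak reported in the region with details and final case counts".split()
    print(head_tail_truncate(article, max_len=6))
```

Setting `head_ratio` to 1.0 or 0.0 keeps only the head or only the tail, which would make it possible to compare the contribution of each document region separately.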

Files (172.1 kB)

Name	Size
coling_2020_multilingual_epidemiological_text_classification__a_comparative_study.pdf (md5:8504a4e761204effd4aeb13e2048a3e8)	172.1 kB

Additional details

Funding: European Commission, NewsEye: A Digital Investigator for Historical Newspapers (grant 770299)

