MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
--------------------------------------------------------------------------------
We introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and the inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications for digital humanities projects that perform multiple tasks on data encountered on the web and in digital archives. The second version of MLM provides a geo-related subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource for digital humanities applications with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-related versions of MLM demonstrates the challenges of generalizing on diverse data.
--------------------------------------------------------------------------------
Introduction:
--------------------------------------------------------------------------------
The Multiple Languages and Modalities (MLM) dataset comprises data points on 236k human settlements for evaluating and optimising multitask learning systems.
MLM offers a high level of diversity in terms of modality and language. For each entity, we have extracted text summaries, images, coordinates, and their respective triple classes. Text summaries are available in three languages (English, French, and German), with each entity having between one and three language entries. Human settlements from all 7 continents are provided in the full dataset, with 72% located in Europe. To serve organisations that focus on the European Union, we have created a second version of the dataset - MLM-GR (MLM Geo-Related) - that provides geographically balanced coverage of human settlements in this region. MLM-GR contains data on 24k human settlements across the EU, weighted in relation to the population count for each of the 28 countries.
--------------------------------------------------------------------------------
A sample example:
--------------------------------------------------------------------------------
{
  "id": 2010591,
  "label": "Oberottersbach",
  "coordinates": [ "50.7986", "7.47556" ],
  "summaries": [
    { "de": "Oberottersbach ist ein Ortsteil der Gemeinde Eitorf. Er liegt direkt oberhalb von Mittelottersbach im Ottersbachtal." }
  ],
  "images": [ "Q2010591_0.jpg" ],
  "classes": [ "Ortsteil" ]
}
--------------------------------------------------------------------------------
The dataset contains the following fields:
--------------------------------------------------------------------------------
#   field-label   description
--------------------------------------------------------------------------------
1.  id            a unique identifier
2.  label         textual label
3.  coordinates   latitude, longitude geo-location values (in that order, as in the sample above)
4.  summaries     list of textual summaries describing the entity in one or more of the languages (EN, DE, FR)
5.  images        list of images related to the entity
6.  classes       list of associated triple classes
--------------------------------------------------------------------------------
MLM - Dataset Details
-------------------------------------------------------------
Num. of      MLM       MLM-irle   MLM-irle-gr
-------------------------------------------------------------
Entities     236496    218681     22501
Images       412422    314533     31621
Summaries    497899    462328     47508
Classes      1685      1655       452
-------------------------------------------------------------
Availability:
--------------------------------------------------------------------------------
The versions of MLM listed in the table above are available for direct download and use. To support findability and sustainability, the MLM dataset is published as an on-line resource at https://zenodo.org/record/3862932. A separate page with detailed explanations and illustrations is available at http://cleopatra.ijs.si/goal-mlm/ to promote ease of use. The project GitHub repository, which contains the complete source code for the system and the generation script, is available at https://github.com/GOALCLEOPATRA/MLM. Documentation adheres to the FAIR Data principles, with all relevant metadata specified for the research community and users. The dataset is freely accessible under the Creative Commons Attribution 4.0 International license, which makes it reusable for almost any purpose.
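
Once a version of the dataset has been downloaded, a record in the format shown in the sample can be flattened into a form that is easier to work with. The following is a minimal sketch, not part of the MLM source code: the helper name parse_record is hypothetical, the record is assumed to be a JSON object shaped like the sample, and the coordinate order (latitude first) is inferred from the sample values for a locality in Germany rather than confirmed by the repository.

```python
import json

# Sample record copied from the "A sample example" section of this README.
SAMPLE = """
{
  "id": 2010591,
  "label": "Oberottersbach",
  "coordinates": ["50.7986", "7.47556"],
  "summaries": [
    {"de": "Oberottersbach ist ein Ortsteil der Gemeinde Eitorf. Er liegt direkt oberhalb von Mittelottersbach im Ottersbachtal."}
  ],
  "images": ["Q2010591_0.jpg"],
  "classes": ["Ortsteil"]
}
"""


def parse_record(record):
    """Flatten one MLM-style record (hypothetical helper for illustration)."""
    # Coordinates are stored as strings; latitude-first order is an assumption
    # based on the sample values.
    lat, lon = (float(value) for value in record["coordinates"])
    # Each entry in "summaries" maps a language code to a text summary;
    # merge the entries into a single {language: text} dictionary.
    summaries = {}
    for entry in record["summaries"]:
        summaries.update(entry)
    return {
        "id": record["id"],
        "label": record["label"],
        "lat": lat,
        "lon": lon,
        "summaries": summaries,
        "images": list(record["images"]),
        "classes": list(record["classes"]),
    }


parsed = parse_record(json.loads(SAMPLE))
print(parsed["label"], parsed["lat"], parsed["lon"], sorted(parsed["summaries"]))
```

Merging the summaries into one dictionary keyed by language code makes it straightforward to check which of the three languages are present for an entity, since each entity has between one and three language entries.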
--------------------------------------------------------------------------------
Updating and Reusability:
--------------------------------------------------------------------------------
MLM is supported by a team of researchers from the University of Bonn, the Leibniz Information Center for Science and Technology, and the Jožef Stefan Institute. The resource is already in use for individual projects and as a contribution to the project deliverables of the Marie Skłodowska-Curie CLEOPATRA Innovative Training Network. In addition to the steps above that make the resource available to the wider community, usage of MLM will be promoted to the network of researchers in this project. Use among researchers and practitioners in digital humanities will be promoted through demonstrations and presentations at domain-related events. Activities are planned for the Digital Methods Summer School run by the University of Amsterdam. The range of modalities and languages present in the dataset also extends its application to research on multimodal representation learning, multilingual machine learning, information retrieval, location estimation, and the Semantic Web. MLM will be supported and maintained for three years in the first instance. A second release of the dataset is already scheduled, and the generation process outlined above is designed to enable rapid scaling.
--------------------------------------------------------------------------------