Published June 8, 2020 | Version 1.0.0
Dataset Open

MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

  • 1. University of Bonn, Germany
  • 2. TIB – Leibniz Information Center for Science and Technology, Germany
  • 3. Jožef Stefan Institute, Slovenia

Description

Abstract:

We introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. The second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalizing on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.

Introduction:
Multiple Languages and Modalities comprises data points on 236k human settlements for evaluating and optimizing multitask learning systems. MLM presents a dataset with a high level of diversity in terms of modality and language. For each entity, we have extracted text summaries, images, coordinates, and their respective triple classes. Text summaries are available in three languages (English, French, and German) with each entity having between one and three language entries. 

Human settlements from all continents are provided in the overall dataset (MLM), with 72% located in Europe. Two further versions of the dataset - MLM-irle and MLM-irle-gr - were generated for use in the benchmark evaluation for multitask systems described in the paper (see above). MLM-irle-gr (i.e., geo-representative) was generated to serve organizations that focus on the European Union by providing geographically balanced coverage of human settlements in this region. MLM-irle-gr contains data on 24k human settlements across the EU, weighted in relation to the population count for each of the 28 countries.
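The population weighting described above can be illustrated with a small sketch. This is not the generation script from the project repository; it only shows one common way to split a fixed sample budget across countries in proportion to population (the largest-remainder method). The function name `allocate_by_population` and the toy population figures are assumptions made for illustration.

```python
from typing import Dict


def allocate_by_population(populations: Dict[str, int], total: int) -> Dict[str, int]:
    """Split `total` samples across countries in proportion to population.

    Uses the largest-remainder method so the quotas always sum to `total`.
    Illustrative only; the MLM generation process may differ.
    """
    pop_sum = sum(populations.values())
    if pop_sum <= 0 or total < 0:
        raise ValueError("populations must sum to a positive number and total must be >= 0")

    # Integer floor of each country's exact proportional share.
    quotas = {c: (total * p) // pop_sum for c, p in populations.items()}
    # Fractional part of each share, kept as an integer remainder.
    remainders = {c: (total * p) % pop_sum for c, p in populations.items()}

    # Hand out the samples lost to rounding, largest remainder first.
    leftover = total - sum(quotas.values())
    for country in sorted(populations, key=lambda c: remainders[c], reverse=True)[:leftover]:
        quotas[country] += 1
    return quotas


# Toy figures (hypothetical, not real population counts).
toy_populations = {"A": 50, "B": 30, "C": 20}
toy_quotas = allocate_by_population(toy_populations, total=10)
print(toy_quotas)
```

Integer arithmetic is used for the shares so that rounding never causes the quotas to drift away from the requested total.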

MLM contains the following fields:

----------------------------------------------------------------------
#	field-label	description	
----------------------------------------------------------------------
1.	id		a unique identifier
2.	label		textual label
3.	coordinates	longitude, latitude geo-location value
4.	summaries	list of textual summaries related to the entity
5.	images		list of images related to the entity
6.	classes		list of associated triple classes
----------------------------------------------------------------------
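As a minimal sketch of how one entity with the six fields above might be held in memory, the following Python dataclass mirrors the field labels from the table. The class name `MLMEntity`, the language codes, the choice of a dictionary keyed by language for `summaries`, and the example values are all assumptions; the on-disk format of the released files is not described here and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed language codes; the description only names English, French, and German.
LANGUAGES = ("en", "fr", "de")


@dataclass
class MLMEntity:
    """Hypothetical in-memory view of one MLM entity (field labels from the table)."""

    id: str
    label: str
    coordinates: Tuple[float, float]  # (longitude, latitude), in the order the table gives
    summaries: Dict[str, str] = field(default_factory=dict)  # assumed: language code -> text
    images: List[str] = field(default_factory=list)
    classes: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Check the constraints stated in the description; raise ValueError otherwise."""
        lon, lat = self.coordinates
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            raise ValueError(f"coordinates out of range: {self.coordinates}")
        # Each entity has between one and three language entries.
        if not 1 <= len(self.summaries) <= len(LANGUAGES):
            raise ValueError("expected between one and three summaries")
        unknown = set(self.summaries) - set(LANGUAGES)
        if unknown:
            raise ValueError(f"unexpected language codes: {sorted(unknown)}")


# Example with made-up values.
example = MLMEntity(
    id="example-1",
    label="Berlin",
    coordinates=(13.4, 52.5),
    summaries={"en": "Berlin is the capital of Germany.", "de": "Berlin ist die Hauptstadt Deutschlands."},
)
example.validate()
```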

MLM - Details by Dataset Version:

-----------------------------------------------------------
Num. of              MLM       MLM-irle   MLM-irle-gr
-----------------------------------------------------------
Entities             236496    218681     22501
Images               412422    314533     31621
Summaries            497899    462328     47508
Triple classes       1685      1655       452
-----------------------------------------------------------
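For a quick sense of density per entity, the counts in the table can be turned into per-entity averages. The sketch below only restates the table's figures and divides them; the variable names are illustrative.

```python
# Counts copied from the table above.
COUNTS = {
    "MLM": {"entities": 236496, "images": 412422, "summaries": 497899},
    "MLM-irle": {"entities": 218681, "images": 314533, "summaries": 462328},
    "MLM-irle-gr": {"entities": 22501, "images": 31621, "summaries": 47508},
}

# Average number of images and summaries per entity for each version.
per_entity = {
    version: {
        "images_per_entity": c["images"] / c["entities"],
        "summaries_per_entity": c["summaries"] / c["entities"],
    }
    for version, c in COUNTS.items()
}

for version, ratios in per_entity.items():
    print(
        f"{version}: {ratios['images_per_entity']:.2f} images, "
        f"{ratios['summaries_per_entity']:.2f} summaries per entity"
    )
```

The average of roughly two summaries per entity in every version is consistent with each entity having between one and three language entries.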

Availability:

All three versions of MLM listed in the table directly above are available for direct download and use. To support findability and sustainability, the MLM dataset is published as an on-line resource at https://doi.org/10.5281/zenodo.3885753. A separate page with detailed explanations and illustrations is available at http://cleopatra.ijs.si/goal-mlm/ to promote ease-of-use. The project GitHub repository at https://github.com/GOALCLEOPATRA/MLM contains the complete source code for the system, including the generation script. Documentation adheres to the FAIR Data principles, with all relevant metadata specified for the research community and users. The dataset is freely accessible under the Creative Commons Attribution 4.0 International license, which makes it reusable for almost any purpose.

Updating and Reusability:
MLM is supported by a team of researchers from the University of Bonn, the Leibniz Information Center for Science and Technology, and Jožef Stefan Institute. The resource is already in use for individual projects and as a contribution to the project deliverables of the Marie Skłodowska-Curie CLEOPATRA Innovative Training Network. In addition to the steps above that make the resource available to the wider community, the usage of MLM will be promoted to the network of researchers in this project. Use among researchers and practitioners in digital humanities will be promoted by demonstrations and presentations at domain-related events. Activities are planned for the Digital Methods Summer School run by the University of Amsterdam. The range of modalities and languages present in the dataset also extends its application to research on multimodal representation learning, multilingual machine learning, information retrieval, location estimation, and the Semantic Web. MLM will be supported and maintained for three years in the first instance. A second release of the dataset is already scheduled, and the generation process outlined above is designed to enable rapid scaling.

Files

data-description.txt

Files (163.9 MB)

-----------------------------------------------------------
md5                                  Size
-----------------------------------------------------------
9bb879b19722ecdaefcb8fb45797eda0     7.7 kB
621e2fbabf34da19131f1ea19fec6635     1.9 kB
4d93bb491ae59ad7ec28d590045966ee     3.5 kB
32d38cb974226bccc92ed036061dbfb2     77.0 MB
4f8d504c247bc97576997c48e70c3dc5     79.1 MB
91f4aced4c8d7bb7d5699d94f784bdbb     7.9 MB
d377cf05eb19d29628496c74e561d856     439 Bytes
-----------------------------------------------------------
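
A downloaded file can be checked against the md5 values listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the local file name in the usage comment is hypothetical because the listing above does not pair checksums with file names.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's md5 with an expected value such as 'md5:9bb8...' or '9bb8...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local file name):
# matches_checksum("downloaded-file", "md5:9bb879b19722ecdaefcb8fb45797eda0")
```

A mismatch usually means the download was truncated or corrupted, in which case the file should be fetched again.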