Dataset Open Access

MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

Armitage, Jason; Kacupaj, Endri; Tahmasebzadeh, Golsa; Swati

DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="" xmlns="" xsi:schemaLocation="">
  <identifier identifierType="DOI">10.5281/zenodo.3885753</identifier>
      <creatorName>Armitage, Jason</creatorName>
      <affiliation>University of Bonn, Germany</affiliation>
      <creatorName>Kacupaj, Endri</creatorName>
      <affiliation>University of Bonn, Germany</affiliation>
      <creatorName>Tahmasebzadeh, Golsa</creatorName>
      <affiliation>TIB – Leibniz Information Center for Science and Technology, Germany</affiliation>
      <creatorName>Swati</creatorName>
      <affiliation>Jožef Stefan Institute, Slovenia</affiliation>
    <title>MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities</title>
    <subject>Machine Learning</subject>
    <subject>Multitask learning</subject>
    <subject>Multimodal data</subject>
    <subject>Multilingual data</subject>
    <date dateType="Issued">2020-06-08</date>
  <resourceType resourceTypeGeneral="Dataset"/>
    <alternateIdentifier alternateIdentifierType="url"></alternateIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3885752</relatedIdentifier>
  <version>version 1.0.0</version>
    <rights rightsURI="">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
    <description descriptionType="Abstract">&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We introduce the &lt;strong&gt;MLM (Multiple Languages and Modalities)&lt;/strong&gt; dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. The second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalizing on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Multiple Languages and Modalities comprises data points on 236k human settlements for evaluating and optimizing multitask learning systems. MLM presents a dataset with a high level of diversity in terms of modality and language. For each entity, we have extracted text summaries, images, coordinates, and their respective triple classes. Text summaries are available in three languages (English, French, and German) with each entity having between one and three language entries.&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Human settlements from all continents are provided in the overall dataset (MLM) with 72% located in Europe. Two further versions of the dataset - MLM-irle and MLM-irle-gr - were generated for use in the benchmark evaluation for multitask systems described in the paper (see above).&amp;nbsp; MLM-irle-gr (i.e., geo-representative) was generated to serve organizations that focus on the European Union by providing geographically balanced coverage of human settlements in this region. MLM-irle-gr contains data on 24k human settlements across the EU weighted in relation to the population count for each of the 28 countries.&lt;/p&gt;

&lt;p&gt;MLM contains the following fields:&lt;/p&gt;

#   field-label   description
1.  id            a unique identifier
2.  label         textual label
3.  coordinates   longitude, latitude geo-location values
4.  summaries     list of textual summaries related to the entity
5.  images        list of images related to the entity
6.  classes       list of associated triple classes

&lt;p&gt;MLM - Details by Dataset Version:&lt;/p&gt;

Num. of          MLM      MLM-irle   MLM-irle-gr
Entities         236496   218681     22501
Images           412422   314533     31621
Summaries        497899   462328     47508
Triple classes   1685     1655       452


&lt;p&gt;All three versions of MLM listed in the table directly above are available for direct download and use.&amp;nbsp;To support findability and sustainability, the MLM dataset is published as an online resource at &lt;em&gt;&lt;a href=""&gt;&lt;/a&gt;&lt;/em&gt;. &amp;nbsp;A separate page with detailed explanations and illustrations is available at &lt;em&gt;&lt;a href=""&gt;&lt;/a&gt; &lt;/em&gt;to promote ease-of-use. The project GitHub repository contains the complete source code for the system and the generation script is available at &lt;em&gt;&lt;a href=""&gt;&lt;/a&gt;&lt;/em&gt;. Documentation adheres to the standards of &lt;em&gt;FAIR Data principles&lt;/em&gt; with all relevant metadata specified to the research community and users. The dataset is freely accessible under the Creative Commons Attribution 4.0 International license, which makes it reusable for almost any purpose.&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updating and Reusability:&lt;/strong&gt;&lt;br&gt;
MLM is supported by a team of researchers from the University of Bonn, the Leibniz Information Center for Science and Technology, and Jožef Stefan Institute. The resource is already in use for individual projects and as a contribution to the project deliverables of the Marie Skłodowska-Curie CLEOPATRA Innovative Training Network. In addition to the steps above that make the resource available to the wider community, the usage of MLM will be promoted to the network of researchers in this project. Use among researchers and practitioners in digital humanities will be promoted by demonstrations and presentations at domain-related events. Activities are planned for the Digital Methods Summer School run by the University of Amsterdam. The range of modalities and languages present in the dataset also extends its application to research on multimodal representation learning, multilingual machine learning, information retrieval, location estimation, and the Semantic Web. MLM will be supported and maintained for three years in the first instance. A second release of the dataset is already scheduled and the generation process outlined above is designed to enable rapid scaling.&lt;/p&gt;</description>
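The six fields listed in the description (id, label, coordinates, summaries, images, classes) could be modeled in code roughly as follows. This is a minimal sketch, not the dataset's official loader: the class name `MLMRecord`, the helper `record_from_dict`, the choice of Python types, and the assumption that a record arrives as a JSON-like dictionary are all hypothetical and not taken from the record above. Only the field names and the longitude, latitude ordering come from the field list.

```python
from dataclasses import dataclass, field


@dataclass
class MLMRecord:
    """Hypothetical in-memory view of one MLM entity (field names from the record)."""

    id: str                           # unique identifier (string type is an assumption)
    label: str                        # textual label
    coordinates: tuple[float, float]  # (longitude, latitude), order as in the field list
    summaries: list[str] = field(default_factory=list)  # textual summaries
    images: list[str] = field(default_factory=list)     # image references (assumed to be strings)
    classes: list[str] = field(default_factory=list)    # associated triple classes

    def has_valid_coordinates(self) -> bool:
        """Check that longitude and latitude fall inside their usual ranges."""
        lon, lat = self.coordinates
        return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0


def record_from_dict(raw: dict) -> MLMRecord:
    """Build a record from a JSON-like dict (the input layout is an assumption)."""
    lon, lat = raw["coordinates"]
    return MLMRecord(
        id=str(raw["id"]),
        label=raw["label"],
        coordinates=(float(lon), float(lat)),
        summaries=list(raw.get("summaries", [])),
        images=list(raw.get("images", [])),
        classes=list(raw.get("classes", [])),
    )


if __name__ == "__main__":
    # Illustrative values only, not copied from the dataset.
    example = record_from_dict(
        {"id": "example-id", "label": "Example town", "coordinates": [7.1, 50.7]}
    )
    print(example.has_valid_coordinates())
```

Keeping the coordinate order explicit in one place avoids the common longitude/latitude swap when records are passed to mapping or location-estimation code.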
                   All versions   This version
Views              191            191
Downloads          156            156
Data volume        4.1 GB         4.1 GB
Unique views       141            141
Unique downloads   39             39
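The per-version counts in the "Details by Dataset Version" table of the description can be checked against the statement that each entity has between one and three language entries. The sketch below copies the counts from that table and computes per-entity averages. The names `VERSIONS`, `per_entity`, and `summaries_consistent` are illustrative, and an average inside the range is only a necessary condition, not proof that every individual entity satisfies it.

```python
# Counts copied from the "Details by Dataset Version" table in the description.
VERSIONS = {
    "MLM": {"entities": 236496, "images": 412422, "summaries": 497899, "classes": 1685},
    "MLM-irle": {"entities": 218681, "images": 314533, "summaries": 462328, "classes": 1655},
    "MLM-irle-gr": {"entities": 22501, "images": 31621, "summaries": 47508, "classes": 452},
}


def per_entity(counts: dict, key: str) -> float:
    """Average number of items of the given kind per entity."""
    return counts[key] / counts["entities"]


def summaries_consistent(counts: dict) -> bool:
    """True if the mean summaries per entity lies in the stated range of one to three."""
    ratio = per_entity(counts, "summaries")
    return 1.0 <= ratio <= 3.0


if __name__ == "__main__":
    for name, counts in VERSIONS.items():
        print(
            f"{name}: {per_entity(counts, 'images'):.2f} images/entity, "
            f"{per_entity(counts, 'summaries'):.2f} summaries/entity, "
            f"consistent={summaries_consistent(counts)}"
        )
```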


Cite as