Dataset Open Access

MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

Armitage, Jason; Kacupaj, Endri; Tahmasebzadeh, Golsa; Swati


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3885753", 
  "language": "eng", 
  "title": "MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities", 
  "issued": {
    "date-parts": [
      [
        2020, 
        6, 
        8
      ]
    ]
  }, 
  "abstract": "<p><strong>Abstract:</strong></p>\n\n<p>We introduce the <strong>MLM (Multiple Languages and Modalities)</strong> dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability for multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. The second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrate the challenges of generalizing on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.&nbsp;</p>\n\n<p><strong>Introduction:</strong><br>\nMultiple Languages and Modalities comprises data points on 236k human settlements for evaluating and optimizing multitask learning systems. MLM presents a dataset with a high level of diversity in terms of modality and language. For each entity, we have extracted text summaries, images, coordinates, and their respective triple classes. Text summaries are available in three languages (English, French, and German) with each entity having between one and three language entries.&nbsp;</p>\n\n<p>Human settlements from all continents are provided in the overall dataset (MLM) with 72% located in Europe. 
Two further versions of the dataset - MLM-irle and MLM-irle-gr - were generated for use in the benchmark evaluation for multitask systems described in the paper (see above).&nbsp; MLM-irle-gr (i.e., geo-representative) was generated to serve organizations that focus on the European Union by providing geographically balanced coverage of human settlements in this region. MLM-irle-gr contains data on 24k human settlements across the EU, weighted in relation to the population count for each of the 28 countries.</p>\n\n<p>MLM contains the following fields:</p>\n\n<pre><code>----------------------------------------------------------------------\n#\tfield-label\tdescription\t\n----------------------------------------------------------------------\n1.\tid\t\ta unique identifier\n2.\tlabel\t\ttextual label\n3.\tcoordinates\tlongitude, latitude geo-location value\n4.\tsummaries\tlist of textual summaries related to the entity\n5.\timages\t\tlist of images related to the entity\n6.\tclasses\t\tlist of associated triple classes\n----------------------------------------------------------------------</code></pre>\n\n<p>MLM - Details by Dataset Version:</p>\n\n<pre><code>-----------------------------------------------------------\nNum. of\t\t     MLM   \t  MLM-irle   MLM-irle-gr\n-----------------------------------------------------------\nEntities\t     236496\t  218681     22501\nImages\t\t     412422\t  314533     31621\nSummaries\t     497899\t  462328     47508\nTriple classes       1685\t  1655       452\n-----------------------------------------------------------</code></pre>\n\n<p><strong>Availability:</strong></p>\n\n<p>All three versions of MLM listed in the table directly above are available for direct download and use. To support findability and sustainability, the MLM dataset is published as an online resource at <em><a href=\"https://doi.org/10.5281/zenodo.3885753\">https://doi.org/10.5281/zenodo.3885753</a></em>. 
A separate page with detailed explanations and illustrations is available at <em><a href=\"http://cleopatra.ijs.si/goal-mlm/\">http://cleopatra.ijs.si/goal-mlm/</a></em> to promote ease of use. The project GitHub repository, which contains the complete source code for the system and the generation script, is available at <em><a href=\"https://github.com/GOALCLEOPATRA/MLM\">https://github.com/GOALCLEOPATRA/MLM</a></em>. Documentation adheres to the <em>FAIR Data principles</em>, with all relevant metadata specified for the research community and users. The dataset is freely accessible under the Creative Commons Attribution 4.0 International license, which makes it reusable for almost any purpose.&nbsp;</p>\n\n<p><strong>Updating and Reusability:</strong><br>\nMLM is supported by a team of researchers from the University of Bonn, the Leibniz Information Center for Science and Technology, and the Jo\u017eef Stefan Institute. The resource is already in use for individual projects and as a contribution to the project deliverables of the Marie Sk\u0142odowska-Curie CLEOPATRA Innovative Training Network. In addition to the steps above that make the resource available to the wider community, the use of MLM will be promoted to the network of researchers in this project. Use among researchers and practitioners in digital humanities will be promoted through demonstrations and presentations at domain-related events. Activities are planned for the Digital Methods Summer School run by the University of Amsterdam. The range of modalities and languages present in the dataset also extends its application to research on multimodal representation learning, multilingual machine learning, information retrieval, location estimation, and the Semantic Web. MLM will be supported and maintained for three years in the first instance. A second release of the dataset is already scheduled, and the generation process outlined above is designed to enable rapid scaling.</p>", 
  "author": [
    {
      "family": "Armitage, Jason"
    }, 
    {
      "family": "Kacupaj, Endri"
    }, 
    {
      "family": "Tahmasebzadeh, Golsa"
    }, 
    {
      "family": "Swati"
    }
  ], 
  "version": "version 1.0.0", 
  "type": "dataset", 
  "id": "3885753"
}
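
For readers who want to consume the export above programmatically, the following is a minimal Python sketch of one way to read it. It works on a trimmed copy of the record (the long "abstract" field is left out) and relies on two observations about this particular export: "issued" uses the CSL "date-parts" layout, and each author's full name is stored in "family" as "Family, Given". The helper names issued_date, split_name, and summarize are hypothetical and are not part of Zenodo or any CSL library.

```python
import json
from datetime import date

# Trimmed copy of the record above; the long "abstract" field is omitted.
CSL_RECORD = """
{
  "publisher": "Zenodo",
  "DOI": "10.5281/zenodo.3885753",
  "title": "MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities",
  "issued": {"date-parts": [[2020, 6, 8]]},
  "author": [
    {"family": "Armitage, Jason"},
    {"family": "Kacupaj, Endri"},
    {"family": "Tahmasebzadeh, Golsa"},
    {"family": "Swati"}
  ],
  "version": "version 1.0.0",
  "type": "dataset",
  "id": "3885753"
}
"""


def issued_date(record):
    """Convert the first "date-parts" entry to a date; a missing month or day defaults to 1."""
    parts = record["issued"]["date-parts"][0]
    year, month, day = (list(parts) + [1, 1])[:3]
    return date(year, month, day)


def split_name(author):
    """Return (family, given); handles 'Family, Given' packed into the "family" key."""
    if "given" in author:
        return author["family"], author["given"]
    family, _, given = author["family"].partition(",")
    return family.strip(), given.strip()


def summarize(record):
    """Collect the DOI link, ISO issue date, and split author names."""
    return {
        "doi_url": "https://doi.org/" + record["DOI"],
        "issued": issued_date(record).isoformat(),
        "authors": [split_name(a) for a in record["author"]],
    }


record = json.loads(CSL_RECORD)
summary = summarize(record)
print(summary["doi_url"])
print(summary["issued"])
for family, given in summary["authors"]:
    print(family, given)
```

Splitting on the first comma is a convenience for this record only; single-name authors such as "Swati" come back with an empty given name, and records that already carry a separate "given" key pass through unchanged.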