Published June 30, 2021 | Version 2021.06
Dataset Open

MultiSubs: A Large-scale Multimodal and Multilingual Dataset

  • Imperial College London

Description

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments within the subtitles (visually salient nouns in this release) with web images, and the word sense of each fragment has been disambiguated using a cross-lingual approach.

Please refer to our paper for a more detailed description of the dataset:

Josiah Wang, Pranava Madhyastha, Josiel Figueiredo, Chiraag Lala, Lucia Specia (2021). MultiSubs: A Large-scale Multimodal and Multilingual Dataset. CoRR, abs/2103.01910. Available at: https://arxiv.org/abs/2103.01910

Files

multisubs_data.zip

Files (4.6 GB)

Checksum Size
md5:ba58ac8957322953993c75dcef19965e 2.5 GB
md5:5c1b61390c80f98821130dde179058eb 2.2 GB
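
Because the archives are several gigabytes each, it may be worth checking a download against the MD5 checksums listed here before unpacking it. The following is a minimal sketch using only the Python standard library. The helper names (`md5_of_file`, `matches_listed_checksum`) and the local path in the usage note are assumptions for illustration. The listing does not say which checksum belongs to which file, so the sketch only checks that a file matches one of the two.

```python
import hashlib

# Checksums copied from the file listing on this page. The listing does not
# state which file each one belongs to, so they are kept as an unordered set.
LISTED_MD5 = {
    "ba58ac8957322953993c75dcef19965e",
    "5c1b61390c80f98821130dde179058eb",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Hypothetical helper: True if the file's MD5 is one of the listed values."""
    return md5_of_file(path) in LISTED_MD5
```

For example, `matches_listed_checksum("multisubs_data.zip")` would return `True` for an intact copy of that archive, assuming it was saved under that name in the current directory. A `False` result suggests an incomplete or corrupted download.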

Additional details

Funding

European Commission
MultiMT – Multi-modal Context Modelling for Machine Translation (grant no. 678017)