MultiSubs: A Large-scale Multimodal and Multilingual Dataset

Wang, Josiah

doi:10.5281/zenodo.5034605

Published June 30, 2021 | Version 2021.06

Dataset Open

Wang, Josiah¹

1. Imperial College London

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments (visually salient nouns in this release) within the subtitles with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach.

Please refer to our paper for a more detailed description of the dataset:

Josiah Wang, Pranava Madhyastha, Josiel Figueiredo, Chiraag Lala, Lucia Specia (2021). MultiSubs: A Large-scale Multimodal and Multilingual Dataset. CoRR, abs/2103.01910. Available at: https://arxiv.org/abs/2103.01910

Files

Files (4.6 GB)

Name	Size
multisubs_data.zip md5:ba58ac8957322953993c75dcef19965e	2.5 GB
multisubs_images.zip md5:5c1b61390c80f98821130dde179058eb	2.2 GB
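
Because both archives are several gigabytes, it can be worth checking the downloads against the MD5 checksums listed above before unpacking them. The following is a minimal sketch, assuming both zip files were saved under their original names in the current directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and not part of the dataset release.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing on this record.
EXPECTED_MD5 = {
    "multisubs_data.zip": "ba58ac8957322953993c75dcef19965e",
    "multisubs_images.zip": "5c1b61390c80f98821130dde179058eb",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected archive name to True if the file exists in
    `directory` and its MD5 digest matches the listed checksum."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A `False` result only indicates that the file is absent or differs from the listed checksum; downloading the archive again is usually the simplest remedy.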

Additional details

Funding: European Commission
MultiMT – Multi-modal Context Modelling for Machine Translation (grant 678017)

	All versions	This version
Views	1,008	987
Downloads	126	125
Data volume	428.2 GB	426.0 GB
