MultiSubs: A Large-scale Multimodal and Multilingual Dataset
Description
MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments within the subtitles (visually salient nouns in this release) with web images, and the word sense of each fragment has been disambiguated using a cross-lingual approach.
Please refer to our paper for a more detailed description of the dataset:
Josiah Wang, Pranava Madhyastha, Josiel Figueiredo, Chiraag Lala, Lucia Specia (2021). MultiSubs: A Large-scale Multimodal and Multilingual Dataset. CoRR, abs/2103.01910. Available at: https://arxiv.org/abs/2103.01910
Files
multisubs_data.zip
Total size: 4.6 GB
Size | MD5 |
---|---|
2.5 GB | ba58ac8957322953993c75dcef19965e |
2.2 GB | 5c1b61390c80f98821130dde179058eb |
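The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_download` and the local path are illustrative assumptions, not part of the dataset release. Because the listing does not show which checksum belongs to which file name, the usage example only checks that the digest matches one of the two published values.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing; the mapping to file names is not shown there.
PUBLISHED_MD5 = {
    "ba58ac8957322953993c75dcef19965e",
    "5c1b61390c80f98821130dde179058eb",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    archive = Path("multisubs_data.zip")
    if archive.exists():
        digest = md5_of_file(archive)
        status = "OK" if digest in PUBLISHED_MD5 else "MISMATCH"
        print(f"{archive}: {digest} {status}")
```

A mismatch usually means the download was truncated or corrupted, in which case downloading the file again is the simplest remedy.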