Extended MM-IMDB and Ads-Parallelity datasets with features from the Google Cloud Vision API
- 1. Hosei University
- 2. CyberAgent, Inc.
Description
These are extended versions of the MM-IMDB [Arevalo+ ICLRW'17] and Ads-Parallelity [Zhang+ BMVC'18] datasets, augmented with features from the Google Cloud Vision API. The datasets are stored in JSONL (JSON Lines) format.
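Because each record is a single JSON object on its own line, the files can be loaded with the Python standard library alone. A minimal sketch (the file name `mmimdb_gcv.jsonl` is a placeholder, not the name of a file in this record):

```python
import json

# Read one JSON object per line (JSONL); the file name is a placeholder.
with open("mmimdb_gcv.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records))       # number of examples
print(records[0].keys())  # inspect the available fields
```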
Abstract (from our paper):
There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM2S2). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements in a modality, and (c) inter-modality residual attention (InterMRA) to enhance the importance of elements with modality-level granularity further. Our concept exhibits performance that is comparable to or better than the previous set-aware models. Furthermore, we demonstrate that the visualization of the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.
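The hierarchical attention described in components (b) and (c) can be illustrated with a loose NumPy sketch. Here `H`, `modality_ids`, `w_intra`, and `w_inter` are hypothetical stand-ins for the learned components; this is an illustration of the residual re-weighting idea, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def intra_mra(H, modality_ids, w_intra):
    # Re-weight each token within its own modality, keeping a residual path.
    out = H.copy()
    for m in np.unique(modality_ids):
        idx = np.where(modality_ids == m)[0]
        scores = softmax(H[idx] @ w_intra)            # one score per token
        out[idx] = H[idx] + scores[:, None] * H[idx]  # residual re-weighting
    return out

def inter_mra(H, modality_ids, w_inter):
    # Re-weight tokens by the importance of the modality they belong to.
    mods = np.unique(modality_ids)
    pooled = np.stack([H[modality_ids == m].mean(0) for m in mods])
    mod_scores = softmax(pooled @ w_inter)            # one score per modality
    out = H.copy()
    for s, m in zip(mod_scores, mods):
        idx = modality_ids == m
        out[idx] = H[idx] + s * H[idx]
    return out

# Toy example: 6 tokens from 3 modalities, 8-dim embeddings.
H = np.random.randn(6, 8)
ids = np.array([0, 0, 0, 1, 1, 2])
print(inter_mra(intra_mra(H, ids, np.random.randn(8)), ids, np.random.randn(8)).shape)
```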
Dataset (MM-IMDB and Ads-Parallelity):
We extended two multimodal datasets, namely, MM-IMDB [Arevalo+ ICLRW'17] and Ads-Parallelity [Zhang+ BMVC'18], for the empirical experiments. The MM-IMDB dataset contains 25,925 movies with multiple labels (genres). We used the original split provided with the dataset and report the F1 scores (micro, macro, and samples) on the test set. The Ads-Parallelity dataset contains 670 images and slogans from persuasive advertisements, annotated for the implicit relationship (parallel or non-parallel) between the two modalities. The task is binary classification: predicting whether the text and image in the same ad convey the same message.
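For reference, the three F1 variants reported for MM-IMDB can be computed with scikit-learn. The matrices `y_true` and `y_pred` below are hypothetical binary genre indicators (one row per movie, one column per genre), not dataset values:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical multi-label predictions: 4 movies x 3 genres.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

for avg in ("micro", "macro", "samples"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```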
We transformed the following multimodal information (i.e., visual, textual, and categorical data) into textual tokens and fed these into our proposed model. For the visual features, we used the Google Cloud Vision API to obtain four pieces of information as tokens: (1) text from OCR, (2) category labels from label detection, (3) object tags from object detection, and (4) the number of faces from face detection. The label and object detection results are input as a sequence in order of confidence, as returned by the API. We describe the visual, textual, and categorical features of each dataset below; a rough sketch of the extraction step follows the per-dataset descriptions.
MM-IMDB: We used the titles and plots of the movies as the textual features, and the aforementioned API results for the poster images as the visual features.
Ads-Parallelity: We used the same API-based visual features as in MM-IMDB. In addition, we used the transcriptions and messages as textual inputs, and the natural/text-concrete image annotations as categorical inputs.
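The extraction step described above could be reproduced roughly as follows with the `google-cloud-vision` Python client. This is a hedged sketch (authentication setup and the exact field names stored in the JSONL files are assumptions), not the exact script used to build the datasets:

```python
from google.cloud import vision

# Requires GOOGLE_APPLICATION_CREDENTIALS to point at a service-account key.
client = vision.ImageAnnotatorClient()

def extract_tokens(image_path):
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    ocr = client.text_detection(image=image).text_annotations
    labels = client.label_detection(image=image).label_annotations
    objects = client.object_localization(image=image).localized_object_annotations
    faces = client.face_detection(image=image).face_annotations

    return {
        "ocr_text": ocr[0].description if ocr else "",
        # Labels/objects are kept in the confidence order returned by the API.
        "labels": [l.description for l in labels],
        "objects": [o.name for o in objects],
        "num_faces": len(faces),
    }
```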
Files (3.1 GB)

| Size | MD5 checksum |
|---|---|
| 60.8 MB | md5:2f8d036e900f5e406b4743006d838e4d |
| 3.1 GB | md5:f731774397051bb9da040f39bd0191cd |
Additional details
Related works
- Is cited by
- Journal article: 10.1109/ACCESS.2022.3221812 (DOI)
- Is supplement to
- Preprint: 10.48550/arXiv.2209.03126 (DOI)
References
- Arevalo, et al. "Gated Multimodal Units for Information Fusion." In Proc. of ICLR Workshop, 2017.
- Zhang, et al. "Equal But Not The Same: Understanding the Implicit Relationship Between Persuasive Images and Text." In Proc. of BMVC, 2018.
- Kitada, et al. "DM2S2: Deep Multimodal Sequence Sets With Hierarchical Modality Attention." IEEE Access, vol. 10, pp. 120023-120034, 2022. doi: 10.1109/ACCESS.2022.3221812.