
Published March 15, 2023 | Version v1
Dataset | Open Access

Multimodal3DIdent

Description

This upload contains the Multimodal3DIdent dataset introduced in the paper Identifiability Results for Multimodal Contrastive Learning, presented at ICLR 2023. The dataset provides an identifiability benchmark with image/text pairs generated from controllable ground truth factors, some of which are shared between the image and text modalities. The training, validation, and test sets contain 125,000, 10,000, and 10,000 image/text pairs, respectively, together with their ground truth factors. The code for data generation is publicly available at https://github.com/imantdaunhawer/Multimodal3DIdent.

Description
------------------

The generated dataset contains image and text data as well as the ground truth factors of variation for each modality. Each split (train/val/test) of the dataset is structured as follows:

.
├── images
│   ├── 000000.png
│   ├── 000001.png
│   └── etc.
├── text
│   └── text_raw.txt
├── latents_image.csv
└── latents_text.csv

The directories images and text contain the generated image and text data, whereas the CSV files latents_image.csv and latents_text.csv contain the values of the respective latent factors. There is an index-wise correspondence between images, sentences, and latent factors. For example, the first line in the file text_raw.txt is the sentence that corresponds to the first image in the images directory.
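
For illustration, here is a minimal Python sketch of how the i-th image, its sentence, and the corresponding latent factors could be loaded together. The file layout follows the listing above; the function name, the split path in the usage example, and reading the CSV rows as plain dictionaries are assumptions for illustration, not part of the dataset's official tooling.

# Minimal sketch: load the i-th image/text pair and its latent factors.
# Assumes the split directory layout shown above; CSV columns are read as-is.
import csv
from pathlib import Path
from PIL import Image

def load_pair(split_dir, index):
    split_dir = Path(split_dir)

    # Images are numbered with zero-padded six-digit file names.
    image = Image.open(split_dir / "images" / f"{index:06d}.png")

    # The i-th line of text_raw.txt is the sentence for the i-th image.
    with open(split_dir / "text" / "text_raw.txt") as f:
        sentence = f.read().splitlines()[index]

    # Latent factors are stored row-wise with an index-wise correspondence.
    with open(split_dir / "latents_image.csv", newline="") as f:
        latents_image = list(csv.DictReader(f))[index]
    with open(split_dir / "latents_text.csv", newline="") as f:
        latents_text = list(csv.DictReader(f))[index]

    return image, sentence, latents_image, latents_text

# Example usage (path is hypothetical):
# image, sentence, z_img, z_txt = load_pair("Multimodal3DIdent/train", 0)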

Latent factors: We use the following ground truth latent factors to generate image and text data. Each factor is sampled from a uniform distribution defined on the specified set of values for the respective factor; a sketch of the listed transformations is given after the table.

Modality | Latent Factor         | Values          | Details
---------|-----------------------|-----------------|-----------------------------------------------------
Image    | Object shape          | {0, 1, ..., 6}  | Mapped to Blender shapes like "Teapot", "Hare", etc.
Image    | Object x-position     | {0, 1, 2}       | Mapped to {-3, 0, 3} for Blender
Image    | Object y-position     | {0, 1, 2}       | Mapped to {-3, 0, 3} for Blender
Image    | Object z-position     | {0}             | Constant
Image    | Object alpha-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender
Image    | Object beta-rotation  | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender
Image    | Object gamma-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender
Image    | Object color          | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender
Image    | Spotlight position    | [0, 1]-interval | Transformed to a unique position on a semicircle
Image    | Spotlight color       | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender
Image    | Background color      | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender
Text     | Object shape          | {0, 1, ..., 6}  | Mapped to strings like "teapot", "hare", etc.
Text     | Object x-position     | {0, 1, 2}       | Mapped to strings "left", "center", "right"
Text     | Object y-position     | {0, 1, 2}       | Mapped to strings "top", "mid", "bottom"
Text     | Object color          | string values   | Color names from 3 different color palettes
Text     | Text phrasing         | {0, 1, ..., 4}  | Mapped to 5 different English sentences
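
As a rough illustration of the transformations listed in the Details column, the following Python sketch decodes raw image latents into rendering quantities. The dictionary key names, the assumption of full saturation and value in HSV, and the semicircle radius are illustrative only; the exact conventions are defined in the linked code repository.

# Illustrative decoding of the image latents listed above into the quantities
# used for rendering. Key names and the semicircle parametrization are
# assumptions, not the dataset's exact conventions.
import colorsys
import math

POSITIONS = {0: -3.0, 1: 0.0, 2: 3.0}  # position index -> Blender coordinate

def decode_image_latents(z):
    """z: dict of raw latent values with hypothetical keys."""
    return {
        "object_xpos": POSITIONS[z["object_xpos"]],
        "object_ypos": POSITIONS[z["object_ypos"]],
        "object_zpos": 0.0,  # constant
        # Rotations: [0, 1] mapped linearly to [-pi/2, pi/2].
        "object_alpha": (z["object_alpha"] - 0.5) * math.pi,
        "object_beta": (z["object_beta"] - 0.5) * math.pi,
        "object_gamma": (z["object_gamma"] - 0.5) * math.pi,
        # Colors: hue in [0, 1] interpreted in HSV (full saturation and value
        # assumed here) and converted to RGB.
        "object_rgb": colorsys.hsv_to_rgb(z["object_color"], 1.0, 1.0),
        "spotlight_rgb": colorsys.hsv_to_rgb(z["spotlight_color"], 1.0, 1.0),
        "background_rgb": colorsys.hsv_to_rgb(z["background_color"], 1.0, 1.0),
        # Spotlight position: [0, 1] mapped to a point on a semicircle above
        # the scene (radius chosen arbitrarily for illustration).
        "spotlight_xy": (3.0 * math.cos(math.pi * z["spotlight_pos"]),
                         3.0 * math.sin(math.pi * z["spotlight_pos"])),
    }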


Image rendering: We use the Blender rendering engine to create visually complex images depicting a 3D scene. Each image in the dataset shows a colored 3D object of a certain shape or class (i.e., teapot, hare, cow, armadillo, dragon, horse, or head) in front of a colored background and illuminated by a colored spotlight that is focused on the object and located on a semicircle above the scene. The resulting RGB images are of size 224 x 224 x 3.

Text generation: We generate a short sentence describing the respective scene. Each sentence describes the object's shape or class (e.g., teapot), position (e.g., bottom-left), and color. The color is represented in a human-readable form (e.g., "lawngreen", "xkcd:bright aqua", etc.) as the name of the color (from a randomly sampled palette) that is closest to the sampled color value in RGB space. The sentence is constructed from one of five pre-configured phrases with placeholders for the respective ground truth factors.
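
The names "lawngreen" and "xkcd:bright aqua" suggest matplotlib's named color tables. Assuming these (the exact palettes, sentence phrasings, and distance metric used for generation are defined in the linked repository), the closest-color-name lookup can be sketched as follows.

# Sketch: pick the human-readable color name closest to a sampled hue in RGB
# space, then fill a sentence template. Palettes and template wording are
# illustrative, not the exact ones used to generate the dataset.
import colorsys
import matplotlib.colors as mcolors

PALETTES = [mcolors.CSS4_COLORS, mcolors.XKCD_COLORS, mcolors.TABLEAU_COLORS]
TEMPLATE = "A {color} {shape} is at the {ypos}-{xpos} of the image."  # hypothetical phrasing

def describe(hue, shape, xpos, ypos, palette_id):
    rgb = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    palette = PALETTES[palette_id]
    # Nearest named color by squared Euclidean distance in RGB space.
    name = min(palette, key=lambda n: sum((a - b) ** 2
               for a, b in zip(mcolors.to_rgb(palette[n]), rgb)))
    return TEMPLATE.format(color=name, shape=shape, xpos=xpos, ypos=ypos)

# Example usage (values are hypothetical):
# describe(0.31, "teapot", "left", "bottom", palette_id=1)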

Relation between modalities: Three latent factors (object shape, x-position, y-position) are shared between image/text pairs. The object color also exhibits a dependence between modalities; however, it is not a 1-to-1 correspondence because the color palette is sampled randomly from a set of multiple palettes. Additionally, there is a causal dependence of object color on object x-position since the range of hue values [0, 1] is split into three equally sized intervals, each of which is associated with a fixed x-position of the object. For instance, if x-position is “left”, we sample the hue value from the interval [0, 1/3]. Consequently, the color of the object can be predicted to some degree from the object's position.
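
A small sketch of this dependence, assuming that x-position indices 0, 1, 2 correspond to "left", "center", "right" as in the table above; everything beyond the interval split is illustrative.

# Sketch of the described dependence: the object's hue is drawn from one of
# three equally sized sub-intervals of [0, 1], selected by its x-position.
import random

random.seed(0)

def sample_object_color(xpos):
    """xpos in {0, 1, 2}, corresponding to "left", "center", "right"."""
    low, high = xpos / 3.0, (xpos + 1) / 3.0  # e.g., "left" (0) -> [0, 1/3]
    return random.uniform(low, high)          # hue value for the object color

xpos = random.choice([0, 1, 2])    # x-position sampled uniformly
hue = sample_object_color(xpos)    # object color now depends on x-position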

 

Acknowledgements
-------------------------------

The Multimodal3DIdent dataset builds on the following resources:
- 3DIdent dataset
- Causal3DIdent dataset
- CLEVR dataset
- Blender open-source 3D creation suite

Files (4.1 GB)
-------------------------------

md5:565f25be244338e0f702084aa7e5d382

Additional details
-------------------------------

Related works: Is supplement to 10.48550/arXiv.2303.09166 (DOI)