Multimodal3DIdent
This upload contains the Multimodal3DIdent dataset introduced in the paper "Identifiability Results for Multimodal Contrastive Learning", presented at ICLR 2023. The dataset provides an identifiability benchmark with image/text pairs generated from controllable ground truth factors, some of which are shared between the image and text modalities. The training, validation, and test sets contain 125,000, 10,000, and 10,000 image/text pairs with corresponding ground truth factors, respectively. The code for the data generation is publicly available at https://github.com/imantdaunhawer/Multimodal3DIdent.
Description
------------------
The generated dataset contains image and text data as well as the ground truth factors of variation for each modality. Each split (train/val/test) of the dataset is structured as follows:
```
.
├── images
│   ├── 000000.png
│   ├── 000001.png
│   └── etc.
├── text
│   └── text_raw.txt
├── latents_image.csv
└── latents_text.csv
```
The directories `images` and `text` contain the generated image and text data, whereas the CSV files `latents_image.csv` and `latents_text.csv` contain the values of the respective latent factors. There is an index-wise correspondence between images, sentences, and latent factors. For example, the first line in the file `text_raw.txt` is the sentence that corresponds to the first image in the `images` directory.
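The index-wise correspondence makes it straightforward to load aligned examples. Below is a minimal sketch in Python that loads the i-th image, sentence, and latent factors from a split directory; the split path in the usage line is a placeholder, and the layout assumptions follow the structure shown above rather than a verified inspection of the released archive.

```python
import os

import pandas as pd
from PIL import Image


def load_example(split_dir, index):
    """Load the image/text pair and latent factors at a given index.

    Assumes the directory layout shown above (images/, text/text_raw.txt,
    latents_image.csv, latents_text.csv) with index-wise correspondence.
    """
    # Images are named with zero-padded six-digit indices, e.g. 000000.png.
    image_path = os.path.join(split_dir, "images", f"{index:06d}.png")
    image = Image.open(image_path).convert("RGB")

    # The i-th line of text_raw.txt is the sentence for the i-th image.
    with open(os.path.join(split_dir, "text", "text_raw.txt")) as f:
        sentence = f.readlines()[index].strip()

    # Latent factors are stored row-wise in the two CSV files.
    latents_image = pd.read_csv(os.path.join(split_dir, "latents_image.csv")).iloc[index]
    latents_text = pd.read_csv(os.path.join(split_dir, "latents_text.csv")).iloc[index]

    return image, sentence, latents_image, latents_text


# Example usage (path is hypothetical):
# image, sentence, z_img, z_txt = load_example("Multimodal3DIdent/train", 0)
```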
Latent factors: We use the following ground truth latent factors to generate image and text data. Each factor is sampled from a uniform distribution defined on the specified set of values for the respective factor.
Modality | Latent Factor | Values | Details |
---|---|---|---|
Image | Object shape | {0, 1, ..., 6} | Mapped to Blender shapes like "Teapot", "Hare", etc. |
Image | Object x-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
Image | Object y-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
Image | Object z-position | {0} | Constant |
Image | Object alpha-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
Image | Object beta-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
Image | Object gamma-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
Image | Object color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
Image | Spotlight position | [0, 1]-interval | Transformed to a unique position on a semicircle |
Image | Spotlight color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
Image | Background color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
Text | Object shape | {0, 1, ..., 6} | Mapped to strings like "teapot", "hare", etc. |
Text | Object x-position | {0, 1, 2} | Mapped to strings "left", "center", "right" |
Text | Object y-position | {0, 1, 2} | Mapped to strings "top", "mid", "bottom" |
Text | Object color | string values | Color names from 3 different color palettes |
Text | Text phrasing | {0, 1, ..., 4} | Mapped to 5 different English sentences |
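As a rough illustration of the sampling scheme, the sketch below draws one set of image latents from the supports listed in the table above. All factors are sampled independently here for simplicity; the dependence of object color on x-position described under "Relation between modalities" below is ignored, and the factor names are illustrative rather than the CSV column names.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_image_latents():
    """Draw one set of image latents from the supports listed in the table.

    All factors are sampled independently here; in the actual dataset the
    object hue depends on the x-position (see below).
    """
    return {
        "object_shape": rng.integers(0, 7),    # {0, ..., 6}
        "object_xpos": rng.integers(0, 3),     # {0, 1, 2} -> {-3, 0, 3} in Blender
        "object_ypos": rng.integers(0, 3),     # {0, 1, 2} -> {-3, 0, 3} in Blender
        "object_zpos": 0,                      # constant
        "object_alpha": rng.uniform(0, 1),     # -> [-pi/2, pi/2]
        "object_beta": rng.uniform(0, 1),      # -> [-pi/2, pi/2]
        "object_gamma": rng.uniform(0, 1),     # -> [-pi/2, pi/2]
        "object_color": rng.uniform(0, 1),     # hue in HSV
        "spotlight_pos": rng.uniform(0, 1),    # position on a semicircle
        "spotlight_color": rng.uniform(0, 1),  # hue in HSV
        "background_color": rng.uniform(0, 1), # hue in HSV
    }


print(sample_image_latents())
```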
Image rendering: We use the Blender rendering engine to create visually complex images depicting a 3D scene. Each image in the dataset shows a colored 3D object of a certain shape or class (i.e., teapot, hare, cow, armadillo, dragon, horse, or head) in front of a colored background and illuminated by a colored spotlight that is focused on the object and located on a semicircle above the scene. The resulting RGB images are of size 224 x 224 x 3.
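The spotlight-position factor in [0, 1] is mapped to a unique point on a semicircle above the scene. The exact transformation is defined in the data-generation code; the sketch below shows one plausible parameterization, where the radius, height of the semicircle plane, and angle convention are assumptions for illustration only.

```python
import numpy as np


def spotlight_position(factor, radius=6.0):
    """Map a factor in [0, 1] to a point on a semicircle above the scene.

    The radius and angle convention are illustrative assumptions; the actual
    mapping is defined in the Multimodal3DIdent generation code.
    """
    angle = np.pi * factor              # [0, 1] -> [0, pi], i.e. a semicircle
    x = radius * np.cos(angle)          # horizontal offset from the object
    z = radius * np.sin(angle)          # height above the scene
    return np.array([x, 0.0, z])


print(spotlight_position(0.5))          # factor 0.5 -> apex of the semicircle
```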
Text generation: We generate a short sentence describing the respective scene. Each sentence describes the object's shape or class (e.g., teapot), position (e.g., bottom-left), and color. The color is represented in a human-readable form (e.g., "lawngreen", "xkcd:bright aqua", etc.) as the name of the color (from a randomly sampled palette) that is closest to the sampled color value in RGB space. The sentence is constructed from one of five pre-configured phrases with placeholders for the respective ground truth factors.
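The nearest-color-name lookup can be reproduced with standard named palettes, for example those shipped with matplotlib. The sketch below finds the palette name closest in RGB space to a sampled hue; the choice of matplotlib's CSS4 and xkcd palettes and the use of Euclidean distance in RGB are assumptions for illustration and may differ from the palettes and metric used to generate the dataset.

```python
import colorsys

import matplotlib.colors as mcolors
import numpy as np


def closest_color_name(hue, palette):
    """Return the palette color name closest in RGB space to a given hue."""
    rgb = np.array(colorsys.hsv_to_rgb(hue, 1.0, 1.0))  # hue -> RGB (full S, V)
    names = list(palette)
    dists = [np.linalg.norm(rgb - np.array(mcolors.to_rgb(palette[n]))) for n in names]
    return names[int(np.argmin(dists))]


# Two of several possible palettes; e.g. "lawngreen" is a CSS4 color name,
# while "xkcd:bright aqua" comes from the xkcd palette.
print(closest_color_name(0.3, mcolors.CSS4_COLORS))
print(closest_color_name(0.5, mcolors.XKCD_COLORS))
```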
Relation between modalities: Three latent factors (object shape, x-position, y-position) are shared between image/text pairs. The object color also exhibits a dependence between modalities; however, it is not a 1-to-1 correspondence because the color palette is sampled randomly from a set of multiple palettes. Additionally, there is a causal dependence of object color on object x-position since the range of hue values [0, 1] is split into three equally sized intervals, each of which is associated with a fixed x-position of the object. For instance, if x-position is “left”, we sample the hue value from the interval [0, 1/3]. Consequently, the color of the object can be predicted to some degree from the object's position.
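A minimal sketch of this dependence, assuming the hue range [0, 1] is partitioned into thirds indexed by the x-position factor (0 = left, 1 = center, 2 = right, following the table above):

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_hue_given_xpos(xpos):
    """Sample the object hue from the third of [0, 1] tied to the x-position.

    xpos in {0, 1, 2}: 0 -> [0, 1/3], 1 -> [1/3, 2/3], 2 -> [2/3, 1].
    """
    low = xpos / 3.0
    high = (xpos + 1) / 3.0
    return rng.uniform(low, high)


xpos = rng.integers(0, 3)           # x-position sampled uniformly
hue = sample_hue_given_xpos(xpos)   # object color depends on x-position
print(xpos, hue)
```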
Acknowledgements
-------------------------------
The Multimodal3DIdent dataset builds on the following resources:
- 3DIdent dataset
- Causal3DIdent dataset
- CLEVR dataset
- Blender open-source 3D creation suite
Files
-------------------------------
The dataset is distributed as a single archive of 4.1 GB (md5: 565f25be244338e0f702084aa7e5d382).
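After downloading, the archive can be checked against the MD5 checksum listed above. A minimal sketch follows; the archive filename is a placeholder and should be replaced with the name of the downloaded file.

```python
import hashlib


def md5sum(path, chunk_size=2**20):
    """Compute the MD5 checksum of a file in streaming fashion."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Filename is hypothetical; replace with the downloaded archive's actual name.
print(md5sum("multimodal3dident.zip") == "565f25be244338e0f702084aa7e5d382")
```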
Additional details
-------------------------------
Related works:
- Is supplement to: 10.48550/arXiv.2303.09166 (DOI)