Multimodal3DIdent
This upload contains the Multimodal3DIdent dataset introduced in the paper "Identifiability Results for Multimodal Contrastive Learning", presented at ICLR 2023. The dataset provides an identifiability benchmark with image/text pairs generated from controllable ground truth factors, some of which are shared between the image and text modalities. The training, validation, and test sets contain 125,000, 10,000, and 10,000 image/text pairs with corresponding ground truth factors, respectively. The code for the data generation is publicly available at https://github.com/imantdaunhawer/Multimodal3DIdent.
Description
------------------
The generated dataset contains image and text data as well as the ground truth factors of variation for each modality. Each split (train/val/test) of the dataset is structured as follows:
```
.
├── images
│   ├── 000000.png
│   ├── 000001.png
│   └── etc.
├── text
│   └── text_raw.txt
├── latents_image.csv
└── latents_text.csv
```
The directories `images` and `text` contain the generated image and text data, whereas the CSV files `latents_image.csv` and `latents_text.csv` contain the values of the respective latent factors. There is an index-wise correspondence between images, sentences, and latent factors. For example, the first line in the file `text_raw.txt` is the sentence that corresponds to the first image in the `images` directory.
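The index-wise correspondence makes it straightforward to load aligned examples. Below is a minimal sketch in Python that loads the i-th image, sentence, and latent factors from a split directory; the split path in the usage line is a placeholder, and the layout assumptions follow the structure shown above rather than a verified inspection of the released archive.

```python
import os

import pandas as pd
from PIL import Image


def load_example(split_dir, index):
    """Load the image/text pair and latent factors at a given index.

    Assumes the directory layout shown above (images/, text/text_raw.txt,
    latents_image.csv, latents_text.csv) with index-wise correspondence.
    """
    # Images are named with zero-padded six-digit indices, e.g. 000000.png.
    image_path = os.path.join(split_dir, "images", f"{index:06d}.png")
    image = Image.open(image_path).convert("RGB")

    # The i-th line of text_raw.txt is the sentence for the i-th image.
    with open(os.path.join(split_dir, "text", "text_raw.txt")) as f:
        sentence = f.readlines()[index].strip()

    # Latent factors are stored row-wise in the two CSV files.
    latents_image = pd.read_csv(os.path.join(split_dir, "latents_image.csv")).iloc[index]
    latents_text = pd.read_csv(os.path.join(split_dir, "latents_text.csv")).iloc[index]

    return image, sentence, latents_image, latents_text


# Example usage (path is hypothetical):
# image, sentence, z_img, z_txt = load_example("Multimodal3DIdent/train", 0)
```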
Latent factors: We use the following ground truth latent factors to generate image and text data. Each factor is sampled from a uniform distribution defined on the specified set of values for the respective factor.
Modality | Latent Factor | Values | Details |
---|---|---|---|
Image | Object shape | {0, 1, ..., 6} | Mapped to Blender shapes like "Teapot", "Hare", etc. |
Image | Object x-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
Image | Object y-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
Image | Object z-position | {0} | Constant |
Image | Object alpha-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
Image | Object beta-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
Image | Object gamma-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
Image | Object color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
Image | Spotlight position | [0, 1]-interval | Transformed to a unique position on a semicircle |
Image | Spotlight color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
Image | Background color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
Text | Object shape | {0, 1, ..., 6} | Mapped to strings like "teapot", "hare", etc. |
Text | Object x-position | {0, 1, 2} | Mapped to strings "left", "center", "right" |
Text | Object y-position | {0, 1, 2} | Mapped to strings "top", "mid", "bottom" |
Text | Object color | string values | Color names from 3 different color palettes |
Text | Text phrasing | {0, 1, ..., 4} | Mapped to 5 different English sentences |
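As a rough illustration of the sampling scheme, the sketch below draws one set of image latents from the supports listed in the table above. All factors are sampled independently here for simplicity; the dependence of object color on x-position described under "Relation between modalities" below is ignored, and the factor names are illustrative rather than the CSV column names.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_image_latents():
    """Draw one set of image latents from the supports listed in the table.

    All factors are sampled independently here; in the actual dataset the
    object hue depends on the x-position (see below).
    """
    return {
        "object_shape": rng.integers(0, 7),    # {0, ..., 6}
        "object_xpos": rng.integers(0, 3),     # {0, 1, 2} -> {-3, 0, 3} in Blender
        "object_ypos": rng.integers(0, 3),     # {0, 1, 2} -> {-3, 0, 3} in Blender
        "object_zpos": 0,                      # constant
        "object_alpha": rng.uniform(0, 1),     # -> [-pi/2, pi/2]
        "object_beta": rng.uniform(0, 1),      # -> [-pi/2, pi/2]
        "object_gamma": rng.uniform(0, 1),     # -> [-pi/2, pi/2]
        "object_color": rng.uniform(0, 1),     # hue in HSV
        "spotlight_pos": rng.uniform(0, 1),    # position on a semicircle
        "spotlight_color": rng.uniform(0, 1),  # hue in HSV
        "background_color": rng.uniform(0, 1), # hue in HSV
    }


print(sample_image_latents())
```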
Image rendering: We use the Blender rendering engine to create visually complex images depicting a 3D scene. Each image in the dataset shows a colored 3D object of a certain shape or class (i.e., teapot, hare, cow, armadillo, dragon, horse, or head) in front of a colored background and illuminated by a colored spotlight that is focused on the object and located on a semicircle above the scene. The resulting RGB images are of size 224 x 224 x 3.
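The spotlight-position factor in [0, 1] is mapped to a unique point on a semicircle above the scene. The exact transformation is defined in the data-generation code; the sketch below shows one plausible parameterization, where the radius, height of the semicircle plane, and angle convention are assumptions for illustration only.

```python
import numpy as np


def spotlight_position(factor, radius=6.0):
    """Map a factor in [0, 1] to a point on a semicircle above the scene.

    The radius and angle convention are illustrative assumptions; the actual
    mapping is defined in the Multimodal3DIdent generation code.
    """
    angle = np.pi * factor              # [0, 1] -> [0, pi], i.e. a semicircle
    x = radius * np.cos(angle)          # horizontal offset from the object
    z = radius * np.sin(angle)          # height above the scene
    return np.array([x, 0.0, z])


print(spotlight_position(0.5))          # factor 0.5 -> apex of the semicircle
```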
Text generation: We generate a short sentence describing the respective scene. Each sentence describes the object's shape or class (e.g., teapot), position (e.g., bottom-left), and color. The color is represented in a human-readable form (e.g., "lawngreen", "xkcd:bright aqua", etc.) as the name of the color (from a randomly sampled palette) that is closest to the sampled color value in RGB space. The sentence is constructed from one of five pre-configured phrases with placeholders for the respective ground truth factors.
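The nearest-color-name lookup can be reproduced with standard named palettes, for example those shipped with matplotlib. The sketch below finds the palette name closest in RGB space to a sampled hue; the choice of matplotlib's CSS4 and xkcd palettes and the use of Euclidean distance in RGB are assumptions for illustration and may differ from the palettes and metric used to generate the dataset.

```python
import colorsys

import matplotlib.colors as mcolors
import numpy as np


def closest_color_name(hue, palette):
    """Return the palette color name closest in RGB space to a given hue."""
    rgb = np.array(colorsys.hsv_to_rgb(hue, 1.0, 1.0))  # hue -> RGB (full S, V)
    names = list(palette)
    dists = [np.linalg.norm(rgb - np.array(mcolors.to_rgb(palette[n]))) for n in names]
    return names[int(np.argmin(dists))]


# Two of several possible palettes; e.g. "lawngreen" is a CSS4 color name,
# while "xkcd:bright aqua" comes from the xkcd palette.
print(closest_color_name(0.3, mcolors.CSS4_COLORS))
print(closest_color_name(0.5, mcolors.XKCD_COLORS))
```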
Relation between modalities: Three latent factors (object shape, x-position, y-position) are shared between image/text pairs. The object color also exhibits a dependence between modalities; however, it is not a 1-to-1 correspondence because the color palette is sampled randomly from a set of multiple palettes. Additionally, there is a causal dependence of object color on object x-position since the range of hue values [0, 1] is split into three equally sized intervals, each of which is associated with a fixed x-position of the object. For instance, if x-position is “left”, we sample the hue value from the interval [0, 1/3]. Consequently, the color of the object can be predicted to some degree from the object's position.
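A minimal sketch of this dependence, assuming the hue range [0, 1] is partitioned into thirds indexed by the x-position factor (0 = left, 1 = center, 2 = right, following the table above):

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_hue_given_xpos(xpos):
    """Sample the object hue from the third of [0, 1] tied to the x-position.

    xpos in {0, 1, 2}: 0 -> [0, 1/3], 1 -> [1/3, 2/3], 2 -> [2/3, 1].
    """
    low = xpos / 3.0
    high = (xpos + 1) / 3.0
    return rng.uniform(low, high)


xpos = rng.integers(0, 3)           # x-position sampled uniformly
hue = sample_hue_given_xpos(xpos)   # object color depends on x-position
print(xpos, hue)
```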
Acknowledgements
-------------------------------
The Multimodal3DIdent dataset builds on the following resources:
- 3DIdent dataset
- Causal3DIdent dataset
- CLEVR dataset
- Blender open-source 3D creation suite
Files
-------------------------------
The dataset is distributed as a single archive of 4.1 GB (md5: 565f25be244338e0f702084aa7e5d382).
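After downloading, the archive can be checked against the MD5 checksum listed above. A minimal sketch follows; the archive filename is a placeholder and should be replaced with the name of the downloaded file.

```python
import hashlib


def md5sum(path, chunk_size=2**20):
    """Compute the MD5 checksum of a file in streaming fashion."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Filename is hypothetical; replace with the downloaded archive's actual name.
print(md5sum("multimodal3dident.zip") == "565f25be244338e0f702084aa7e5d382")
```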
Additional details
-------------------------------
Related works:
- Is supplement to: 10.48550/arXiv.2303.09166 (DOI)